amazon-science / robust-tableqa

Two approaches for robust TableQA: 1) ITR is a general-purpose retrieval-based approach for handling long tables in TableQA transformer models. 2) LI-RAGE is a robust framework for open-domain TableQA which addresses several limitations. (ACL 2023)

Cannot reproduce the paper results #5

Open wangzhen263 opened 3 months ago

wangzhen263 commented 3 months ago

Hi,

I followed your scripts to train and test the model, but I could not reproduce your paper results: the numbers are far below those reported. Could you help me reproduce the paper results?

I ran the following two scripts:

Train:

python src/main.py configs/nq_tables/colbert.jsonnet \
    --accelerator gpu --devices 2 --strategy ddp --num_sanity_val_steps 2 \
    --experiment_name ColBERT_NQTables_bz4_negative4_fix_doclen_full_search_NewcrossGPU \
    --mode train --override \
    --opts train.batch_size=6 train.scheduler=None train.epochs=1000 train.lr=0.00001 \
          train.additional.gradient_accumulation_steps=4 train.additional.warmup_steps=0 \
          train.additional.early_stop_patience=10 train.additional.save_top_k=3 \
          valid.batch_size=32 test.batch_size=32 valid.step_size=200 \
          data_loader.dummy_dataloader=0 reset=1 model_config.num_negative_samples=4 \
          model_config.bm25_top_k=5 model_config.bm25_ratio=0 model_config.nbits=2

Test:

python src/main.py configs/nq_tables/colbert.jsonnet \
    --accelerator gpu --devices 1 --strategy ddp \
    --experiment_name ColBERT_NQTables_bz4_negative4_fix_doclen_full_search_NewcrossGPU \
    --mode test --test_evaluation_name nq_tables_all \
    --opts test.batch_size=32 test.load_epoch=5427 model_config.nbits=8

Output results:

[INFO] - trainers.metrics_processors : Running metrics {'name': 'compute_RAG_retrieval_results'}...
[INFO] - trainers.metrics_processors : Running metrics {'name': 'compute_token_f1'}...
{'official_sets_test/RAGNQTablesDataset.test/denotation_accuracy': 0.0031282586027111575, 'official_sets_test/RAGNQTablesDataset.test/epoch': 0, 'official_sets_test/RAGNQTablesDataset.test/n_retrieved_docs': 5, 'official_sets_test/RAGNQTablesDataset.test/precision': 0.04275286757038582, 'official_sets_test/RAGNQTablesDataset.test/recall': 0.21376433785192908, 'official_sets_test/RAGNQTablesDataset.test/token_f1': 0.0020855057351407717}
{'predictions/step_0_MODE(test)_SET(official_sets_test/RAGNQTablesDataset.test)_rank(0)': <wandb.data_types.Table object at 0x7f65332f5490>}
[INFO] - trainers.RAG_executor : Evaluation results [test]: {'official_sets_test/RAGNQTablesDataset.test/denotation_accuracy': 0.0031282586027111575, 'official_sets_test/RAGNQTablesDataset.test/recall': 0.21376433785192908, 'official_sets_test/RAGNQTablesDataset.test/precision': 0.04275286757038582, 'official_sets_test/RAGNQTablesDataset.test/n_retrieved_docs': 5, 'official_sets_test/RAGNQTablesDataset.test/token_f1': 0.0020855057351407717, 'official_sets_test/RAGNQTablesDataset.test/epoch': 0}
Testing DataLoader 1: 100%|███████████████| 959/959 [07:27<00:00, 2.14it/s]

────────────────────────────────────────────────────────────────────────────────────────────
Test metric                                                           DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────
official_sets_test/RAGNQTablesDataset.test/denotation_accuracy        0.0031282585114240646
official_sets_test/RAGNQTablesDataset.test/epoch                      0.0
official_sets_test/RAGNQTablesDataset.test/n_retrieved_docs           5.0
official_sets_test/RAGNQTablesDataset.test/precision                  0.04275286942720413
official_sets_test/RAGNQTablesDataset.test/recall                     0.21376433968544006
official_sets_test/RAGNQTablesDataset.test/token_f1                   0.0020855057518929243
official_sets_test/RAGNQTablesDataset.validation/denotation_accuracy  0.0
official_sets_test/RAGNQTablesDataset.validation/epoch                0.0
official_sets_test/RAGNQTablesDataset.validation/n_retrieved_docs     5.0
official_sets_test/RAGNQTablesDataset.validation/precision            0.04517338424921036
official_sets_test/RAGNQTablesDataset.validation/recall               0.2258669137954712
official_sets_test/RAGNQTablesDataset.validation/token_f1             0.0
────────────────────────────────────────────────────────────────────────────────────────────
Test metric                                                           DataLoader 1
────────────────────────────────────────────────────────────────────────────────────────────
official_sets_test/RAGNQTablesDataset.test/denotation_accuracy        0.0031282585114240646
official_sets_test/RAGNQTablesDataset.test/epoch                      0.0
official_sets_test/RAGNQTablesDataset.test/n_retrieved_docs           5.0
official_sets_test/RAGNQTablesDataset.test/precision                  0.04275286942720413
official_sets_test/RAGNQTablesDataset.test/recall                     0.21376433968544006
official_sets_test/RAGNQTablesDataset.test/token_f1                   0.0020855057518929243
official_sets_test/RAGNQTablesDataset.validation/denotation_accuracy  0.0
official_sets_test/RAGNQTablesDataset.validation/epoch                0.0
official_sets_test/RAGNQTablesDataset.validation/n_retrieved_docs     5.0
official_sets_test/RAGNQTablesDataset.validation/precision            0.04517338424921036
official_sets_test/RAGNQTablesDataset.validation/recall               0.2258669137954712
official_sets_test/RAGNQTablesDataset.validation/token_f1             0.0
────────────────────────────────────────────────────────────────────────────────────────────

LinWeizheDragon commented 3 months ago

Hi, the numbers indicate that something went wrong. Below is the test-set performance during training (my reproduction before publishing the code): [image]

I don't see a problem in the script you just ran. Could you please start by checking carefully if every component (dataset, preprocessing, data sent into the model, the ColBERT engine, and the evaluation) ran as expected?

wangzhen263 commented 3 months ago

Hi, do you use the original pre-trained colbertv2.0 checkpoint at TableQA_data/checkpoints/colbertv2.0? Did you download it from https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file?

Furthermore, in configs/nq_tables/frozen_rag.jsonnet there is:

local index_files = {
  "index_passages_path": "DPR_NQTables_train_bz8_gc_4_crossGPU/test/nq_tables_all/step_2039/table_dataset",
  "index_path": "DPR_NQTables_train_bz8_gc_4_crossGPU/test/nq_tables_all/step_2039/table_dataset_hnsw_index.faiss",
};

How do you get "step_2039"? After running test script, I only get step_0.
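For what it's worth, the step number in such paths normally comes from the trainer's global_step at the time the checkpoint was saved, and that value survives inside the checkpoint file itself. A minimal sketch of reading it back, assuming a standard pytorch-lightning checkpoint layout (top-level `global_step` key next to `state_dict`):

```python
import torch

def checkpoint_global_step(ckpt_path):
    """Read the recorded global_step from a pytorch-lightning checkpoint.

    Lightning stores trainer state ('global_step', 'epoch', ...) at the
    top level of the checkpoint dict, alongside 'state_dict'.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    return ckpt.get("global_step", 0)
```

If this prints the expected step (e.g. 2039) for the checkpoint you pass to the test script, the file is fine and only the runtime step counter is being reset to 0.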

LinWeizheDragon commented 3 months ago

Yes.

This is possibly because pytorch-lightning changed its design in a past upgrade: when loading a checkpoint for testing, it no longer restores global_step from the checkpoint, which leads to the 0 here. If you are sure the correct checkpoint is loaded, this should not be a problem.
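One way to be sure the correct checkpoint was actually loaded, independently of what global_step reads: compare the model's current weights against the checkpoint on disk. This is a hypothetical helper, not part of the repo's code:

```python
import torch

def weights_match(model, ckpt_path):
    """Sanity check: do the model's current weights equal the checkpoint's?

    Useful when global_step reads 0 and the logs no longer tell you which
    checkpoint was restored.
    """
    state = torch.load(ckpt_path, map_location="cpu")["state_dict"]
    model_state = model.state_dict()
    return all(
        name in model_state and torch.equal(model_state[name].cpu(), tensor)
        for name, tensor in state.items()
    )
```

If this returns False for the checkpoint you intended to test, the low metrics are explained by an untrained or wrong model being evaluated.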

wangzhen263 commented 3 months ago

I think the trainer may not be attached, which is why global_step is 0. That could also be why my test metrics are low.

https://pytorch-lightning.readthedocs.io/en/1.8.6/api/pytorch_lightning.core.LightningModule.html?highlight=global_step#pytorch_lightning.core.LightningModule.global_step

[Screenshot of the LightningModule.global_step documentation, 2024-03-19]

I am not familiar with pytorch-lightning.
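For readers unfamiliar with pytorch-lightning, a minimal mock (not the actual Lightning source) of the behaviour being discussed: `LightningModule.global_step` delegates to the attached `Trainer`, so a module with no trainer attached reads 0 regardless of what the loaded checkpoint recorded.

```python
class TrainerStub:
    """Stand-in for pytorch_lightning.Trainer carrying only the step counter."""
    def __init__(self, global_step=0):
        self.global_step = global_step

class ModuleStub:
    """Stand-in for LightningModule: global_step is read from the trainer."""
    def __init__(self):
        self.trainer = None  # set by Trainer.fit()/test() in real Lightning

    @property
    def global_step(self):
        # No trainer attached (e.g. module used standalone) -> reads as 0.
        return self.trainer.global_step if self.trainer is not None else 0

module = ModuleStub()
print(module.global_step)  # -> 0: no trainer attached
module.trainer = TrainerStub(global_step=2039)
print(module.global_step)  # -> 2039 once a trainer is attached
```

So a global_step of 0 by itself only tells you the counter was not restored; it does not prove the weights were loaded incorrectly.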