Reproducibility Problem

hi-i-m-GTooth commented 7 months ago

Hi, Dr. Zhuang. Sorry to bother you again so soon.

Here is my question: I tried to reproduce DSI-QG's HIT@1 on XOR QA 100k, but my results differ from the paper. The HIT@1 curve should supposedly look like Fig. 2 in the paper.

Because of limited computation power, I trained the models with 2 × A6000 GPUs (48 GB each). Since the CUDA version is 12.2, I installed PyTorch 2.1.0. The server OS is Ubuntu 22.04.3 LTS with a 64-core CPU. Below are the scripts I executed:

1. Generate Queries with Given QG Model Checkpoints

I downloaded the QG model checkpoint provided by Dr. Zhuang (xor-tydi-docTquery-mt5-large) and placed it in the models directory. Then I ran the script below, following step 2 of the README:

LOCAL_RANK=0,1 python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 29501 --use-env run.py \
                --task generation \
                --model_name google/mt5-large \
                --model_path models/xor-tydi-docTquery-mt5-large \
                --per_device_eval_batch_size 16 \
                --run_name docTquery-XORQA-generation \
                --max_length 256 \
                --valid_file data/xorqa_data/100k/xorqa_corpus.tsv \
                --output_dir temp \
                --dataloader_num_workers 10 \
                --report_to wandb \
                --logging_steps 100 \
                --num_return_sequences 10

2. Train DSI-QG with Query-represented Corpus

After executing the script above, I tried to train the DSI-QG model. Here is the script, based on step 3 of the README:

LOCAL_RANK=0,1 python3 -m torch.distributed.launch --nproc_per_node 2 --master_port 29500 --use-env run.py \
        --task "DSI" \
        --model_name "google/mt5-base" \
        --run_name "XORQA-100k-mt5-base-DSI-QG" \
        --max_length 32 \
        --train_file data/xorqa_data/100k/xorqa_corpus.tsv.q10.docTquery \
        --valid_file data/xorqa_data/100k/xorqa_DSI_dev_data.json \
        --output_dir "models/XORQA-100k-mt5-base-DSI-QG" \
        --learning_rate 0.0005 \
        --warmup_steps 100000 \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 32 \
        --evaluation_strategy steps \
        --eval_steps 1000 \
        --max_steps 1000000 \
        --save_strategy steps \
        --dataloader_num_workers 10 \
        --save_steps 1000 \
        --save_total_limit 2 \
        --load_best_model_at_end \
        --gradient_accumulation_steps 1 \
        --report_to wandb \
        --logging_steps 100 \
        --dataloader_drop_last False \
        --metric_for_best_model Hits@10 \
        --greater_is_better True \
        --remove_prompt True
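
For orientation, the learning rate implied by --learning_rate 0.0005, --warmup_steps 100000, and --max_steps 1000000 can be sketched as below. This is only a sketch under an assumption: it models a linear warmup followed by a linear decay, which is the default scheduler of the Hugging Face Trainer; whether run.py keeps that default is not confirmed in this thread. The function name is hypothetical.

```python
def linear_warmup_lr(step, peak_lr=0.0005, warmup_steps=100_000, max_steps=1_000_000):
    """Learning rate at a given optimizer step.

    Assumption: linear warmup from 0 to peak_lr over warmup_steps,
    then linear decay to 0 at max_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0.0, (max_steps - step) / (max_steps - warmup_steps))
    return peak_lr * remaining


# Print the learning rate at a few points of the run.
for step in (1_000, 10_000, 100_000, 550_000, 1_000_000):
    print(step, linear_warmup_lr(step))
```

Under this assumption, the peak learning rate is only reached at step 100000, so the learning rate during the first evaluations (every 1000 steps) is a small fraction of 0.0005.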

Questions: The Performance is Significantly Different from the Paper

Though I haven't gone through the whole training procedure yet, the HIT@1 and HIT@10 scores both look strange so far.

I checked my hyperparameters to ensure they follow the README, and I did not edit or modify any code. I also used the provided checkpoint instead of training the QG model myself, to obtain more stable queries for the DSI-QG model.

Below are the HIT@1, HIT@10 logged by wandb.

Here is the HIT@1 curve in Fig. 2 of the paper:

I hope you can give me some comments. Thank you for your contribution!

ArvinZhuang commented 7 months ago

Hi @hi-i-m-GTooth, your scripts look correct to me. Can you check if the generated queries look ok (data/xorqa_data/100k/xorqa_corpus.tsv.q10.docTquery)?

hi-i-m-GTooth commented 7 months ago

Hi, Dr. Zhuang. Sorry for the late reply.

Below, I look at what the QG model generated with --num_return_sequences 1. I take DOC 40693 as an example.

Texts of Document 40693

Khalid bin Abdulaziz Al Saud ( ""; 13 February 1913 – 13 June 1982) was King of Saudi Arabia from 1975 to 1982. His reign saw both huge developments in the country due to increase in oil revenues and significant events in the Middle East. Khalid of Saudi Arabia

Generated Queries

In the same way, I generated the following queries with models/xor-tydi-docTquery-mt5-large.

{"text_id": 40693, "text": "من هو ملك السعودية ؟"}
{"text_id": 40693, "text": "কাতিফের বর্তমান রাষ্ট্রপতি কে?"}
{"text_id": 40693, "text": "Milloin Saud-Arabian kuningattaret olivat vallassa?"}
{"text_id": 40693, "text": "サウジ国王の初代王は誰?"}
{"text_id": 40693, "text": "칼리드 5세의 생일은 언젠가요?"}
{"text_id": 40693, "text": "Когда родился шейх Клуда бен Аблязия́з ал Сауд?"}
{"text_id": 40693, "text": "షేక్ బూదిద్దీన్ ఒబేర్ లాసా ఎప్పుడు మరణించాడు?"}

These can be translated as:

{"text_id": 40693, "text": "Who is the king of Saudi Arabia?"}
{"text_id": 40693, "text": "Who is the current president of Katif?"}
{"text_id": 40693, "text": "When were the queens of Saudi Arabia in power?"}
{"text_id": 40693, "text": "Who was the first Saudi king?"}
{"text_id": 40693, "text": "When is Khalid V's birthday?"}
{"text_id": 40693, "text": "When was Sheikh Kludah bin Ablyaziaz al Saud born?"}
{"text_id": 40693, "text": "When did Sheikh Budiddeen Obair Lhasa die?"}

Question

I noticed that some words do not appear in DOC 40693, e.g. Katif, queens, Khalid V, Sheikh Kludah, and Sheikh Budiddeen Obair Lhasa. (Since I am not familiar with those languages, I translated them with Google Translate.)

Are they quite different from your generated queries, or from what you would expect? Thank you!
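
A quick structural check of the generated query file can complement reading individual queries. The sketch below assumes the file is JSON lines with "text_id" and "text" fields, as in the records quoted above; the function name and the empty-query record in the sample are hypothetical and only illustrate the check.

```python
import json
from collections import Counter


def summarize_queries(lines):
    """Count generated queries per document and collect IDs with empty queries."""
    per_doc = Counter()
    empty = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        per_doc[record["text_id"]] += 1
        if not record["text"].strip():
            empty.append(record["text_id"])
    return per_doc, empty


# Two records from this thread plus one hypothetical empty query.
sample = [
    '{"text_id": 40693, "text": "Who is the king of Saudi Arabia?"}',
    '{"text_id": 40693, "text": "Who was the first Saudi king?"}',
    '{"text_id": 1, "text": ""}',
]
per_doc, empty = summarize_queries(sample)
print(per_doc)
print(empty)
```

To run it on the real file, an open file handle can be passed instead of the sample list, e.g. `summarize_queries(open("data/xorqa_data/100k/xorqa_corpus.tsv.q10.docTquery", encoding="utf-8"))`. With --num_return_sequences 10, one would expect each document to have 10 queries.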

ArvinZhuang commented 7 months ago

New words are expected, as the QG model can introduce new relevant words or simply hallucinate. So your QG step seems correct, and the issue might be in the training step. I suspect the batch size is just too small in your case; could you try setting --gradient_accumulation_steps to 4?
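
As a rough illustration of the batch sizes involved, the effective batch size per optimizer update is the per-device batch size times the number of GPUs times the accumulation steps. The helper name below is hypothetical; the numbers come from the training script posted earlier in this thread.

```python
def effective_batch_size(per_device_batch_size, num_gpus, grad_accum_steps):
    """Number of training examples contributing to one optimizer update."""
    return per_device_batch_size * num_gpus * grad_accum_steps


# Posted setup: batch size 32 per device on 2 GPUs, no accumulation.
print(effective_batch_size(32, 2, 1))  # 64
# Same setup with --gradient_accumulation_steps 4.
print(effective_batch_size(32, 2, 4))  # 256
```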

hi-i-m-GTooth commented 7 months ago

Hi, Dr. Zhuang.

Thanks for your valuable advice. I'll try training with --gradient_accumulation_steps set to 4!

By the way, during this discussion I also trained DSI-QG on the MSMARCO-100K dataset with the same process. The result (as the following images show) is normal, unlike the issue described above. According to #10, Mr. gcalabria could reproduce the results, so I don't think the checkpoint is the problem.

I hope this information helps us address the issue :)

[Images: two W&B charts, exported 2024-03-03 at 5:26 PM]

hi-i-m-GTooth commented 6 months ago

Hi, Dr. Zhuang.

I've tried training with --gradient_accumulation_steps set to 4. Unfortunately, I still can't reproduce the experiment on the XOR dataset.

If it is acceptable, may I request a Dockerfile containing the environment and scripts for reproduction? I think this could be the most practical way to fix this problem.

Thank you very much!

ArvinZhuang commented 6 months ago

Hi @hi-i-m-GTooth, unfortunately I do not have an environment container, but I don't think the problem comes from the environment; there is no tricky installation involved. Could you share the training loss as well?

Here are the training curves I got before:

In my case, it needs around 100k steps to start learning something. I am not quite sure how wandb logs steps with gradient_accumulation_steps > 1, so maybe just wait a bit longer? Also, maybe try XOR QA 10k to debug (a smaller dataset, thus faster).
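
On the step-counting question, one way to compare runs is to convert logged steps into examples processed. The sketch below assumes that a logged step corresponds to one optimizer update, as with the Hugging Face Trainer's global step; whether the wandb x-axis in these runs follows that convention is not confirmed here. The function names are hypothetical, and the batch size and GPU count are taken from the training script posted earlier in this thread.

```python
def samples_seen(optimizer_steps, per_device_batch_size, num_gpus, grad_accum_steps):
    """Training examples processed after a number of optimizer updates."""
    return optimizer_steps * per_device_batch_size * num_gpus * grad_accum_steps


def passes_per_gpu(optimizer_steps, grad_accum_steps):
    """Forward/backward passes each GPU runs for that many optimizer updates."""
    return optimizer_steps * grad_accum_steps


# 100k optimizer updates on 2 GPUs with batch size 32 per device.
print(samples_seen(100_000, 32, 2, 1))  # 6400000
print(samples_seen(100_000, 32, 2, 4))  # 25600000
print(passes_per_gpu(100_000, 4))       # 400000
```

Under this assumption, a logged step with accumulation 4 covers four times as many examples and takes roughly four times as long as a step with accumulation 1 on the same hardware.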

hi-i-m-GTooth commented 6 months ago

Hi, Dr. Zhuang.

Here is my training loss (--gradient_accumulation_steps = 4):

And this is the training loss with --gradient_accumulation_steps = 1:

Both are stuck at about 4. I'll wait a little longer to see the result for --gradient_accumulation_steps = 4.

The 10k dataset will help me reduce the computation cost. Thanks for the advice.

hi-i-m-GTooth commented 6 months ago

Hi, Dr. Zhuang.

I would like to let you know that, thanks to your suggestion, the --gradient_accumulation_steps = 4 setting works.

However, you may notice that the convergence speed is much slower than yours. I have actually trained the model for about 2 weeks with 2 GPUs and --gradient_accumulation_steps = 4. This is not what I expected, since 2 GPUs × 4 accumulation steps should in effect match the original settings for the XOR dataset.

ArvinZhuang commented 6 months ago

Good to see it worked! The convergence speed might be affected by the sampled docs and the generated queries, but I hope it converges to a similar level.

ArvinZhuang / DSI-QG