ArvinZhuang / DSI-QG

The official repository for "Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation", by Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon and Daxin Jiang.

How many steps does DSI training on MSMARCO need to converge? #7

Closed: pkuliyi2015 closed this issue 1 year ago

pkuliyi2015 commented 1 year ago

Dear authors, thanks for your great work!

This isn't my area of expertise, but I want to try DSI on my own topics. However, when I train on a single 32 GB V100 GPU, something seems to be going wrong: after 200k steps the model still hasn't converged, and Hits@10 stays at zero:

[screenshot: Hits@10 training curve, flat at zero]

Am I doing something wrong? Here is the script I used for training (I removed the distributed-training code because I only have one GPU):

python3 run.py \
  --task "DSI" \
  --model_name "google/mt5-base" \
  --run_name "MSMARCO-100k-mt5-base-DSI" \
  --max_length 256 \
  --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-mt5-base-DSI" \
  --learning_rate 0.0005 \
  --warmup_steps 100000 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --max_steps 1000000 \
  --save_strategy steps \
  --dataloader_num_workers 5 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --load_best_model_at_end \
  --gradient_accumulation_steps 1 \
  --report_to wandb \
  --logging_steps 100 \
  --dataloader_drop_last False \
  --metric_for_best_model Hits@10 \
  --greater_is_better True

ArvinZhuang commented 1 year ago

Hi @pkuliyi2015, are you using the MSMARCO data-processing code example in this repo? Could you also try t5-base instead of mt5?

pkuliyi2015 commented 1 year ago
  1. Yes, I'm using your get_data.sh without modification.
  2. No, I haven't tried t5-base yet. I'll give it a try.
  3. My computational resources are quite limited, so every run takes a long time to train. Could you please provide a DSI configuration that works on your side with a single card?
pkuliyi2015 commented 1 year ago

Update: I tried t5-base for an hour and the results look discouraging. Here is the Hits@10 score; it drops to zero after 8k steps:

[screenshot: Hits@10 curve dropping to zero after 8k steps]

pkuliyi2015 commented 1 year ago

Update: the XORQA setup you provided doesn't seem to converge either. I urgently need your help!

[screenshot: XORQA Hits@10 curve, not converging]
ArvinZhuang commented 1 year ago

Hi @pkuliyi2015, sorry to see that. I think it is very likely due to the batch size. Could you check my old repo? I also used a single V100 GPU there, and I set a larger batch size.
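
If memory is the constraint, gradient accumulation is one way to get a larger effective batch: assuming run.py forwards these standard HuggingFace Trainer flags unchanged (the flag names suggest it does), the effective batch per optimizer step is per_device_train_batch_size x gradient_accumulation_steps. A rough single-GPU sketch with illustrative values:

# Sketch: 16 x 8 = 128 examples per optimizer step on one GPU.
# Flags not shown here are omitted for brevity; keep them as in your original command.
python3 run.py \
  --task "DSI" \
  --model_name "google/mt5-base" \
  --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-mt5-base-DSI" \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 8 \
  --learning_rate 0.0005 \
  --max_steps 1000000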

pkuliyi2015 commented 1 year ago

I adjusted the batch size from your 16 to 96 (my V100 cannot fit 128; I don't know why). It still doesn't work at all.

[screenshot: Hits@10 curve, still all zeros at batch size 96]

I urgently need your help. I have spent a lot of money renting the V100 and didn't sleep last night. Here are the new params. Would you please at least give it a try and tell me what an effective set of params looks like?

python3 run.py \
  --task "DSI" \
  --model_name "t5-base" \
  --run_name "XORQA-100k-t5-base-DSI" \
  --max_length 256 \
  --train_file data/xorqa_data/100k/xorqa_DSI_train_data.json \
  --valid_file data/xorqa_data/100k/xorqa_DSI_dev_data.json \
  --output_dir "models/XORQA-100k-5-base-DSI" \
  --learning_rate 0.0005 \
  --warmup_steps 100000 \
  --per_device_train_batch_size 64 \
  --per_device_eval_batch_size 64 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --max_steps 1000000 \
  --save_strategy steps \
  --dataloader_num_workers 10 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --load_best_model_at_end \
  --gradient_accumulation_steps 1 \
  --report_to wandb \
  --logging_steps 100 \
  --dataloader_drop_last False \
  --metric_for_best_model Hits@10 \
  --greater_is_better True

ArvinZhuang commented 1 year ago

@pkuliyi2015 For your last config: you are using t5-base, but XORQA is a multilingual dataset. I suggest you keep t5-base and change back to the MSMARCO train and dev files. For a larger batch size, you can try setting --max_length to 128; then you should be able to fit a larger batch.
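
For example, here is a sketch of the MSMARCO command with only the length and batch-size flags changed (128 is illustrative; use whatever fits your card, and keep the remaining flags as before):

# Shorter sequences need less activation memory, leaving room for a larger batch.
# Other flags omitted for brevity; keep them as in the earlier MSMARCO command.
python3 run.py \
  --task "DSI" \
  --model_name "t5-base" \
  --max_length 128 \
  --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-t5-base-DSI" \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 128 \
  --learning_rate 0.0005 \
  --max_steps 1000000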

But in this case, I don't suggest training the original DSI, as it takes a lot of computation. In fact, if you check our paper, t5-base is not good for DSI at all. I suggest you directly try our DSI-QG, which converges much faster. And for the MSMARCO example, you don't need to train a query generation model (you can skip step 1), so it is not expensive to try.

LightChaser666 commented 1 year ago

Update: I'm the same person, on another GitHub account.

I'm trying your DSI-QG with the following commands:

Step 2:

python3 run.py \
  --task generation \
  --model_name castorini/doc2query-t5-large-msmarco \
  --per_device_eval_batch_size 32 \
  --run_name docTquery-MSMARCO-generation \
  --max_length 256 \
  --valid_file data/msmarco_data/100k/msmarco_corpus.tsv \
  --output_dir temp \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --logging_steps 100 \
  --num_return_sequences 10
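
(--num_return_sequences 10 asks the model to generate 10 queries per passage, which is presumably where the .q10 suffix of the file used in step 3 comes from.)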

The generated data looks as follows (last line):

[screenshot: last line of the generated docTquery output]

Step 3:

python3 run.py \
  --task "DSI" \
  --model_name "t5-base" \
  --run_name "MSMARCO-100k-t5-base-DSI-QG" \
  --max_length 32 \
  --train_file data/msmarco_data/100k/msmarco_corpus.tsv.q10.docTquery \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-t5-base-DSI-QG" \
  --learning_rate 0.0005 \
  --warmup_steps 100000 \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 128 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --max_steps 1000000 \
  --save_strategy steps \
  --dataloader_num_workers 10 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --load_best_model_at_end \
  --gradient_accumulation_steps 1 \
  --report_to wandb \
  --logging_steps 100 \
  --dataloader_drop_last False \
  --metric_for_best_model Hits@10 \
  --greater_is_better True \
  --remove_prompt True

I cleaned up all my files and reinstalled Python 3.8 and your requirements before starting this run. But judging from the first 3k steps, DSI-QG still doesn't seem to work. I'm training on a single V100 now and will post the outcome here.

Would you mind trying to train on a single V100 card yourself? I have spent a lot of money on this. I hope to find a fast way to reproduce your results, or at least get something other than all zeros in Hits@10.

[screenshot: DSI-QG Hits@10 over the first 3k steps]
pkuliyi2015 commented 1 year ago

Would you mind providing your loss curves, so I can find the issue with my training? I have printed out my training and evaluation sets; they don't give me any hints.

ArvinZhuang commented 1 year ago

Hi @LightChaser666 @pkuliyi2015

That looks weird; I'll have a look today.

pkuliyi2015 commented 1 year ago

I believe there may be a significant difference in the hyperparameters between single-card and multi-card training. I spent some money renting 8 cards, and the code works correctly there.
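(With standard data-parallel training, the effective batch scales with the number of cards: 8 cards at --per_device_train_batch_size 128, for example, give 8 x 128 = 1024 examples per optimizer step, versus 128 on a single card.)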

ArvinZhuang commented 1 year ago

Oh, okay, thanks for testing this. I'm sorry to hear that you spent so much money on this. I guess making DSI training more efficient is a good research direction! I'm closing this issue now, since there is no problem with the code.