ArvinZhuang / DSI-QG

The official repository for "Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation", by Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon and Daxin Jiang.

How many steps does DSI training on MSMARCO need to converge? #7

Closed: pkuliyi2015 closed this issue 1 year ago

pkuliyi2015 commented 1 year ago

Dear authors, thanks for your great work!

This isn't my area of expertise, but I want to try DSI on my own topics. However, when I train on a single 32 GB V100 GPU, something seems to be going wrong: after 200k steps the model still hasn't converged, and Hits@10 stays at zero:

[screenshot: Hits@10 training curve, flat at zero]

Am I doing something wrong? Here is the script I used for training (I removed the distributed-training code because I only have one GPU):

python3 run.py \
  --task "DSI" \
  --model_name "google/mt5-base" \
  --run_name "MSMARCO-100k-mt5-base-DSI" \
  --max_length 256 \
  --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-mt5-base-DSI" \
  --learning_rate 0.0005 \
  --warmup_steps 100000 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 8 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --max_steps 1000000 \
  --save_strategy steps \
  --dataloader_num_workers 5 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --load_best_model_at_end \
  --gradient_accumulation_steps 1 \
  --report_to wandb \
  --logging_steps 100 \
  --dataloader_drop_last False \
  --metric_for_best_model Hits@10 \
  --greater_is_better True

ArvinZhuang commented 1 year ago

Hi @pkuliyi2015, are you using the MSMARCO data-processing code example in this repo? Could you also try t5-base instead of mt5?

pkuliyi2015 commented 1 year ago
  1. Yes, I'm using your get_data.sh without modification.
  2. No, I haven't tried t5-base yet. I'll give it a try.
  3. My computational resources are quite limited, so every run takes a long time to train. Could you please provide a DSI configuration that works on your side with a single card?
pkuliyi2015 commented 1 year ago

Update: I tried t5-base for an hour and the results look discouraging. Here is the Hits@10 score; it drops to zero after 8k steps:

[screenshot: Hits@10 curve dropping to zero after 8k steps]

pkuliyi2015 commented 1 year ago

Update: the XORQA setup you provided doesn't seem to converge either. I urgently need your help!

[screenshot: XORQA Hits@10 curve, not converging]
ArvinZhuang commented 1 year ago

Hi @pkuliyi2015, sorry to see that. I think it is very likely due to the batch size. Could you check my old repo? I also used a single V100 GPU there, and I set a larger batch size.
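
If memory is the constraint, gradient accumulation is one way to get a larger effective batch: assuming run.py forwards these standard HuggingFace Trainer flags unchanged (the flag names suggest it does), the effective batch per optimizer step is per_device_train_batch_size x gradient_accumulation_steps. A rough single-GPU sketch with illustrative values:

# Sketch: 16 x 8 = 128 examples per optimizer step on one GPU.
# Flags not shown here are omitted for brevity; keep them as in your original command.
python3 run.py \
  --task "DSI" \
  --model_name "google/mt5-base" \
  --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-mt5-base-DSI" \
  --per_device_train_batch_size 16 \
  --gradient_accumulation_steps 8 \
  --learning_rate 0.0005 \
  --max_steps 1000000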

pkuliyi2015 commented 1 year ago

I adjusted the batch size from your 16 to 96 (my V100 cannot fit 128; I don't know why). It still doesn't work at all.

[screenshot: Hits@10 curve, still all zeros at batch size 96]

I urgently need your help. I have spent a lot of money renting the V100 and didn't sleep last night. Here are the new params. Would you please at least give it a try and tell me what an effective set of params looks like?

python3 run.py \
  --task "DSI" \
  --model_name "t5-base" \
  --run_name "XORQA-100k-t5-base-DSI" \
  --max_length 256 \
  --train_file data/xorqa_data/100k/xorqa_DSI_train_data.json \
  --valid_file data/xorqa_data/100k/xorqa_DSI_dev_data.json \
  --output_dir "models/XORQA-100k-5-base-DSI" \
  --learning_rate 0.0005 \
  --warmup_steps 100000 \
  --per_device_train_batch_size 64 \
  --per_device_eval_batch_size 64 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --max_steps 1000000 \
  --save_strategy steps \
  --dataloader_num_workers 10 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --load_best_model_at_end \
  --gradient_accumulation_steps 1 \
  --report_to wandb \
  --logging_steps 100 \
  --dataloader_drop_last False \
  --metric_for_best_model Hits@10 \
  --greater_is_better True

ArvinZhuang commented 1 year ago

@pkuliyi2015 For your last config: you are using t5-base, but XORQA is a multilingual dataset. I suggest you keep t5-base and change back to the MSMARCO train and dev files. For a larger batch size, you can try setting --max_length to 128; then you should be able to fit a larger batch.
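
For example, here is a sketch of the MSMARCO command with only the length and batch-size flags changed (128 is illustrative; use whatever fits your card, and keep the remaining flags as before):

# Shorter sequences need less activation memory, leaving room for a larger batch.
# Other flags omitted for brevity; keep them as in the earlier MSMARCO command.
python3 run.py \
  --task "DSI" \
  --model_name "t5-base" \
  --max_length 128 \
  --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-t5-base-DSI" \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 128 \
  --learning_rate 0.0005 \
  --max_steps 1000000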

But in this case, I don't suggest training the original DSI, as it takes a lot of computation. In fact, if you check our paper, t5-base is not good for DSI at all. I suggest you directly try our DSI-QG, which converges much faster. And for the MSMARCO example, you don't need to train a query generation model (you can skip step 1), so it is not expensive to try.

LightChaser666 commented 1 year ago

Update: I'm the same person, on another GitHub account.

I'm trying your DSI-QG with the following commands:

Step 2:

python3 run.py \
  --task generation \
  --model_name castorini/doc2query-t5-large-msmarco \
  --per_device_eval_batch_size 32 \
  --run_name docTquery-MSMARCO-generation \
  --max_length 256 \
  --valid_file data/msmarco_data/100k/msmarco_corpus.tsv \
  --output_dir temp \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --logging_steps 100 \
  --num_return_sequences 10
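
(--num_return_sequences 10 asks the model to generate 10 queries per passage, which is presumably where the .q10 suffix of the file used in step 3 comes from.)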

The generated data looks as follows (last line):

[screenshot: last line of the generated docTquery output]

Step 3:

python3 run.py \
  --task "DSI" \
  --model_name "t5-base" \
  --run_name "MSMARCO-100k-t5-base-DSI-QG" \
  --max_length 32 \
  --train_file data/msmarco_data/100k/msmarco_corpus.tsv.q10.docTquery \
  --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
  --output_dir "models/MSMARCO-100k-t5-base-DSI-QG" \
  --learning_rate 0.0005 \
  --warmup_steps 100000 \
  --per_device_train_batch_size 128 \
  --per_device_eval_batch_size 128 \
  --evaluation_strategy steps \
  --eval_steps 1000 \
  --max_steps 1000000 \
  --save_strategy steps \
  --dataloader_num_workers 10 \
  --save_steps 1000 \
  --save_total_limit 2 \
  --load_best_model_at_end \
  --gradient_accumulation_steps 1 \
  --report_to wandb \
  --logging_steps 100 \
  --dataloader_drop_last False \
  --metric_for_best_model Hits@10 \
  --greater_is_better True \
  --remove_prompt True

I cleaned up all my files and reinstalled Python 3.8 and your requirements before starting this run. But judging from the first 3k steps, DSI-QG still doesn't seem to work. I'm training on a single V100 now and will post the outcome here.

Would you mind trying to train on a single V100 card yourself? I have spent a lot of money on this. I hope to find a fast way to reproduce your results, or at least get something other than all zeros in Hits@10.

[screenshot: DSI-QG Hits@10 over the first 3k steps]
pkuliyi2015 commented 1 year ago

Would you mind providing your loss curves, so I can find the issue with my training? I have printed out my training and evaluation sets; they don't give me any hints.

ArvinZhuang commented 1 year ago

Hi @LightChaser666 @pkuliyi2015

That looks weird; I'll have a look today.

pkuliyi2015 commented 1 year ago

I believe there may be a significant difference in the hyperparameters between single-card and multi-card training. I spent some money renting 8 cards, and the code works correctly there.
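(With standard data-parallel training, the effective batch scales with the number of cards: 8 cards at --per_device_train_batch_size 128, for example, give 8 x 128 = 1024 examples per optimizer step, versus 128 on a single card.)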

ArvinZhuang commented 1 year ago

Oh, okay, thanks for testing this. I'm sorry to hear that you spent so much money on this. I guess making DSI training more efficient is a good research direction! I'm closing this issue now, since there is no problem with the code.