Hi @pkuliyi2015, are you using the msmarco data processing code example in this repo? Could you also try t5-base instead of mt5?
Update: I tried t5-base for 1 hour and the results look discouraging. Here is the Hits@10 score; it drops to zero after 8k steps:
Update: the provided XORQA setup does not seem to converge either. I urgently need your help!
Hi @pkuliyi2015, sorry to see that. I think it is most likely due to the batch size. Can you check my old repo, where I also used a single V100 GPU but set a larger batch size?
I adjusted the batch size from your 16 to 96 (my V100 cannot fit 128; I don't know why). It still doesn't work at all.
I urgently need your help. I have spent a lot of money renting the V100 and did not sleep last night. Here are the new params. Would you please at least give it a try and tell me what an effective set of params looks like?
python3 run.py \
    --task "DSI" \
    --model_name "t5-base" \
    --run_name "XORQA-100k-t5-base-DSI" \
    --max_length 256 \
    --train_file data/xorqa_data/100k/xorqa_DSI_train_data.json \
    --valid_file data/xorqa_data/100k/xorqa_DSI_dev_data.json \
    --output_dir "models/XORQA-100k-5-base-DSI" \
    --learning_rate 0.0005 \
    --warmup_steps 100000 \
    --per_device_train_batch_size 64 \
    --per_device_eval_batch_size 64 \
    --evaluation_strategy steps \
    --eval_steps 1000 \
    --max_steps 1000000 \
    --save_strategy steps \
    --dataloader_num_workers 10 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --gradient_accumulation_steps 1 \
    --report_to wandb \
    --logging_steps 100 \
    --dataloader_drop_last False \
    --metric_for_best_model Hits@10 \
    --greater_is_better True
@pkuliyi2015 In your last config you are using t5-base but with the XORQA dataset, which is multi-lingual. I suggest you keep t5-base and switch back to the msmarco train and dev files. For a larger batch size, you can try setting --max_length to 128; then you should be able to fit a larger batch.
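A rough sketch of what that adjusted command could look like, assuming the msmarco 100k files produced by this repo's data processing example (the run name, output dir, and a per-device batch size of 128 are guesses, not tested values; lower the batch size if it does not fit in 32 GB):

# Sketch only: t5-base on msmarco 100k with --max_length 128.
# The batch size of 128 is an assumption and may need to be reduced on a 32 GB V100.
python3 run.py \
    --task "DSI" \
    --model_name "t5-base" \
    --run_name "MSMARCO-100k-t5-base-DSI" \
    --max_length 128 \
    --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
    --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
    --output_dir "models/MSMARCO-100k-t5-base-DSI" \
    --learning_rate 0.0005 \
    --warmup_steps 100000 \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 128 \
    --evaluation_strategy steps \
    --eval_steps 1000 \
    --max_steps 1000000 \
    --save_strategy steps \
    --dataloader_num_workers 10 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --gradient_accumulation_steps 1 \
    --report_to wandb \
    --logging_steps 100 \
    --dataloader_drop_last False \
    --metric_for_best_model Hits@10 \
    --greater_is_better True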
But in this case I don't suggest training the original DSI at all, as it takes a lot of computation. In fact, if you check our paper, t5-base is not good for DSI. I suggest you directly try our DSI-QG, which converges much faster. And for the msmarco example you don't need to train a query generation model (you can skip step 1), so it is not expensive to try.
Update: I'm the same person, posting from another GitHub account.
I'm trying your DSI-QG with the following commands:
Step 2:

python3 run.py \
    --task generation \
    --model_name castorini/doc2query-t5-large-msmarco \
    --per_device_eval_batch_size 32 \
    --run_name docTquery-MSMARCO-generation \
    --max_length 256 \
    --valid_file data/msmarco_data/100k/msmarco_corpus.tsv \
    --output_dir temp \
    --dataloader_num_workers 10 \
    --report_to wandb \
    --logging_steps 100 \
    --num_return_sequences 10
The generated data is as follows (last line):
Step 3:
python3 run.py \
    --task "DSI" \
    --model_name "t5-base" \
    --run_name "MSMARCO-100k-t5-base-DSI-QG" \
    --max_length 32 \
    --train_file data/msmarco_data/100k/msmarco_corpus.tsv.q10.docTquery \
    --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
    --output_dir "models/MSMARCO-100k-t5-base-DSI-QG" \
    --learning_rate 0.0005 \
    --warmup_steps 100000 \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 128 \
    --evaluation_strategy steps \
    --eval_steps 1000 \
    --max_steps 1000000 \
    --save_strategy steps \
    --dataloader_num_workers 10 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --gradient_accumulation_steps 1 \
    --report_to wandb \
    --logging_steps 100 \
    --dataloader_drop_last False \
    --metric_for_best_model Hits@10 \
    --greater_is_better True \
    --remove_prompt True
I cleaned up all my files and reinstalled Python 3.8 and your requirements before starting this run. But judging from the first 3k steps, DSI-QG still doesn't seem to work. I'm training on a single V100 now and will update the outcome here.
Would you mind trying to train on a single V100 card? I have spent a lot of money on this. I hope to find a fast way to reproduce your results, or at least not get all zeros in Hits@10.
Would you mind providing your loss curves, so I can find the issue with my training? I have printed out my training and evaluation sets; they don't give me any hint.
Hi @LightChaser666 @pkuliyi2015
That looks weird, I'll have a look today.
I believe there may be a significant difference in the hyperparameters between single-card and multi-card training. I spent some money renting 8 cards, and with them the code works correctly.
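If the difference really is the effective batch size, a single card might approximate the multi-card setup with gradient accumulation (effective batch = per-device batch size x number of GPUs x accumulation steps). Just as a sketch, reusing the DSI-QG step-3 command from above; the per-device batch of 32, the accumulation factor of 8, and the run/output names are assumptions, not tested values:

# Sketch only: emulate a larger effective batch on a single V100 via gradient accumulation.
# 32 (per device) x 1 GPU x 8 (accumulation steps) = 256 examples per optimizer step;
# whether this matches the authors' multi-card setting is an assumption.
python3 run.py \
    --task "DSI" \
    --model_name "t5-base" \
    --run_name "MSMARCO-100k-t5-base-DSI-QG-accum" \
    --max_length 32 \
    --train_file data/msmarco_data/100k/msmarco_corpus.tsv.q10.docTquery \
    --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
    --output_dir "models/MSMARCO-100k-t5-base-DSI-QG-accum" \
    --learning_rate 0.0005 \
    --warmup_steps 100000 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --evaluation_strategy steps \
    --eval_steps 1000 \
    --max_steps 1000000 \
    --save_strategy steps \
    --dataloader_num_workers 10 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --gradient_accumulation_steps 8 \
    --report_to wandb \
    --logging_steps 100 \
    --dataloader_drop_last False \
    --metric_for_best_model Hits@10 \
    --greater_is_better True \
    --remove_prompt True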
Oh, okay, thanks for testing this. I'm sorry to hear that you spent a lot of money on this. I guess making DSI training more efficient is a good research direction! I'm closing this issue now since there is no problem with the code.
Dear authors, thanks for your great work!
I don't specialize in this field but want to try DSI on my own topics. However, when I tried to train on a single 32 GB V100 GPU, it seems I did something wrong: after 200k steps it still doesn't converge, and Hits@10 stays at zero:
Am I doing something wrong? Here is the script I used for training (I removed the distributed training code because I only have one GPU):
python3 run.py \
    --task "DSI" \
    --model_name "google/mt5-base" \
    --run_name "MSMARCO-100k-mt5-base-DSI" \
    --max_length 256 \
    --train_file data/msmarco_data/100k/msmarco_DSI_train_data.json \
    --valid_file data/msmarco_data/100k/msmarco_DSI_dev_data.json \
    --output_dir "models/MSMARCO-100k-mt5-base-DSI" \
    --learning_rate 0.0005 \
    --warmup_steps 100000 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 8 \
    --evaluation_strategy steps \
    --eval_steps 1000 \
    --max_steps 1000000 \
    --save_strategy steps \
    --dataloader_num_workers 5 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --load_best_model_at_end \
    --gradient_accumulation_steps 1 \
    --report_to wandb \
    --logging_steps 100 \
    --dataloader_drop_last False \
    --metric_for_best_model Hits@10 \
    --greater_is_better True