Hi @g-lopes, thank you for your kind words! I'm glad I can help :)
It looks like your plots are converging much slower than mine. May I ask exactly which setting you are running, e.g., the MS MARCO or XOR QA dataset? Are you running DSI-QG or the original DSI? How many GPUs and what batch size are you using?
That was quick :)
First, I generated the queries using the following command:
python3 -m torch.distributed.launch --nproc_per_node=8 run.py \
--task generation \
--model_name castorini/doc2query-t5-large-msmarco \
--per_device_eval_batch_size 32 \
--run_name docTquery-XORQA-generation \
--max_length 256 \
--valid_file data/xorqa_data/100k/xorqa_corpus.tsv \
--output_dir temp \
--dataloader_num_workers 10 \
--report_to wandb \
--logging_steps 100 \
--num_return_sequences 10
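(For context, this generation step is doc2query-style query generation: a T5 model samples several synthetic queries for each document. Below is a rough standalone sketch of what castorini/doc2query-t5-large-msmarco does via the plain Hugging Face API; it is only an illustration, not the repo's run.py code, and the example passage is made up.)

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

model_name = "castorini/doc2query-t5-large-msmarco"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Example passage (made up); in the repo this would come from the corpus .tsv file.
doc = ("The Manhattan Project was a research and development undertaking "
       "during World War II that produced the first nuclear weapons.")
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=256)

# Sample 10 synthetic queries for the document, mirroring --num_return_sequences 10.
outputs = model.generate(
    inputs.input_ids,
    max_length=32,
    do_sample=True,
    top_k=10,
    num_return_sequences=10,
)
for i, ids in enumerate(outputs):
    print(f"query {i}: {tokenizer.decode(ids, skip_special_tokens=True)}")
```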
After that, I started training the model by running:
python3 -m torch.distributed.launch --nproc_per_node=8 run.py \
--task "DSI" \
--model_name "google/mt5-base" \
--run_name "XORQA-100k-mt5-base-DSI-QG" \
--max_length 32 \
--train_file data/xorqa_data/100k/xorqa_corpus.tsv.q10.docTquery \
--valid_file data/xorqa_data/100k/xorqa_DSI_dev_data.json \
--output_dir "models/XORQA-100k-mt5-base-DSI-QG" \
--learning_rate 0.0005 \
--warmup_steps 100000 \
--per_device_train_batch_size 32 \
--per_device_eval_batch_size 32 \
--evaluation_strategy steps \
--eval_steps 1000 \
--max_steps 500000 \
--save_strategy steps \
--dataloader_num_workers 10 \
--save_steps 1000 \
--save_total_limit 2 \
--load_best_model_at_end \
--gradient_accumulation_steps 1 \
--report_to wandb \
--logging_steps 100 \
--dataloader_drop_last False \
--metric_for_best_model Hits@10 \
--greater_is_better True \
--remove_prompt True
I am running DSI-QG with 8 NVIDIA A100 GPUs and a per-device batch size of 32.
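(For reference, with --nproc_per_node=8, --per_device_train_batch_size 32, and --gradient_accumulation_steps 1, the effective batch size per optimization step is 8 × 32 = 256.)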
Hi @g-lopes, according to your scripts, I think in step 2 you made the mistake of using castorini/doc2query-t5-large-msmarco, which is an English-only QG model, to generate cross-lingual queries for the XOR QA dataset.
You need to do step 1 to train a cross-lingual QG model before running step 2. Otherwise, you may want to test with the MS MARCO dataset instead.
Or, you can try our trained cross-lingual QG model, which I recently uploaded to Hugging Face: https://huggingface.co/ielabgroup/xor-tydi-docTquery-mt5-large. However, I used a slightly different prompt for training this model (check out the model card on the page above), so if you want to use this model directly in step 2, you need to change the prompt in this line to match the prompt in the model card.
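(For future readers: "changing the prompt" just means making sure the text fed to the QG model at inference time matches the template the model was fine-tuned with. A minimal sketch, assuming a hypothetical template string; the actual prompt has to be copied from the model card linked above.)

```python
# NOTE: the template below is a placeholder, not the real one; replace it with
# the exact prompt given in the ielabgroup/xor-tydi-docTquery-mt5-large model card.
def build_qg_input(passage: str, lang: str) -> str:
    return f"Generate a {lang} question for this passage: {passage}"

print(build_qg_input("Der Eiffelturm wurde 1889 eröffnet.", "German"))
```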
😃 Thank you very much for your help. I will try to follow your instructions :) Cheers!
@ArvinZhuang I've double-checked the scripts that I was using and I've achieved much better results.
As you can see, I've achieved a hits@10 score of 0.8, and the training is not finished yet 😄
The results above were obtained using the MS MARCO dataset with 100k data points; 10 queries per document were generated using the castorini/doc2query-t5-large-msmarco model.
I am trying to convert the hits@10 scores that I have to absolute values so that I can compare my results with the ones in your paper. Can you please explain to me how the hit scores of Table 1 were computed?
I am multiplying my score by the number of documents in the evaluation dataset. In my case, the evaluation dataset has 6980 examples and the score is 0.8, so this would result in a score of 5504, which makes no sense to me.
Thanks for your help one more time :)
@g-lopes Hi, the code for computing the Hits scores is in this function: https://github.com/ArvinZhuang/DSI-QG/blob/479d8d74f9b6c83de99a6723d769e23f1402a8ba/run.py#L43
If you want to do inference with saved model checkpoints, you can try something like this: https://github.com/ArvinZhuang/DSI-QG/issues/1#issuecomment-1343063412
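(Not the repo's exact code, see the linked function for that, but conceptually Hits@k is simply the fraction of evaluation queries whose gold docid appears among the top-k predicted docids. A minimal standalone sketch of that idea:)

```python
from typing import List

def hits_at_k(ranked_docids: List[List[str]], gold_docids: List[str], k: int = 10) -> float:
    """Fraction of queries whose gold docid appears in the top-k predictions."""
    hits = sum(gold in preds[:k] for preds, gold in zip(ranked_docids, gold_docids))
    return hits / len(gold_docids)

# e.g. a return value of 0.8 means the gold document was found in the top 10
# for 80% of the evaluation queries.
```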
I saw this function. My idea of multiplying by the number of documents in the evaluation dataset came from line 61: https://github.com/ArvinZhuang/DSI-QG/blob/479d8d74f9b6c83de99a6723d769e23f1402a8ba/run.py#L61
Did the results from Table 1 of your paper come directly from this function? Or have you also multiplied by the size of the evaluation set?
Which test dataset did you use to generate these results?
Ah, I see what you mean. Yes, my results come directly from this function; the numbers in the table are simply percentages, i.e., the numbers from the function multiplied by 100. @g-lopes So I think you have successfully reproduced my results :)
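(So, for example, a Hits@10 of 0.8 from the evaluation function corresponds to 80.0 in Table 1, i.e., the gold document is retrieved in the top 10 for 80% of the evaluation queries.)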
Aha! Great! Thank you very much :D
First of all, I would like to thank you for making the code for your work available, and I would also like to say that I really liked your paper. It is very interesting.
I am currently writing my master's thesis and I would like to use part of your code to build my own mathematical IR system. So, the first thing I've done is to try to run your scripts and see if my results match yours.
So what I did was run the get_data.sh script and then the scripts of steps 2 and 3 of the README file. The model is still being trained, but I am skeptical about the results I am getting so far. Can you please confirm whether the graphs below are in accordance with your results?
Thank you for your attention :)