castorini / pygaggle

a gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini
http://pygaggle.ai/
Apache License 2.0

Cannot reproduce "monot5-base-msmarco-10k" via pytorch script #307

Open polgrisha opened 1 year ago

polgrisha commented 1 year ago

Hello!

I am trying to reproduce the quality of monoT5 on the BEIR benchmark from the recent article. But after running the script finetune_monot5.py for one epoch, as stated in the description of the checkpoint "monot5-base-msmarco-10k", my results are noticeably lower.

For example, on NQ my checkpoint gives 0.5596 nDCG@10, while the original checkpoint gives 0.5676. On NFCorpus: 0.3604 nDCG@10 with my checkpoint vs. 0.3778 with the original.

So, is one epoch of training monoT5 with the pytorch script equivalent to one epoch of training with TF? And with what hyperparameters can I reproduce the performance of "monot5-base-msmarco-10k"?

rodrigonogueira4 commented 1 year ago

Hi @polgrisha, we haven't tested that pytorch script extensively, especially in the zero-shot setting, but it seems that some hyperparameters were wrong.

I opened a PR with the ones we used to train the model on TPUs + TF: https://github.com/castorini/pygaggle/pull/308

Could you please give it a try?

rodrigonogueira4 commented 1 year ago

I was looking at my logs and I was never able to reproduce the results on pytorch+GPU using the same hyperparameters used to finetune on TF+TPUs. The best ones I found were the ones already in the repo.

However, in another project, I found that this configuration gives good results for finetuning T5 on PT+GPUs:

--train_batch_size=4 --accumulate_grad_batches=32 --optimizer=AdamW --lr=3e-4 (or 3e-5) --weight_decay=5e-5

Could you please give it a try?
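A minimal sketch of what this configuration corresponds to in plain PyTorch + Hugging Face transformers, assuming torch.optim.AdamW and manual gradient accumulation (this is only an illustration of the suggested settings, not the actual finetune_monot5.py code path; the toy batch is made up):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=5e-5)

train_batch_size = 4          # --train_batch_size=4
accumulate_grad_batches = 32  # --accumulate_grad_batches=32

# Toy micro-batch in the monoT5 text-to-text format (made-up example).
inputs = ["Query: what is a gaggle Document: A gaggle is a flock of geese. Relevant:"] * train_batch_size
targets = ["true"] * train_batch_size

model.train()
for micro_step in range(accumulate_grad_batches):
    enc = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(targets, return_tensors="pt", padding=True).input_ids
    # Scale the loss so 32 micro-batches add up to one effective batch of 4 * 32 = 128 examples.
    loss = model(**enc, labels=labels).loss / accumulate_grad_batches
    loss.backward()

optimizer.step()       # one optimizer update per 32 accumulated micro-batches
optimizer.zero_grad()
```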

polgrisha commented 1 year ago

@rodrigonogueira4 Thanks for your response

I tried the hyperparams you suggested:

--train_batch_size=4 --accumulate_grad_batches=32 --optimizer=AdamW --lr=3e-5 --weight_decay=5e-5

And so far, the closest result was obtained by training monoT5 for 9k steps (10k steps is one epoch with batch_size=4, accum_steps=32, and 2 GPUs).

nDCG@10 (original checkpoint vs. mine):

- TREC-COVID: 0.7845 vs. 0.7899
- NFCorpus: 0.3778 vs. 0.3731
- NQ: 0.5676 vs. 0.5688
- FIQA-2018: 0.4129 vs. 0.4130

rodrigonogueira4 commented 1 year ago

Hi @polgrisha, thanks for running this experiment. It seems that you got pretty close to the original training in mesh-tensorflow+TPUs.

Small differences like these on individual BEIR datasets are expected, especially since you are using a different optimizer. However, to be really sure, I would run on a few more datasets and compare the average against the results reported in the "No Parameter Left Behind" paper.

zlh-source commented 1 year ago

> And so far, the closest result was obtained by training monoT5 for 9k steps (10k steps is one epoch with batch_size=4, accum_steps=32, and 2 GPUs).

Hello, thank you very much for your work! But I still have some questions. With batch_size=4, accum_steps=32, and 2 GPUs, one step is 4*32*2 = 256 examples. The Hugging Face checkpoint "monot5-base-msmarco-10k" was trained for 10k steps with a batch size of 128, using the first 6.4e5 lines of the training set. So:

(1) Did you use twice as much data as "monot5-base-msmarco-10k"?
(2) Or did you also use the first 6.4e5 lines, but train over them twice?
(3) Or did you also use the first 6.4e5 lines, but, because the batch size is twice as large, train for only 5k steps?
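For concreteness, the arithmetic behind these three readings, using only the numbers quoted in this thread (which reading is correct is exactly what is being asked):

```python
# Numbers as quoted in the thread; no new facts, just bookkeeping.
pt_step = 4 * 32 * 2            # per-GPU batch * accumulation * GPUs = 256 examples per step
pt_examples = pt_step * 10_000  # 2,560,000 examples over 10k steps

hf_step = 128                   # batch size reported for monot5-base-msmarco-10k
hf_examples = hf_step * 10_000  # 1,280,000 examples over 10k steps

print(pt_examples / hf_examples)  # 2.0 -> twice as many examples, hence questions (1)-(3)
```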

rodrigo-f-nogueira commented 12 months ago

Sorry about the late reply. The correct configuration should be batches of 128 examples, so 10k steps means 6.4M lines of the triples.train.small.tsv file.
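For reference, a minimal sketch of the assumed preprocessing (my reading of the setup, not necessarily the exact code in finetune_monot5.py): each tab-separated line of triples.train.small.tsv is a (query, positive passage, negative passage) triple, and monoT5 is trained on "Query: ... Document: ... Relevant:" inputs with "true"/"false" targets.

```python
# Hedged sketch: how one line of triples.train.small.tsv might expand into
# monoT5 training examples in the text-to-text format.
def triple_to_examples(line: str):
    query, positive, negative = line.rstrip("\n").split("\t")
    return [
        (f"Query: {query} Document: {positive} Relevant:", "true"),
        (f"Query: {query} Document: {negative} Relevant:", "false"),
    ]

# Made-up example line for illustration.
line = "what is a gaggle\tA gaggle is a flock of geese.\tA goose is a waterfowl."
for source, target in triple_to_examples(line):
    print(source, "->", target)
```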