MGithubGA opened this issue 4 years ago
CC @kartikayk
Thanks in advance for the reply! @lematt1991 @kartikayk
I have the same reproduction problem with the following settings (a rough sketch of this configuration follows the list):
--learning_rate 5e-5
--batch_size 32
--n_gpu 3 (using DataParallel)
--max_steps 12000 (roughly 3 epochs)
--save_steps 2000
--warmup_steps 1200 (first 10% training steps)
--max_seq_length 128
The obtained result is 74.2 (averaged over 5 runs).
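For reference, here is a rough sketch of what this configuration might look like with the HuggingFace `transformers` Trainer (which falls back to DataParallel when several GPUs are visible and no distributed launcher is used). This is an assumption about the setup, not the exact script I ran; the model name and output path are placeholders.

```python
# Rough sketch only: the flags above expressed with the HuggingFace `transformers`
# Trainer API. Model name, output path, and the per-device vs. total reading of
# batch_size=32 are assumptions.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"  # assumption: base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)  # 3 XNLI classes

training_args = TrainingArguments(
    output_dir="xlmr_base_xnli",     # placeholder output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,  # --batch_size 32; unclear if per-GPU or total across the 3 GPUs
    max_steps=12_000,                # roughly 3 epochs
    save_steps=2_000,
    warmup_steps=1_200,              # first 10% of training steps
)

# max_seq_length of 128 is applied at tokenization time, e.g.:
# tokenizer(premise, hypothesis, truncation=True, max_length=128)
```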
Are there steps outlined to do the XNLI fine-tuning on xlm-r?
Hi! Apologies for the delayed response here, seems like I missed some questions. A couple of comments:
- Please ensure that you have the latest checkpoint for XLMR-Base here. The updated numbers in the paper are with a checkpoint that was trained for 1.5M updates (more details on the fairseq page).
- For finetuning, you can look at the PyText Tutorial (https://github.com/facebookresearch/pytext/blob/master/demo/notebooks/xlm_r_tutorial.ipynb)
For the settings I used, the details are as follows (a rough training-loop sketch follows this list):
- Batch size: batch_size_per_gpu = 8, num_gpus = 8, gradient_accumulation_steps = 2, effective_batch_size = 8 * 8 * 2 = 128.
- We run validation after each epoch - where an epoch consists of 10K batches randomly sampled from the training set - and select the checkpoint with the best validation set result. This is quite important. In all, we run training for 30 epochs.
- We use Adam with an LR of 7.5e-6 (0.0000075), without any warmup or decay.
- max_seq_length is 256.
- We select the model with the best result on the validation set and then pick the final number on the test set by averaging the results from 5 runs.
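To make the batch-size arithmetic concrete, below is a minimal sketch of a gradient-accumulation loop in plain PyTorch. This is not the PyText code from the tutorial above; the model, data, and loss are toy stand-ins.

```python
# Minimal sketch: how the effective batch size of
# 8 (per GPU) * 8 (GPUs) * 2 (accumulation steps) = 128 comes about.
import torch
from torch import nn

PER_GPU_BATCH = 8
NUM_GPUS = 8
ACCUM_STEPS = 2
EFFECTIVE_BATCH = PER_GPU_BATCH * NUM_GPUS * ACCUM_STEPS  # = 128

model = nn.Linear(16, 3)                                     # stand-in for XLM-R + a 3-way XNLI head
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=7.5e-6)  # no warmup, no LR decay

# Fake micro-batches of per-GPU size; in the real run each GPU sees its own shard.
batches = [(torch.randn(PER_GPU_BATCH, 16), torch.randint(0, 3, (PER_GPU_BATCH,)))
           for _ in range(4)]

model.train()
for step, (inputs, labels) in enumerate(batches):
    loss = loss_fn(model(inputs), labels)
    (loss / ACCUM_STEPS).backward()        # average gradients over the accumulated micro-batches
    if (step + 1) % ACCUM_STEPS == 0:      # optimizer step every ACCUM_STEPS micro-batches
        optimizer.step()
        optimizer.zero_grad()
```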
Thanks for the link and details.
Thanks a lot @kartikayk! By the way, how about the XLM-R large model? Does it use the same hyperparameters as the base model?
@kartikayk Thanks for your reply, where can we find the latest checkpoint for XLMR-Base?
@kartikayk Hi! Thanks for the hyper-params for XLMR-Base.
For "Fine-tune multilingual model on English training set (Cross-lingual Transfer)" mentioned in table 1 in https://arxiv.org/pdf/1911.02116.pdf
What is the hyper-params XLMR-Large?
Also when you select the best model in this setup, did you use validation set
for all languages or only english
language?
Are there any suggested hyperparameters for fine-tuning XLM-Roberta-large on XNLI so that we can correctly reproduce the reported XNLI results? Thanks! @kartikayk
I also wonder why there is a big gap between the XLM-R paper and the XTREME paper in the reported results for XLM-R_large. @kartikayk
@yuchenlin @luofuli I also can't reproduce the xlm-roberta-large result under the cross-lingual setting. The best accuracy I can achieve is only 80.11%, which is almost 0.8% below the reported 80.9%. Have you found a set of hyperparameters that reproduces the result for xlm-roberta-large?
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Thanks for the impressive work of XLM-R.
Recently I found that the results on XNLI were updated: the average accuracy of XLM-R_base increased from 74.6 to 76.1.
The best result I can obtain is 74.6, by finetuning for 5 epochs with lr=1e-5, a batch size of 32, weight decay of 0.1, and 10% warmup.
I have also tried the suggestion by @kartikayk from Issue-1367, but it doesn't seem to work for me. I train the model with a batch size of 32 and 4-step gradient accumulation, 5K steps per epoch, and either a fixed lr=5e-6 or lr=5e-6 with linear decay and 10% warmup (a rough sketch of the scheduler setup is below). However, I cannot reach the 76.1 result. Maybe I'm missing some important details.
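For concreteness, here is a rough sketch of the warmup-plus-linear-decay variant, assuming the `transformers` scheduler helper; the epoch count and the dummy parameter are placeholders, not details from the paper.

```python
# Rough sketch of my warmup + linear-decay schedule; only the schedule shape matters here.
import torch
from transformers import get_linear_schedule_with_warmup

steps_per_epoch = 5_000                        # 5K optimizer steps per epoch, as above
num_epochs = 5                                 # assumption
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(0.1 * total_steps)          # first 10% of training

params = [torch.nn.Parameter(torch.zeros(1))]  # placeholder for model.parameters()
optimizer = torch.optim.Adam(params, lr=5e-6)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
# In the training loop: optimizer.step(), then scheduler.step(),
# once every 4 accumulated micro-batches.
```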
Could you provide more details or share your finetuning code?
Thanks.