facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

XNLI Results Reproduction of XLM-R #2057

Open MGithubGA opened 4 years ago

MGithubGA commented 4 years ago

Thanks for the impressive work of XLM-R.

Recently I found that the results on XNLI have been updated: the avg-acc of XLM-R_base increased from 74.6 to 76.1.

I can obtain the previous best result of 74.6 by fine-tuning for 5 epochs with lr=1e-5, a batch size of 32, weight decay 0.1, and 10% warmup.

I have also tried the suggestion by @kartikayk from Issue-1367, but it doesn't seem to work for me. I train the model with a batch size of 32 and 4-step gradient accumulation, 5K steps per epoch, and either a fixed lr=5e-6 or lr=5e-6 with linear decay and 10% warmup. However, I cannot obtain the result of 76.1. Maybe I am missing some important details.
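The linear-decay-with-warmup schedule described here can be sketched as follows. This is a minimal illustration, not fairseq's actual scheduler; the total of 25K steps assumes the 5 epochs × 5K steps mentioned in this comment:

```python
def lr_at_step(step, peak_lr=5e-6, total_steps=25_000, warmup_frac=0.1):
    """Linear warmup over the first 10% of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # ramp up from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # decay linearly from peak_lr back to 0 at the final step
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```

Dropping the decay branch and returning `peak_lr` after warmup gives the fixed-lr variant also mentioned above.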

Could you provide more details or your fine-tuning code?

Thanks.

lematt1991 commented 4 years ago

CC @kartikayk

MGithubGA commented 4 years ago

Thanks in advance for the reply! @lematt1991 @kartikayk

lixin4ever commented 4 years ago

I have the same reproduction problem with the following settings:

--learning_rate 5e-5
--batch_size 32
--n_gpu 3 (using DataParallel)
--max_steps 12000 (roughly 3 epochs)
--save_steps 2000
--warmup_steps 1200 (first 10% training steps)
--max_seq_length 128

The obtained result is 74.2 (averaged over 5 runs).
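As a back-of-the-envelope check on these settings (hedged arithmetic, assuming the ~393K-example MultiNLI English train set and a batch of 32 on each of the 3 GPUs, i.e. 96 examples per step):

```python
# Sanity check on the flags above (illustrative only, not fairseq code).
max_steps = 12_000
warmup_steps = int(0.1 * max_steps)   # first 10% of training steps

train_examples = 392_702              # MultiNLI (English) train set size
examples_per_step = 32 * 3            # assumes batch of 32 per GPU on 3 GPUs
steps_per_epoch = train_examples // examples_per_step
approx_epochs = max_steps / steps_per_epoch
```

Note that the "roughly 3 epochs" figure only works out if the batch of 32 is per GPU; if DataParallel splits 32 examples across the 3 GPUs, 12K steps is closer to a single epoch.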

jinoobaek-qz commented 4 years ago

Are there steps outlined to do the XNLI fine-tuning on xlm-r?

kartikayk commented 4 years ago

Hi! Apologies for the delayed response here, seems like I missed some questions. A couple of comments:

For the settings I used, following are the details:

  • Batch Size: batch_size_per_gpu = 8, num_gpus = 8, gradient_accumulation_steps = 2, effective_batch_size = 8 × 8 × 2 = 128
  • We run validation after each epoch - where the epoch consists of 10K batches with data randomly sampled from the training set - and select the checkpoint with the best validation set result. This is quite important. In all we run training for 30 epochs.
  • We use Adam with an LR of 0.0000075 without any warmup or decay.
  • max_seq_length is 256.
  • We select the model with the best result on the validation set and then pick the final number on the test set by averaging the results from 5 runs.
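The batch-size arithmetic in this recipe can be written out as follows. This is a hedged sketch in config form; the variable names are illustrative, not actual fairseq options:

```python
# Illustrative reconstruction of the reported fine-tuning setup.
batch_size_per_gpu = 8
num_gpus = 8
gradient_accumulation_steps = 2
# Gradients are accumulated over 2 micro-batches before each optimizer
# update, so each update effectively sees 8 * 8 * 2 examples.
effective_batch_size = batch_size_per_gpu * num_gpus * gradient_accumulation_steps

adam_lr = 7.5e-6            # fixed (i.e. 0.0000075): no warmup, no decay
max_seq_length = 256
epochs = 30                 # validate after each epoch, keep the best checkpoint
batches_per_epoch = 10_000  # each epoch samples 10K batches from the train set
```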

jinoobaek-qz commented 4 years ago

Thanks for the link and details.

MGithubGA commented 4 years ago

Thanks a lot @kartikayk ! By the way, what about the XLM-R large model? Does it use the same hyperparameters as the base model?

ruizewang commented 4 years ago

Hi! Apologies for the delayed response here, seems like I missed some questions. A couple of comments:

For the settings I used, following are the details:

  • Batch Size: batch_size_per_gpu = 8, num_gpus = 8, gradient_accumulation_steps = 2, effective_batch_size = 8 × 8 × 2 = 128
  • We run validation after each epoch - where the epoch consists of 10K batches with data randomly sampled from the training set - and select the checkpoint with the best validation set result. This is quite important. In all we run training for 30 epochs.
  • We use Adam with an LR of 0.0000075 without any warmup or decay.
  • max_seq_length is 256.
  • We select the model with the best result on the validation set and then pick the final number on the test set by averaging the results from 5 runs.

@kartikayk Thanks for your reply. Where can we find the latest checkpoint for XLMR-Base?

sbmaruf commented 4 years ago

Hi! Apologies for the delayed response here, seems like I missed some questions. A couple of comments:

For the settings I used, following are the details:

  • Batch Size: batch_size_per_gpu = 8, num_gpus = 8, gradient_accumulation_steps = 2, effective_batch_size = 8 × 8 × 2 = 128
  • We run validation after each epoch - where the epoch consists of 10K batches with data randomly sampled from the training set - and select the checkpoint with the best validation set result. This is quite important. In all we run training for 30 epochs.
  • We use Adam with an LR of 0.0000075 without any warmup or decay.
  • max_seq_length is 256.
  • We select the model with the best result on the validation set and then pick the final number on the test set by averaging the results from 5 runs.

@kartikayk Hi! Thanks for the hyperparameters for XLMR-Base. For the "Fine-tune multilingual model on English training set (Cross-lingual Transfer)" setting in Table 1 of https://arxiv.org/pdf/1911.02116.pdf, what are the hyperparameters for XLMR-Large? Also, when you select the best model in this setup, did you use the validation sets of all languages or only English?

luofuli commented 4 years ago

What's the difference between the latest XLM-R checkpoint and the first one?

And are the improvements in the second version of the results in the XLM-R paper (v1, v2) mainly from the pre-trained models, or from the fine-tuning strategies? @kartikayk

yuchenlin commented 3 years ago

Are there any suggested hyperparameters for fine-tuning XLM-Roberta-large on XNLI so that we can correctly reproduce the reported results? Thanks! @kartikayk

luofuli commented 3 years ago

I also wonder why there is a big gap between the XLM-R paper and the XTREME paper in the reported results for XLM-R_large. @kartikayk

ntudy commented 3 years ago

@yuchenlin @luofuli I also can't reproduce the xlm-roberta-large result under the cross-lingual setting. The best accuracy I can achieve is only 80.11%, almost 0.8% below the reported 80.9%. Have you found a set of hyperparameters that reproduces the result for xlm-roberta-large?

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!