dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

[BERT] Reproduce BERT on SWAG #599

Open eric-haibin-lin opened 5 years ago

eric-haibin-lin commented 5 years ago

Does anyone want to help reproduce the result reported by BERT on the grounded commonsense inference task (SWAG)?

eric-haibin-lin commented 5 years ago

Suggestion from the author regarding the implementation: https://github.com/google-research/bert/issues/38
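For reference, the paper's recipe for SWAG is to build four input sequences (the context as sentence A, each candidate ending as sentence B), score the pooled [CLS] representation of each with a single shared vector, and softmax over the four scores. A minimal Gluon sketch of that head, assuming gluonnlp's `get_model` returns `(sequence_output, pooled_output)` when only the pooler is enabled; the class name, dropout value, and initialization below are illustrative, not the final script:

```python
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp

class BERTForSWAG(gluon.Block):
    """Score each candidate ending with a single shared linear layer
    applied to the pooled [CLS] representation, as in the BERT paper."""
    def __init__(self, bert, num_choices=4, dropout=0.1, **kwargs):
        super(BERTForSWAG, self).__init__(**kwargs)
        self.bert = bert
        self.num_choices = num_choices
        with self.name_scope():
            self.dropout = gluon.nn.Dropout(dropout)
            self.scorer = gluon.nn.Dense(1)

    def forward(self, inputs, token_types, valid_length):
        # inputs, token_types: (batch, num_choices, seq_len)
        # valid_length: (batch, num_choices)
        batch_size, _, seq_len = inputs.shape
        # Fold the choice dimension into the batch dimension so the encoder
        # sees batch * num_choices independent sequences.
        _, pooled = self.bert(inputs.reshape((-1, seq_len)),
                              token_types.reshape((-1, seq_len)),
                              valid_length.reshape((-1,)))
        scores = self.scorer(self.dropout(pooled))              # (batch * num_choices, 1)
        return scores.reshape((batch_size, self.num_choices))   # logits over the endings

bert, vocab = nlp.model.get_model('bert_12_768_12',
                                  dataset_name='book_corpus_wiki_en_uncased',
                                  pretrained=True, use_pooler=True,
                                  use_decoder=False, use_classifier=False)
model = BERTForSWAG(bert)
model.scorer.initialize(mx.init.Normal(0.02))
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()  # label is the index (0-3) of the correct ending
```

Since the label is just the index of the correct ending, the standard softmax cross-entropy loss applies directly to the four logits.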

omerarshad commented 5 years ago

Still not able to get 81.6 on the base model; all I can get is 80.16.

eric-haibin-lin commented 5 years ago

Hi @omerarshad, thanks for looking into this! What is your setup and which hyper-parameters did you try?

omerarshad commented 5 years ago

learning rate 2e-5, batch size 16, model bert-base-uncased, gradient accumulation 4

eric-haibin-lin commented 5 years ago

Hi @omerarshad sorry for the late reply. I noticed that in the paper the author used the following hyper-parameters:

We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16.

In your case, since you used batch_size=16 and grad_accumulation=4, your effective batch size is 64. What about setting epochs=3, batch_size=16, accumulate=1 and trying again?
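For what it's worth, the effective batch size is batch_size * accumulate, which is why 16 * 4 behaves like 64. A rough sketch of the accumulation logic, with `model`, `loss_fn`, `trainer`, and `dataloader` as placeholders rather than the actual variables in the finetuning script:

```python
from mxnet import autograd

batch_size, accumulate = 16, 4
effective_batch_size = batch_size * accumulate   # 64 with accumulate=4, 16 with accumulate=1

params = model.collect_params()
if accumulate > 1:
    # Gradients are overwritten on each backward() by default, so switch to
    # summation when accumulating over several mini-batches.
    params.setattr('grad_req', 'add')

for i, (inputs, token_types, valid_length, label) in enumerate(dataloader):
    with autograd.record():
        logits = model(inputs, token_types, valid_length)
        loss = loss_fn(logits, label).mean()
    loss.backward()
    if (i + 1) % accumulate == 0:
        trainer.step(accumulate)   # normalize by the number of accumulated batches
        params.zero_grad()
```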

Nonetheless, 80.16 is a good starting point and is already much better than the ELMo baseline. Hopefully we can get your work into GluonNLP :)

eric-haibin-lin commented 5 years ago

@omerarshad any luck?

omerarshad commented 5 years ago

80.69 was the last I achieved, a week ago.

szha commented 5 years ago

@omerarshad thanks! What were the hyperparameters?

eric-haibin-lin commented 5 years ago

@omerarshad did you check the average length of the corpus against the max length used for fine-tuning? I think the default in the script is 128.
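A quick way to check is to tokenize each (context, ending) pair and look at the length distribution before picking max_seq_length. Rough sketch below; the CSV path and the sent1/sent2/ending0-3 column names are assumptions about the SWAG release, so adjust as needed:

```python
import csv
import numpy as np
import gluonnlp as nlp

# Vocab for the uncased base model; pretrained weights are not needed here.
_, vocab = nlp.model.get_model('bert_12_768_12',
                               dataset_name='book_corpus_wiki_en_uncased',
                               pretrained=False)
tokenizer = nlp.data.BERTTokenizer(vocab, lower=True)

lengths = []
with open('swagaf/data/train.csv') as f:              # assumed path to the SWAG training CSV
    for row in csv.DictReader(f):
        context = tokenizer(row['sent1'] + ' ' + row['sent2'])
        for key in ('ending0', 'ending1', 'ending2', 'ending3'):
            ending = tokenizer(row[key])
            # [CLS] context [SEP] ending [SEP] adds 3 special tokens
            lengths.append(len(context) + len(ending) + 3)

lengths = np.array(lengths)
print('mean %.1f, 95th percentile %d, max %d'
      % (lengths.mean(), np.percentile(lengths, 95), lengths.max()))
```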

mingtop commented 4 years ago

I got 80.96 on bert-base (config: bert-base-lr_5e-05-trEpo5.0-MaxLen80-TrBN16-savStp_1000calGradiStp_2-warm_0). What is the baseline for the BERT-BASE model? Any reference for 81? @szha @omerarshad Thanks

szha commented 4 years ago

@mingtop thanks for sharing. The original BERT paper reported 81.6 on the dev set in Table 4.