eric-haibin-lin opened this issue 5 years ago
Suggestion from the author regarding the implementation: https://github.com/google-research/bert/issues/38
Still not able to get 81.6 on the base model; all I can get is 80.16.
Hi @omerarshad, thanks for looking into this! What is your setup, and which hyper-parameters did you try?
- learning rate: 2e-5
- batch size: 16
- model: bert-base-uncased
- gradient accumulation: 4
Hi @omerarshad, sorry for the late reply. I noticed that in the paper the author used the following hyper-parameters:
We fine-tune the model for 3 epochs with a learning rate of 2e-5 and a batch size of 16.
In your case, since you used batch_size=16 and grad_accumulation=4, your effective batch size is 16 × 4 = 64, four times what the paper used. What about setting epochs=3, batch_size=16, accumulate=1 and trying again?
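For reference, here is a minimal sketch of how gradient accumulation scales the effective batch size in MXNet Gluon. The toy model, the optimizer choice, and the random data are purely illustrative, not GluonNLP's actual fine-tuning script:

```python
# Minimal gradient-accumulation sketch with MXNet Gluon (toy model; the real
# GluonNLP fine-tuning script differs in optimizer, schedule, and data pipeline).
import mxnet as mx
from mxnet import autograd, gluon

net = gluon.nn.Dense(2)          # stand-in for a BERT classifier head
net.initialize()
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
# 'add' makes backward() accumulate gradients instead of overwriting them
net.collect_params().setattr('grad_req', 'add')
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 2e-5})

batch_size, accumulate = 16, 4   # effective batch size = 16 * 4 = 64
for step in range(8):
    data = mx.nd.random.normal(shape=(batch_size, 8))
    label = mx.nd.random.randint(0, 2, shape=(batch_size,))
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    if (step + 1) % accumulate == 0:
        # normalize by the effective batch size, then reset the gradient buffers
        trainer.step(batch_size * accumulate)
        for param in net.collect_params().values():
            param.zero_grad()
```

With accumulate=1, each trainer.step sees exactly one 16-example batch, matching the paper's setting.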
Nonetheless, 80.16 is a good starting point and is much better than the ELMo baseline. Hopefully we can get your work into gluonnlp :)
@omerarshad any luck?
80.69 was the last result I achieved, a week ago.
@omerarshad thanks! What were the hyper-parameters?
@omerarshad did you check the average length of the corpus against the max sequence length used for fine-tuning? I think the default in the script is 128.
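If it helps, here is a quick way to check whether truncation at the default max length could matter. The file path and column names below assume the layout of SWAG's train.csv, and whitespace splitting only approximates BERT's WordPiece counts, which run somewhat longer:

```python
# Rough length check for SWAG examples (illustrative; the path and column names
# assume the standard swagaf train.csv, and whitespace tokens undercount
# relative to WordPiece tokenization).
import csv
import statistics

lengths = []
with open('swagaf/data/train.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # each SWAG example pairs a context (sent1 + sent2) with a candidate ending
        text = ' '.join([row['sent1'], row['sent2'], row['ending0']])
        lengths.append(len(text.split()))

print('mean tokens:', statistics.mean(lengths))
print('95th percentile:', sorted(lengths)[int(0.95 * len(lengths))])
print('max tokens:', max(lengths))
# If the vast majority of examples fit under the script's default max length
# of 128, truncation is unlikely to explain the accuracy gap.
```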
I got 80.96 on bert-base (run: bert-base-lr_5e-05-trEpo5.0-MaxLen80-TrBN16-savStp_1000calGradiStp_2-warm_0, i.e. lr=5e-5, 5 epochs, max length 80, train batch size 16, gradient accumulation 2, no warmup). What's the baseline for the BERT-base model? Is there a reference for 81? @szha @omerarshad Thanks
@mingtop thanks for sharing. The original BERT paper reported 81.6 on the dev set in Table 4.
Would anyone like to help reproduce the result reported by BERT on the grounded commonsense inference task (SWAG)?