erksch / fnet-pytorch

Unofficial PyTorch implementation of Google's FNet: Mixing Tokens with Fourier Transforms. With checkpoints.

Reproducibility on GLUE #9

Open paul-grundmann opened 3 years ago

paul-grundmann commented 3 years ago

Hi, I am currently working on reproducing the results from the paper on the GLUE benchmark. However, my current results are very far from those in the paper. Have you already run experiments in this direction, or were you able to reproduce the scores?

I have a running implementation compatible with Huggingface if you want to try it out: https://github.com/paul-grundmann/transformers/blob/fnet/src/transformers/models/fnet/modeling_fnet.py

In my case, it seems that the model steadily learns on the masked language modeling task but does not improve on downstream tasks at all even after 200k pre-training steps.

erksch commented 3 years ago

No, I have not yet evaluated on downstream tasks, but it's definitely in the pipeline. Maybe I can get some runs going this weekend. But I did some fine-tuning on some private tasks and it did pretty well, so I don't think there will be many problems. What implementation are you using for fine-tuning on GLUE?

PS: As I see you are a fellow Berliner working on FNet, maybe we can connect outside of GitHub some time :)

erksch commented 3 years ago

Also, are you planning to contribute FNet to HuggingFace? I think this would also be a cool thing.

paul-grundmann commented 3 years ago

> As I see you are a fellow Berliner working on FNet, maybe we can connect outside of GitHub some time :)

Yes sure :)

My plan was to use the model for some downstream tasks with long documents. I just thought it would be easier to implement everything in the Huggingface ecosystem to leverage existing implementations of GLUE and such. But yes, if everything goes well, then of course it would be a great idea to contribute the model and source code to Huggingface.

For evaluation, I used the run_glue.py script from the examples with the following parameters:

```bash
python run_glue.py \
  --task_name qnli \
  --model_name_or_path ./fnet \
  --tokenizer_name bert-base-uncased \
  --output_dir glue \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 64 \
  --learning_rate 1e-5 \
  --num_train_epochs 3 \
  --dataloader_num_workers 8 \
  --max_seq_length 128
```

I tested SST2, CoLA, and QNLI, but the model did not improve on any of those tasks, neither with my custom pre-training scripts nor with Huggingface's run_mlm.py.

But of course I cannot rule out that it is due to my implementation...

erksch commented 3 years ago

I guess you are not using the official checkpoint converted to fit your Hugging Face model, since you also seem to use a different tokenizer. So I conclude that you ran pre-training from scratch. On what dataset? For how long? What was the MLM score? Maybe the model is just not trained enough to handle fine-tuning.

erksch commented 3 years ago

I just ran SST2 from the FNet base checkpoint converted to PyTorch and it learned pretty smoothly. But I only got 0.89 validation accuracy as opposed to the 0.95 stated in the paper for FNet base.

Epochs: 3, learning rate: 1.5e-5, batch size: 16 for all sets.
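
For reference, the invocation probably looked roughly like the run_glue.py call above with those hyperparameters (the checkpoint path is a placeholder, and my actual fine-tuning setup may differ):

```bash
python run_glue.py \
  --task_name sst2 \
  --model_name_or_path ./fnet-base-converted \
  --output_dir glue-sst2 \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --learning_rate 1.5e-5 \
  --num_train_epochs 3
```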

paul-grundmann commented 3 years ago

Hi Erik, I just ran some internal benchmarks on a custom pre-trained FNet base (12 layers). In doing so, I realized that I had forgotten the attention mask in my implementation and that I had padded inputs in both my pre-training and my downstream tasks. So I adjusted my implementation to simply multiply the attention mask with the embeddings in the Fourier layer. This seems to be working.

GLUE is still significantly worse than with a normal BERT base, but the results are no longer purely random (~84% accuracy on SST2, ~11% correlation on CoLA). In our internal retrieval benchmarks it is also 50%-75% as good as BERT, depending on the task.

If you want to play around with it, I can send you the weights; they should work with the Huggingface implementation. The model was pre-trained for 125k steps with a learning rate of 7e-4 and a batch size of 2048 on an English Wikipedia and PubMed corpus.
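
For anyone reading along, here is a minimal sketch of that masking fix as I understand it (the module name and shapes are assumptions, not the actual code from the linked fork):

```python
import torch
import torch.nn as nn

class FourierMixingWithMask(nn.Module):
    """FNet token-mixing sublayer: real part of a 2D FFT applied over the
    hidden and sequence dimensions, with padded positions zeroed out first
    so padding embeddings do not contribute to the mixing."""

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden)
        # attention_mask: (batch, seq_len), 1 = real token, 0 = padding
        hidden_states = hidden_states * attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        # FFT over the hidden dimension, then the sequence dimension; keep the real part as in the paper
        return torch.fft.fft(torch.fft.fft(hidden_states, dim=-1), dim=-2).real
```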