flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Let's calculate BPC for enwik8 dataset using Flair forward language model #1384

Closed djstrong closed 4 years ago

djstrong commented 4 years ago

In the news item "Language Modeling on One GPU" at https://blog.deeplearning.ai/blog/the-batch-facebook-takes-on-deepfakes-google-ai-battles-cancer-researchers-fight-imagenet-bias-ai-grows-globally they refer to bits per character (BPC) for an LSTM model with a single attention head (SHA-RNN).

I extended the results table from the SHA-RNN author (https://github.com/Smerity/sha-rnn):

| Model | Test BPC | Params | LSTM based |
| --- | --- | --- | --- |
| Krause mLSTM | 1.24 | 46M | ✓ |
| AWD-LSTM | 1.23 | 44M | ✓ |
| SHA-LSTM | 1.07 | 63M | ✓ |
| Flair \* | 1.0640 | 18M | ✓ |
| 12L Transformer-XL | 1.06 | 41M | |
| 18L Transformer-XL | 1.03 | 88M | |
| Adaptive Span Transformer (Small) | 1.02 | 38M | |
| Flair \*\* | 0.9977 | 18M | ✓ |
| Adaptive Span Transformer (Large) | 0.98 | 209M | |
| Transformer-XL + RMS dynamic eval + decay | 0.940 | 277M | |

I have added the Flair scores to the table above.

Script for getting enwik8 data is here: https://github.com/facebookresearch/adaptive-span/blob/a8d90b8a8481ef1ae50a73b696c290aa88d34744/get_data.sh#L3-L11

The dataset is small (86MB of training data).
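For reference, BPC is just the model's average cross-entropy per character expressed in bits. Assuming the language model reports its loss as natural-log cross-entropy per character (as PyTorch's `CrossEntropyLoss` does), the conversion is a one-liner; this is a minimal sketch, not Flair's own evaluation code:

```python
import math

def bpc_from_nll(nll_nats: float) -> float:
    """Convert average negative log-likelihood per character (in nats)
    to bits per character by changing the logarithm base to 2."""
    return nll_nats / math.log(2)

# A test-set loss of ~0.7375 nats/char corresponds to the ~1.064 BPC above.
print(bpc_from_nll(0.7375))  # ~1.064
```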

\* My initial experiment after 24 hours of training shows 1.064 BPC on the test data. Parameters: embedding size 100, hidden state size 2048, learning rate 5.

\*\* 12 more hours of training with a smaller learning rate.

Why does Flair achieve such good results compared to the other models? Maybe the corpus is too small? Are the other models overfitting because they have more parameters?

alanakbik commented 4 years ago

Thanks for sharing these numbers. Are all models using the same dictionary? (If the UNK'ing is different, the values cannot be directly compared.) Otherwise, yes, it could be that the large models overfit and don't generalize as well as the smaller models here.
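To make the vocabulary point concrete: more aggressive UNK'ing collapses rare characters into one symbol, which lowers the entropy of the prediction task and deflates BPC. A toy illustration (not tied to any particular model): for a uniform predictor, BPC is simply log2 of the vocabulary size, so shrinking the vocabulary makes the task strictly easier.

```python
import math

def uniform_bpc(vocab_size: int) -> float:
    # A uniform model assigns probability 1/V to every symbol,
    # so its cross-entropy is log2(V) bits per character.
    return math.log2(vocab_size)

# enwik8 is commonly reported to contain ~205 distinct byte values;
# UNK'ing it down to, say, 128 symbols lowers BPC even for this trivial model.
print(uniform_bpc(205))  # ~7.68
print(uniform_bpc(128))  # 7.0
```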

djstrong commented 4 years ago

Good point! I don't know; those results are published in papers. I may have made a mistake with the vocabulary while training Flair.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.