Thanks for sharing these numbers. Are all models using the same dictionary? (If the UNK'ing is different, then the values cannot be directly compared.) Otherwise, yes it could be that large models overfit and don't generalize as well as smaller models here.
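As a toy illustration of the vocabulary point (a hypothetical unigram character model, not anyone's actual setup): collapsing rare characters into a single `<unk>` symbol can only lower the measured entropy, so BPC values computed over different character inventories are not on the same scale.

```python
import math
from collections import Counter

def unigram_bpc(chars, vocab=None):
    # Entropy (bits/char) of a unigram model fit to `chars`; characters
    # outside `vocab` are collapsed into a single <unk> symbol first.
    if vocab is not None:
        chars = [c if c in vocab else "<unk>" for c in chars]
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# 10M-character sample of the raw corpus keeps memory modest.
text = open("enwik8", encoding="latin-1").read()[:10_000_000]

# Full character inventory vs. the 100 most frequent characters + <unk>:
top100 = {c for c, _ in Counter(text).most_common(100)}
print(f"full vocab:     {unigram_bpc(text):.3f} BPC")
print(f"100-char vocab: {unigram_bpc(text, vocab=top100):.3f} BPC")
```

Merging outcomes never increases entropy, so the model with the smaller vocabulary gets an apparent advantage for free.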
Good point! I don't know, but those results are published in papers. I may have made a mistake with the vocabulary while training Flair.
In the news item "Language Modeling on One GPU" at https://blog.deeplearning.ai/blog/the-batch-facebook-takes-on-deepfakes-google-ai-battles-cancer-researchers-fight-imagenet-bias-ai-grows-globally, they report bits per character (BPC) for an LSTM model with a single attention head (SHA-RNN).
I extended the results from the SHA-RNN author (https://github.com/Smerity/sha-rnn):
I have appended the Flair scores to the table above.
The script for fetching the enwik8 data is here: https://github.com/facebookresearch/adaptive-span/blob/a8d90b8a8481ef1ae50a73b696c290aa88d34744/get_data.sh#L3-L11
The dataset is small (86MB of training data).
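For anyone who doesn't want to run the shell script, here is a minimal Python sketch of the same preparation. It assumes the standard 90M/5M/5M byte split used by most enwik8 papers (90M bytes ≈ the 86MB of training data mentioned above); the exact steps in get_data.sh may differ.

```python
import os
import urllib.request
import zipfile

# Download enwik8: the first 100M bytes of an English Wikipedia XML dump.
url = "http://mattmahoney.net/dc/enwik8.zip"
if not os.path.exists("enwik8.zip"):
    urllib.request.urlretrieve(url, "enwik8.zip")

with zipfile.ZipFile("enwik8.zip") as zf:
    data = zf.read("enwik8")  # 100,000,000 bytes

# Assumed standard split: first 90M bytes train, next 5M valid, last 5M test.
train = data[:-10_000_000]
valid = data[-10_000_000:-5_000_000]
test = data[-5_000_000:]

for name, split in [("train.txt", train), ("valid.txt", valid), ("test.txt", test)]:
    with open(name, "wb") as f:
        f.write(split)
```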
* My initial experiment shows 1.064 BPC on the test data after 24 hours of training. Parameters: embedding size 100, hidden state size 2048, learning rate 5 (see the loss-to-BPC conversion sketch below).
** A further 12 hours of training with a smaller learning rate.
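For context on the numbers above (the loss value here is hypothetical, chosen only to show the arithmetic): if a trainer logs average per-character cross-entropy in nats, it converts to BPC by dividing by ln 2, so a loss of about 0.7375 nats/char corresponds to the 1.064 BPC reported above.

```python
import math

loss_nats = 0.7375             # hypothetical per-character cross-entropy in nats
bpc = loss_nats / math.log(2)  # nats -> bits: divide by ln 2
print(f"{bpc:.3f} BPC")        # -> 1.064 BPC
```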
Why does Flair achieve such good results compared to other models? Maybe the corpus is too small? Are the other models, which have more parameters, overfitting?