Thanks for sharing these numbers. Are all models using the same dictionary? (If the UNK'ing is different, then the values cannot be directly compared.) Otherwise, yes it could be that large models overfit and don't generalize as well as smaller models here.
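As a toy illustration of the vocabulary point (a hypothetical unigram character model, not anyone's actual setup): collapsing rare characters into a single `<unk>` symbol can only lower the measured entropy, so BPC values computed over different character inventories are not on the same scale.

```python
import math
from collections import Counter

def unigram_bpc(chars, vocab=None):
    # Entropy (bits/char) of a unigram model fit to `chars`; characters
    # outside `vocab` are collapsed into a single <unk> symbol first.
    if vocab is not None:
        chars = [c if c in vocab else "<unk>" for c in chars]
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

# 10M-character sample of the raw corpus keeps memory modest.
text = open("enwik8", encoding="latin-1").read()[:10_000_000]

# Full character inventory vs. the 100 most frequent characters + <unk>:
top100 = {c for c, _ in Counter(text).most_common(100)}
print(f"full vocab:     {unigram_bpc(text):.3f} BPC")
print(f"100-char vocab: {unigram_bpc(text, vocab=top100):.3f} BPC")
```

Merging outcomes never increases entropy, so the model with the smaller vocabulary gets an apparent advantage for free.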
Good point! I don't know, but those results are published in papers. I may have made a mistake with the vocabulary while training Flair.
In the news item "Language Modeling on One GPU" at https://blog.deeplearning.ai/blog/the-batch-facebook-takes-on-deepfakes-google-ai-battles-cancer-researchers-fight-imagenet-bias-ai-grows-globally, they report bits per character (BPC) for an LSTM model with a single attention head (SHA-RNN).
I extended the results from the SHA-RNN author (https://github.com/Smerity/sha-rnn):
I have appended the Flair scores to the table above.
The script for fetching the enwik8 data is here: https://github.com/facebookresearch/adaptive-span/blob/a8d90b8a8481ef1ae50a73b696c290aa88d34744/get_data.sh#L3-L11
The dataset is small (86MB of training data).
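For anyone who doesn't want to run the shell script, here is a minimal Python sketch of the same preparation. It assumes the standard 90M/5M/5M byte split used by most enwik8 papers (90M bytes ≈ the 86MB of training data mentioned above); the exact steps in get_data.sh may differ.

```python
import os
import urllib.request
import zipfile

# Download enwik8: the first 100M bytes of an English Wikipedia XML dump.
url = "http://mattmahoney.net/dc/enwik8.zip"
if not os.path.exists("enwik8.zip"):
    urllib.request.urlretrieve(url, "enwik8.zip")

with zipfile.ZipFile("enwik8.zip") as zf:
    data = zf.read("enwik8")  # 100,000,000 bytes

# Assumed standard split: first 90M bytes train, next 5M valid, last 5M test.
train = data[:-10_000_000]
valid = data[-10_000_000:-5_000_000]
test = data[-5_000_000:]

for name, split in [("train.txt", train), ("valid.txt", valid), ("test.txt", test)]:
    with open(name, "wb") as f:
        f.write(split)
```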
* My initial experiment shows 1.064 BPC on the test data after 24 hours of training. Parameters: embedding size 100, hidden state size 2048, learning rate 5 (see the loss-to-BPC conversion sketch below).
** A further 12 hours of training with a smaller learning rate.
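For context on the numbers above (the loss value here is hypothetical, chosen only to show the arithmetic): if a trainer logs average per-character cross-entropy in nats, it converts to BPC by dividing by ln 2, so a loss of about 0.7375 nats/char corresponds to the 1.064 BPC reported above.

```python
import math

loss_nats = 0.7375             # hypothetical per-character cross-entropy in nats
bpc = loss_nats / math.log(2)  # nats -> bits: divide by ln 2
print(f"{bpc:.3f} BPC")        # -> 1.064 BPC
```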
Why does Flair achieve such good results compared to other models? Maybe the corpus is too small? Are the other models, which have more parameters, overfitting?