allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0
1.62k stars, 452 forks

What is the purpose of vocab.txt? #175

Closed DomHudson closed 5 years ago

DomHudson commented 5 years ago

Hi, thank you for this repository!

I was wondering what the purpose of vocab.txt is if the model is character-based. Why do we need an upfront vocabulary?

I notice that the readme says you can use different vocabularies for training and testing. I assume that if the training vocabulary is built from the tokens of both the training and holdout sets, the final evaluation perplexity will be misleading? Or will this not affect the evaluation?

Many thanks, Dom

DomHudson commented 5 years ago

Although the model takes character inputs, it is still a language model, so it predicts the next token given the previous tokens (which are fed in as characters). vocab.txt therefore defines the set of tokens the model can predict, and with it the length of the output vector.
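
To make the input/output asymmetry concrete, here is a minimal sketch in plain Python. It is not code from bilm-tf: the toy vocabulary, the helper names (`encode_input`, `encode_target`, `output_size`) and the use of raw UTF-8 bytes as character ids are all assumptions made for illustration. It only shows that any token can be encoded on the input side, while the prediction target must be an index into a fixed vocabulary.

```python
# Hypothetical toy vocabulary standing in for vocab.txt (one token per line).
# The special tokens and their order here are illustrative assumptions.
VOCAB = ["<S>", "</S>", "<UNK>", "the", "cat", "sat"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def encode_input(token):
    """Input side: any token, seen before or not, becomes a list of byte ids."""
    return list(token.encode("utf-8"))


def encode_target(token):
    """Output side: the target is an index into the fixed vocabulary.

    Tokens missing from the vocabulary fall back to the <UNK> index.
    """
    return TOKEN_TO_ID.get(token, TOKEN_TO_ID["<UNK>"])


def output_size():
    """Length of the output vector: one score per vocabulary entry."""
    return len(VOCAB)


sentence = ["the", "cat", "purred"]  # "purred" is not in the toy vocabulary

char_inputs = [encode_input(token) for token in sentence]
targets = [encode_target(token) for token in sentence]

print(char_inputs[1])  # byte ids for "cat"
print(targets)         # vocabulary indices, with <UNK> for "purred"
print(output_size())   # size of the prediction layer
```

In this sketch, "purred" is a perfectly valid input because it is just a sequence of characters, but as a prediction target it collapses to `<UNK>`, which is why the output side needs a vocabulary defined up front.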