Closed DomHudson closed 5 years ago
Hi, thank you for this repository!

I was wondering: what is the purpose of vocab.txt if the model is character-based? Why do we need an upfront vocabulary?

I also notice that the README says you can use different vocabularies for training and testing. If the training vocabulary is built from the tokens of both the training and holdout sets, will the final evaluation perplexity be misleading, or does this not affect the evaluation?

Many thanks,
Dom

Although the model takes character inputs, it is still a language model: it predicts the following token given the previous tokens (supplied as characters). vocab.txt therefore defines the vocabulary the model can predict, and hence the length of the output vector.
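A minimal Python sketch of the idea (hypothetical code, not this repository's actual implementation — the one-token-per-line vocab.txt format and the `<unk>` fallback are assumptions): the token vocabulary fixes the output dimension, and a held-out token that was never in the training vocabulary falls back to `<unk>`. Folding holdout tokens into the training vocabulary removes those `<unk>` events, which can flatter the evaluation perplexity.

```python
# Hypothetical sketch: how a vocab file defines the model's output space.
# Assumes vocab.txt contains one token per line (an assumption, not this
# repo's documented format).

def load_vocab(lines):
    """Map each token to an output index; reserve index 0 for <unk>."""
    vocab = {"<unk>": 0}
    for tok in lines:
        tok = tok.strip()
        if tok and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# Tokens seen in the training data only:
train_tokens = ["the", "cat", "sat"]
vocab = load_vocab(train_tokens)

# The softmax output vector has one score per vocab entry:
output_dim = len(vocab)  # 4 here, including <unk>

# A held-out token absent from training maps to <unk>. If it had been
# added to vocab.txt from the holdout set, the model would assign it a
# dedicated output slot, and test perplexity could look artificially better.
idx = vocab.get("dog", vocab["<unk>"])  # 0 (<unk>)
```

So if the vocabulary leaks tokens from the holdout set, the evaluation no longer measures how the model handles genuinely unseen words.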