allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0

Separate Training Files vs Single Training File #200

Closed agemagician closed 5 years ago

agemagician commented 5 years ago

Hello,

Does it make any difference whether the training text is in a single file or in separate files? Does the internal state reset itself between different text files?

I tried training a model with separate files versus a single file, and so far the perplexity is much lower with separate files.

matt-peters commented 5 years ago

It shouldn't make a difference whether you train on one file or many, assuming that (1) they are prepared in the same manner (e.g. the single file was generated with something like cat separate_files* > single_file.txt), (2) you specify all of the separate files for training, and (3) you are careful to keep the heldout validation/test files separate from the training files.
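For anyone who wants to run the same comparison, here is a minimal Python sketch of the preparation step described above. It is not part of bilm-tf: the helper names `concatenate_shards` and `overlapping_files` and the file patterns are hypothetical. The first helper does roughly what the `cat` command does (shards are joined in sorted path order, which may differ from the shell's locale-dependent ordering), and the second lists any file that is matched by both the training and the heldout pattern.

```python
import glob
import os
import shutil


def concatenate_shards(pattern, out_path):
    """Join every file matching `pattern`, in sorted path order, into `out_path`.

    Hypothetical helper, roughly equivalent to `cat pattern > out_path`.
    Returns the list of shard paths that were written.
    """
    out_abs = os.path.abspath(out_path)
    # Resolve the shard list before creating the output file, and skip the
    # output itself in case it happens to match the pattern.
    shard_paths = sorted(
        p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs
    )
    with open(out_path, "wb") as out_file:
        for path in shard_paths:
            with open(path, "rb") as shard:
                # Copy raw bytes so the text is not re-encoded or altered.
                shutil.copyfileobj(shard, out_file)
    return shard_paths


def overlapping_files(train_pattern, heldout_pattern):
    """Return the paths matched by both patterns (hypothetical helper).

    An empty list means no heldout file is also used for training.
    """
    train = set(glob.glob(train_pattern))
    heldout = set(glob.glob(heldout_pattern))
    return sorted(train & heldout)
```

A call such as `concatenate_shards("separate_files*", "single_file.txt")` would build the single training file, and `overlapping_files("separate_files*", "heldout*")` should return an empty list before either variant is trained. If the perplexities still differ after both checks, the difference probably comes from something other than the file layout.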