Element-Research / rnn

Recurrent Neural Network library for Torch7's nn
BSD 3-Clause "New" or "Revised" License
939 stars 313 forks

recurrent-language-model.lua oov assertion fails if "valid"/"test" have oov words #374

Closed by tastyminerals 7 years ago

tastyminerals commented 7 years ago

Hi, I am currently working with recurrent-language-model.lua and I noticed that the dl.text2tensor function fails its oov assertion when either of the Penn Treebank files "ptb.valid.txt" or "ptb.test.txt" contains OOV words. I also added a custom OOV word to "ptb.train.txt" to see whether dl.buildVocab handles this case, but the oov counter stayed at 0. I hope I am not mistaking intended behaviour for a bug. If it is a bug, then recurrent-language-model.lua won't work with any data other than ptb, which is split so that neither the "valid" nor the "test" sample contains OOV words.

tastyminerals commented 7 years ago

After looking more closely at the code, I see that one can pass the minfreq parameter to dl.buildVocab and set it to 2, for example. This would allow "valid" or "test" to contain OOV words.
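
To make the behaviour above concrete, here is a minimal sketch in Python (not Lua) of how a frequency threshold can decide whether an OOV entry exists at all. It is only a simplified model of what is described in this thread, not the dl library's code: `build_vocab`, `text_to_ids`, the `<OOV>` token name and the toy sentences are all hypothetical stand-ins.

```python
from collections import Counter

OOV = "<OOV>"  # assumed name for the out-of-vocabulary token


def build_vocab(tokens, min_freq=1):
    """Hypothetical stand-in for a vocabulary builder with a frequency cut-off.

    Words seen at least min_freq times get their own 1-based id. Rarer words
    are pooled into one OOV entry, which is created only if at least one
    word falls below the threshold.
    """
    freq = Counter(tokens)
    vocab = {}
    oov_count = 0
    for word in sorted(freq):  # sorted for a deterministic id order
        if freq[word] >= min_freq:
            vocab[word] = len(vocab) + 1
        else:
            oov_count += freq[word]
    if oov_count > 0:
        vocab[OOV] = len(vocab) + 1
    return vocab, oov_count


def text_to_ids(tokens, vocab):
    """Hypothetical stand-in for converting tokens to ids with an OOV fallback."""
    ids = []
    for word in tokens:
        if word not in vocab:
            # Models the failing assertion: an unknown word needs an OOV id.
            assert OOV in vocab, "unknown word %r and no OOV entry" % word
            word = OOV
        ids.append(vocab[word])
    return ids


train = "the cat sat on the mat the cat".split()
valid = "the dog sat".split()  # "dog" never occurs in train

# With min_freq=1 every training word is kept, so no OOV entry is created.
vocab1, oov1 = build_vocab(train, min_freq=1)

# With min_freq=2 the words seen once are pooled into the OOV entry.
vocab2, oov2 = build_vocab(train, min_freq=2)
valid_ids = text_to_ids(valid, vocab2)
```

In this model, a new word added to the training text with a threshold of 1 simply becomes a regular vocabulary entry, so the OOV count stays at 0 and any unseen word in "valid" or "test" trips the assertion. Raising the threshold to 2 creates the OOV entry that unseen words can fall back on, at the cost of also mapping every training word seen only once to OOV.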