Closed segurac closed 7 years ago
Thank you for your contribution. It looks good, but I won't have any time this weekend to review it. I'll try to merge it on Monday.
I applied two minor modifications:
- increaseTrainingPairs is now the default behavior (as before), and the increaseTrainingPairs flag is renamed skipLines
- vocabularySize is set to 0
There are two features in this pull request. The first is the flag "increaseTrainingPairs". When it is set, consecutive lines in the database are used as input and target (as before). When it is not set, the data is read in steps of two lines, so the first line of each pair is always the input and the second is always the target. This is necessary if you want the bot to have some personality, for example when training on the lines of a particular movie character.
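To make the difference between the two modes concrete, here is a minimal sketch, assuming the corpus is already loaded as a flat list of lines. The function name make_pairs and its parameter are hypothetical names chosen for illustration, not the code from this pull request.

```python
def make_pairs(lines, increase_training_pairs=True):
    """Build (input, target) pairs from a flat list of lines.

    Hypothetical helper for illustration only.
    - increase_training_pairs=True: every line is paired with the next one,
      so each inner line is used once as a target and once as an input.
    - increase_training_pairs=False: lines are read two at a time, so the
      first line of each pair is always the input and the second the target.
    """
    step = 1 if increase_training_pairs else 2
    # Stop at len(lines) - 1 so that lines[i + 1] always exists.
    return [(lines[i], lines[i + 1]) for i in range(0, len(lines) - 1, step)]


lines = ["q1", "a1", "q2", "a2"]
overlapping = make_pairs(lines)  # ("a1", "q2") is included as a pair
strict = make_pairs(lines, increase_training_pairs=False)  # only (q, a) pairs
```

With the flag unset, a file that alternates prompts and the character's replies only ever produces the character's lines as targets, which is what keeps the personality consistent.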
The second feature is a limit on the total vocabulary size. Without it, using the OpenSubtitles database in Spanish, I got around 400k words in the dictionary, which is far too many. While I could still play with the vocabFilter parameter, it is better to be able to limit the vocabulary size directly. I also remove the training samples where the target sentence contains an out-of-vocabulary word. It doesn't really make sense for the bot to output words