Feature/vocab limit

segurac commented 7 years ago

There are two features in this pull request. The first is the flag "increaseTrainingPairs". When it is set, consecutive lines in the database are used as input and target (as always). When it is not set, the data is read in steps of two lines, so the first line of each step is always the input and the second is always the target. This is necessary if you want the bot to have some personality, for example when training on a specific character from a movie.
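To illustrate the difference between the two modes, here is a minimal sketch. It is not the code from this pull request: the function name `extract_pairs` and the parameter `increase_training_pairs` are hypothetical, and a conversation is assumed to be a plain list of lines.

```python
def extract_pairs(lines, increase_training_pairs=True):
    """Build (input, target) pairs from an ordered list of lines.

    Hypothetical helper for illustration only.
    - increase_training_pairs=True: every pair of consecutive lines is used,
      so each line (except the first and last) is both a target and an input.
    - increase_training_pairs=False: lines are read in steps of 2, so
      even-indexed lines are always inputs and odd-indexed lines are
      always targets.
    """
    step = 1 if increase_training_pairs else 2
    return [(lines[i], lines[i + 1]) for i in range(0, len(lines) - 1, step)]


lines = ["a", "b", "c", "d"]
pairs_all = extract_pairs(lines)                                   # overlapping pairs
pairs_step = extract_pairs(lines, increase_training_pairs=False)   # disjoint pairs
```

With the flag unset, the second line of every step is the only one the model learns to produce, which is why the data has to be arranged so that the chosen character always speaks on those lines.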

The second feature is a limit on the total vocabulary size. Without it, using the OpenSubtitles database in Spanish, I got around 400k words in the dictionary, which is far too many. While I could still adjust the vocabFilter parameter, it is better to be able to limit the vocabulary size directly. I also remove the training samples where the target sentence contains an out-of-vocabulary word. It doesn't really make sense for the bot to output words
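A rough sketch of the idea, not the actual implementation in this pull request: keep only the most frequent words, then drop every sample whose target contains a word outside that vocabulary. The names `build_vocab`, `filter_pairs`, and the `<unknown>` token are hypothetical, sentences are assumed to be already tokenized into lists of words, and replacing out-of-vocabulary words on the input side with an unknown token is an assumption made for this example.

```python
from collections import Counter

UNKNOWN = "<unknown>"  # hypothetical placeholder token for this sketch


def build_vocab(pairs, vocab_size):
    """Return the set of the vocab_size most frequent words over all pairs."""
    counts = Counter(word for inp, tgt in pairs for word in inp + tgt)
    return {word for word, _ in counts.most_common(vocab_size)}


def filter_pairs(pairs, vocab):
    """Drop pairs whose target has an out-of-vocabulary word.

    Assumption: out-of-vocabulary words in the input are kept as UNKNOWN.
    """
    kept = []
    for inp, tgt in pairs:
        if any(word not in vocab for word in tgt):
            continue  # the bot should never be trained to output an unknown word
        kept.append(([w if w in vocab else UNKNOWN for w in inp], tgt))
    return kept


pairs = [
    (["hello", "world"], ["hello", "there"]),
    (["hello", "there"], ["hello", "friend"]),
    (["hi", "there"], ["hello", "there"]),
]
vocab = build_vocab(pairs, vocab_size=2)
filtered = filter_pairs(pairs, vocab)
```

In this toy data only "hello" and "there" survive the limit, so the second pair is discarded because its target contains "friend".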

Conchylicultor commented 7 years ago

Thank you for your contribution. It looks good, but I won't have any time this weekend to review it. I'll try to merge it on Monday.

Conchylicultor commented 7 years ago

I applied two minor modifications:

The increaseTrainingPairs behavior is now the default (as before), and the flag is renamed skipLines.
The word limit can be removed by setting vocabularySize to 0.

Conchylicultor / DeepQA

Feature/vocab limit #97