datasize for production

daniel-kukiela / nmt-chatbot

NMT Chatbot

GNU General Public License v3.0


datasize for production #126

Open apakrash opened 5 years ago

apakrash commented 5 years ago

I tried training the bot on the 2015-05 conversation data. The answers were less than satisfactory. How many months of data were used for Charles v1/v2?

ghost commented 5 years ago

I'm pretty certain sentdex trained a model on around 50,000,000 pairs. One month of data is definitely not enough: the output shows some coherence, but it can be better.

apakrash commented 5 years ago

Do you recall how many months that was?

ghost commented 5 years ago

> Do you recall how many months that was?

I got 54,000,000 pairs from 9 months of Reddit comment data. The monthly files from recent years (2017 & 2018) are quite dense, so downloading 7-9 files should get you a decent number of pairs.

It all depends on how strict your filter is as well, though. I removed every comment containing a link and kept only comments of no more than 500 characters.
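
The two rules described above can be sketched roughly as follows. This is not the filter actually used in this thread, only a minimal illustration. The names `keep_comment` and `filter_pairs`, the link regex, and the choice to drop empty comments are all assumptions made for the example.

```python
import re

# Assumption: a "link" is anything containing http://, https://, or www.
LINK_RE = re.compile(r"(https?://|www\.)", re.IGNORECASE)
MAX_CHARS = 500  # length limit mentioned in the comment above


def keep_comment(body: str) -> bool:
    """Return True if the comment has no link and is at most MAX_CHARS long."""
    text = body.strip()
    if not text:
        # Assumption: empty comments are useless for training, so drop them.
        return False
    if len(text) > MAX_CHARS:
        return False
    return LINK_RE.search(text) is None


def filter_pairs(pairs):
    """Yield (parent, reply) pairs in which both sides pass keep_comment."""
    for parent, reply in pairs:
        if keep_comment(parent) and keep_comment(reply):
            yield parent, reply
```

A stricter or looser link pattern will change how many pairs survive, so the pair counts quoted in this thread should not be expected to reproduce exactly with this sketch.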

apakrash commented 5 years ago

thanks

SkullEnemyX commented 4 years ago

> > Do you recall how many months that was?
>
> I got 54,000,000 pairs from 9 months of Reddit comment data. The monthly files from recent years (2017 & 2018) are quite dense, so downloading 7-9 files should get you a decent number of pairs.
>
> It all depends on how strict your filter is as well, though. I removed every comment containing a link and kept only comments of no more than 500 characters.

Can you share your filter? It would probably save others a lot of time writing their own.