DataKind-BLR / PrathamBooks-Sprint-2018

Code and documentation for the collaboration with PrathamBooks during Sprint 2018
MIT License

POC TF-IDF For Stories #12

Closed githubssn closed 6 years ago

githubssn commented 6 years ago

POC for TF-IDF for initial English stories

heaven00 commented 6 years ago

@githubssn awesome! Just one small thing to add: do a punctuation removal step so that stopwords like "don't" actually get removed. At the moment it gets tokenized as 'don', '’', 't', so it never matches the stopword list.

githubssn commented 6 years ago

@heaven00 Thank you for your quick review. I had another look, and you are right. The issue you mention is in line 26, and I believe I can delete that line. The punctuation is handled in the subsequent step anyway: the RegExpr Tokenizer used on the following line eliminates the punctuation issue and gives us distinct tokens.
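For anyone reading along, here is a rough sketch of the tokenization issue discussed above. It is not the POC's code: it uses only Python's standard `re` module, and the function names, the regex pattern, and the tiny stopword set are all assumptions made for illustration. It contrasts a naive split (which breaks "don’t" into three tokens) with a regexp-based tokenizer that keeps contractions whole so they can match a stopword list.

```python
import re

# Hypothetical stopword set for illustration only; the POC's real list is not shown here.
STOPWORDS = {"don't", "the", "a", "to", "at"}

# Runs of word characters, optionally joined by a straight or curly apostrophe.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*")


def naive_tokenize(text):
    """Split into runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def regexp_tokenize(text):
    """Lowercase, keep contractions whole, drop standalone punctuation."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Normalize curly apostrophes so "don’t" matches the stopword "don't".
    return [t.replace("’", "'") for t in tokens]


def remove_stopwords(tokens):
    """Filter out any token that appears in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


sentence = "Don’t look at the moon!"
print(naive_tokenize(sentence))                     # contraction split into pieces
print(remove_stopwords(regexp_tokenize(sentence)))  # contraction removed as a stopword
```

A simpler pattern such as `\w+` would also get rid of the punctuation, but it would leave 'don' and 't' as separate tokens, so whether contractions need to stay whole depends on how the stopword list is written.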

heaven00 commented 6 years ago

:+1:

heaven00 commented 6 years ago

Looks good to me!

arnabbiswas1 commented 6 years ago

Thank you @githubssn and @heaven00 ! Merging.