Palashio / libra

Ergonomic machine learning for everyone.
http://libradocs.org/
MIT License
1.92k stars 109 forks

Add text preprocessing for structured data #184

Closed jbofill10 closed 4 years ago

jbofill10 commented 4 years ago

Also renames mca_threshold to ca_threshold in various places.


Add text preprocessing for structured data

Applies NLP techniques to specified columns within structured data and creates a scalar value from the sum of the TF-IDF vector.

This pull request closes #93

- What I did
  - Used the NLTK library to tokenize, filter stop words, and convert each word to its respective lemma
  - Used autocorrect's speller to fix misspellings
  - Used sklearn's TF-IDF vectorizer to give more weight to the rarer words
- How I did it
  I coded
- How to verify it
  Include the text-based columns that you do not want one-hot encoded in the text param for the different models
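For reviewers who want a feel for the idea, here is a minimal sketch of reducing a text column to one scalar per row by summing TF-IDF weights. It is not the code in this PR: it uses only the standard library, swaps in a hand-rolled TF-IDF for sklearn's vectorizer (so the values will not match sklearn's), and skips the NLTK lemmatization and autocorrect steps. The names `STOP_WORDS`, `tokenize`, and `tfidf_sums` are hypothetical, as is the tiny stop-word list.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; the PR relies on NLTK's stop words instead.
STOP_WORDS = {"a", "an", "the", "is", "and", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf_sums(documents):
    """Return one scalar per document: the sum of its TF-IDF weights."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    sums = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = 0.0
        for term, count in counts.items():
            # Assumption: term frequency is normalized by document length.
            tf = count / len(tokens)
            # Smoothed IDF: rarer terms weigh more, and no term gets zero weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            total += tf * idf
        sums.append(total)
    return sums


# Stand-in for one text column of a structured dataset.
reviews = [
    "the food is good",
    "the food is bad and the service is slow",
    "good",
]
scores = tfidf_sums(reviews)
```

Each element of `scores` would replace the corresponding text cell, so the column becomes numeric rather than being one-hot encoded. Rows made up of rarer words end up with larger values than rows made up of common ones.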

This pull request adds a new feature to Libra. @Palashio, could you please take a look at it?