MLblog / jads_kaggle

Contains our group's work in various kaggle competitions
MIT License

Some EDA and basic preprocessing of Quora text for deep learning #116

Closed joepvdbogaert closed 5 years ago

joepvdbogaert commented 5 years ago

Performs simple preprocessing steps in a reusable function (hence I put it in the common directory).

  1. Make the text lower case.
  2. Replace shorthand phrases with their full form (e.g., {"won't": "will not"}).
  3. Remove the remaining punctuation.
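
For readers without the code at hand, the three steps could look roughly like the sketch below. This is not the function from this PR: the name `preprocess_text`, the `CONTRACTIONS` mapping, and its four entries are hypothetical and only illustrate the idea.

```python
import re
import string

# Hypothetical, deliberately tiny mapping (keys must be lower case).
# The dictionary actually used in the PR is not shown here.
CONTRACTIONS = {
    "won't": "will not",
    "can't": "cannot",
    "i'm": "i am",
    "it's": "it is",
}

# Map every ASCII punctuation character to a space so words do not get glued together.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess_text(text, contractions=None):
    """Lower-case, expand shorthand phrases, then strip remaining punctuation."""
    mapping = CONTRACTIONS if contractions is None else contractions

    # Step 1: lower case.
    text = text.lower()

    # Step 2: expand shorthand phrases. Longest keys go first in the alternation
    # so that a longer phrase is preferred over a shorter one it contains.
    if mapping:
        keys = sorted(mapping, key=len, reverse=True)
        pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")
        text = pattern.sub(lambda m: mapping[m.group(0)], text)

    # Step 3: remove remaining punctuation and collapse repeated whitespace.
    text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())


print(preprocess_text("Why won't it work?!"))
```

Note that the order matters: shorthand phrases have to be expanded before punctuation is removed, otherwise the apostrophe is gone and the lookup no longer matches. Anything missing from the mapping (for example "isn't" here) ends up split into "isn t", which is one of the issues to keep in mind.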

The function can easily be extended with more steps if desired. In any case, it is a good starting point.

I included a notebook that shows how to use it and walks through some examples of issues we need to consider. I also added a notebook with high-level EDA and a look into question length (which will be relevant for further preprocessing of the text).
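
As a rough idea of what a look into question length can give us, here is a small standard-library sketch. It is not taken from the notebook. It assumes that length is measured in whitespace-separated words and that the goal is to pick a cutoff for padding or truncating sequences later on; `length_summary` and its `quantile` parameter are hypothetical names.

```python
import math


def length_summary(questions, quantile=0.95):
    """Summarize question lengths in words and suggest a cutoff length.

    The cutoff is the nearest-rank quantile: the smallest length such that
    at least `quantile` of the questions are no longer than it.
    """
    lengths = sorted(len(q.split()) for q in questions)
    if not lengths:
        raise ValueError("questions must not be empty")

    rank = max(1, math.ceil(quantile * len(lengths)))
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": sum(lengths) / len(lengths),
        "cutoff": lengths[rank - 1],
    }


# Made-up example questions, only to show the call.
sample = [
    "how do i learn python",
    "why",
    "what is the best way to preprocess text for deep learning",
    "is kaggle useful",
]
print(length_summary(sample, quantile=0.5))
```

Questions longer than the cutoff would be truncated and shorter ones padded, so a high quantile keeps most questions intact without letting a few very long ones dictate the sequence length.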