all-contributors / ac-learn

ML platform for all contributors
MIT License

Pre-processing improvements #12

Closed Berkmann18 closed 4 years ago

Berkmann18 commented 5 years ago

At the moment, the feature extractor is essentially just an NGramsOfWords-like function, but it previously outperformed the extract() function from ./extract (which used a lemmatizer). That said, I think the feature extractor could be improved by adding a stemming/lemmatization step, as well as a normalisation step like limdu.features.LowerCaseNormalizer.
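
As a rough illustration of the idea above, here is a minimal sketch of a feature extractor that normalises, stems and then builds word n-grams. Everything in it is an assumption made for illustration: `normalise`, `naiveStem`, `ngrams` and `featureExtractor` are hypothetical names that are not taken from ac-learn or limdu, and the suffix stripper is a deliberately crude stand-in for a real stemmer or lemmatizer. The `(sample, features)` signature is meant to mirror the shape of a custom limdu feature extractor, which should be checked against the limdu documentation before relying on it.

```javascript
// Hypothetical helpers for illustration; none of these names exist in ac-learn or limdu.

// Normalisation step: lower-case, replace punctuation with spaces, split on whitespace.
const normalise = (text) =>
  text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")
    .trim()
    .split(/\s+/)
    .filter(Boolean);

// Deliberately naive suffix stripper standing in for a real stemmer/lemmatizer.
// It only removes the first matching suffix and keeps at least 3 characters of the word.
const SUFFIXES = ["ing", "ed", "es", "s"];
const naiveStem = (word) => {
  for (const suffix of SUFFIXES) {
    if (word.length > suffix.length + 2 && word.endsWith(suffix)) {
      return word.slice(0, -suffix.length);
    }
  }
  return word;
};

// Word n-grams: every run of n consecutive tokens, joined by a space.
const ngrams = (tokens, n) => {
  const grams = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    grams.push(tokens.slice(i, i + n).join(" "));
  }
  return grams;
};

// Extractor: fills `features` with binary unigram and bigram features
// computed on the normalised, stemmed tokens, and returns it.
const featureExtractor = (sample, features = {}) => {
  const tokens = normalise(sample).map(naiveStem);
  for (const n of [1, 2]) {
    for (const gram of ngrams(tokens, n)) {
      features[gram] = 1;
    }
  }
  return features;
};
```

With this sketch, `featureExtractor("Fixed the failing tests!")` produces the unigrams `fix`, `the`, `fail`, `test` plus the three bigrams built from them. Whether this actually beats the current extractor would need to be measured, given that the lemmatizer-based extract() did worse before.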

Another thing to consider would be to get rid of useless instances categorised as null.