Tradeshift / blayze

A fast and flexible Naive Bayes implementation for the JVM
MIT License

Make text preprocessing minimal #26

Closed dadib closed 4 years ago

dadib commented 4 years ago

The benefit of how we currently do text preprocessing in the library is that input data can be passed in a rawer format. This is usually better for maintainability, since the feature transformation logic and its compatibility with the model don't need to be maintained separately. The downside is that we have little flexibility in how text is processed. If we want more aggressive processing, we can do it before passing the text to blayze, but we are now in a situation where blayze's own text processing is too aggressive and is hurting model performance. Since we already leave all other feature extraction and processing to the library user, we might as well leave text preprocessing to them too. This change makes blayze process text only minimally.
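As a rough illustration of the trade-off described above, the sketch below contrasts minimal processing with a more aggressive variant that a library user could apply before handing text to the classifier. The class and method names are hypothetical and are not part of blayze, and the exact rules (whitespace splitting, lower-casing, stripping punctuation) are assumptions chosen for the example, not the processing this change implements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, NOT part of blayze: illustrates caller-side text processing.
class TextPreprocessing {

    // Minimal processing (assumed rule): split on whitespace only,
    // keeping case and punctuation exactly as they appear in the input.
    static List<String> minimalTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) { // an empty input yields one empty string; skip it
                tokens.add(token);
            }
        }
        return tokens;
    }

    // More aggressive processing (assumed rules): lower-case the text and
    // replace every run of non-letter, non-digit characters with a space,
    // then fall back to the minimal whitespace split.
    static List<String> aggressiveTokens(String text) {
        String normalized = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ");
        return minimalTokens(normalized);
    }

    public static void main(String[] args) {
        String raw = "Invoice #42 is OVERDUE!";
        System.out.println(minimalTokens(raw));    // [Invoice, #42, is, OVERDUE!]
        System.out.println(aggressiveTokens(raw)); // [invoice, 42, is, overdue]
    }
}
```

With minimal processing in the library, a user who wants the second behaviour can apply it themselves, while a user whose model benefits from case or punctuation is no longer forced to lose that information.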

The serialization version has to be bumped, since the features stored in previously serialized models will not be correct under the new processing logic.
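To show why a version bump guards against this, here is a minimal sketch of a version check on load. Everything in it is an assumption made for illustration: the class name, the version numbers, and the byte layout are hypothetical and do not describe how blayze actually serializes models.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, NOT blayze's serialization format.
class VersionedModelIO {

    // Assumed value: bumped from 1 to 2 when the text processing changed.
    static final int SERIALIZATION_VERSION = 2;

    // Writes a version header followed by the (stand-in) model payload.
    static byte[] write(int version, String payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(version);
            out.writeUTF(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the payload, rejecting data written under a different version,
    // because its features were extracted with different processing rules.
    static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != SERIALIZATION_VERSION) {
                throw new IllegalStateException("Unsupported model version " + version
                        + ", expected " + SERIALIZATION_VERSION);
            }
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Failing loudly on a version mismatch is preferable to loading an old model silently, since its stored features would no longer match the tokens produced at prediction time.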