Tradeshift / blayze

A fast and flexible Naive Bayes implementation for the JVM
MIT License

Ignore unseen words. #21

Closed rasmusbergpalm closed 5 years ago

rasmusbergpalm commented 5 years ago

There's no probabilistic motivation for doing this. However, in a couple of real-life cases we've seen that it helps a lot. Essentially, if the outcome distribution is very skewed, unseen words heavily favor the rarest outcomes, which leads to nonsensical predictions.
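
To illustrate the effect described above, here is a minimal sketch in Python. It is not blayze's code: it assumes a plain multinomial Naive Bayes with additive (Laplace) smoothing, and the function names, the `ignore_unseen` flag, and the toy counts are all made up for the example. Under that assumption, a word never seen in training gets probability `alpha / (total_words_in_class + alpha * vocab_size)`, which is larger for a class with few training words, so every unseen word pushes the score toward the rare outcome.

```python
import math

# Hypothetical toy training data: one very common outcome, one very rare one.
CLASS_DOC_COUNTS = {"invoice": 99, "spam": 1}
CLASS_WORD_COUNTS = {
    "invoice": {"invoice": 500, "payment": 400, "total": 100},
    "spam": {"win": 5, "prize": 5},
}


def log_scores(tokens, word_counts, doc_counts, alpha=1.0, ignore_unseen=False):
    """Unnormalized log posterior per class, multinomial NB with additive smoothing."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())

    scores = {}
    for cls, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[cls] / total_docs)  # class prior
        for token in tokens:
            if ignore_unseen and token not in vocab:
                # Skip words that never occurred in training for any class.
                continue
            numerator = counts.get(token, 0) + alpha
            denominator = total_words + alpha * len(vocab)
            score += math.log(numerator / denominator)
        scores[cls] = score
    return scores


def predict(tokens, ignore_unseen=False):
    scores = log_scores(
        tokens, CLASS_WORD_COUNTS, CLASS_DOC_COUNTS, ignore_unseen=ignore_unseen
    )
    return max(scores, key=scores.get)


# One strongly "invoice" word plus three words never seen in training.
TOKENS = ["invoice", "zzz1", "zzz2", "zzz3"]
```

With these toy counts, `predict(TOKENS)` picks the rare `"spam"` outcome, because each unseen word contributes `log(1/15)` to `"spam"` but `log(1/1005)` to `"invoice"`. `predict(TOKENS, ignore_unseen=True)` picks `"invoice"`, since only the one known word and the prior are scored.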