Simple document classifier (AKA spam filter)

JuliaText / TextAnalysis.jl

Julia package for text analysis

Simple document classifier (AKA spam filter) #106

Closed: MikeInnes closed this 5 years ago

MikeInnes commented 5 years ago

julia> using TextAnalysis: SpamFilter, fit!, predict

julia> m = SpamFilter([:ham, :spam]);

julia> fit!(m, "this is ham", :ham);

julia> fit!(m, "this is spam", :spam);

julia> predict(m, "is this spam?")
Dict{Symbol,Float64} with 2 entries:
  :spam => 0.666667
  :ham  => 0.333333

This is a very simple, easy-to-use document classifier. I think it would make a nice entry point into the Julia ML ecosystem for many people, since it's widely useful and can serve as a starting point for any more complex models people want to try out.
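For readers wondering where the 0.667 / 0.333 split comes from: the numbers are consistent with a word-count naive Bayes model using add-one smoothing and equal class priors. The sketch below is a minimal standalone illustration under that assumption, not the code in this PR. The names `TinyNB`, `train!`, `classify`, and `tokens` are hypothetical, and the tokenizer is a simplification.

```julia
# Minimal naive Bayes sketch. All names here are hypothetical and are
# NOT part of the TextAnalysis API; this only illustrates the arithmetic.
struct TinyNB
    classes::Vector{Symbol}
    counts::Dict{Symbol,Dict{String,Int}}  # per-class word counts
    vocab::Set{String}                     # every word seen during training
end

TinyNB(classes) =
    TinyNB(classes, Dict(c => Dict{String,Int}() for c in classes), Set{String}())

# Assumed tokenizer: lowercase, split on anything that is not a letter or digit.
tokens(s) = String.(split(lowercase(s), r"[^a-z0-9]+"; keepempty=false))

function train!(m::TinyNB, text, class)
    for w in tokens(text)
        push!(m.vocab, w)
        m.counts[class][w] = get(m.counts[class], w, 0) + 1
    end
    return m
end

function classify(m::TinyNB, text)
    ws = filter(in(m.vocab), tokens(text))  # words never seen in training are ignored
    scores = Dict{Symbol,Float64}()
    for c in m.classes
        # Add-one smoothing: every vocabulary word gets one extra count per class.
        total = sum(values(m.counts[c])) + length(m.vocab)
        p = 1.0
        for w in ws
            p *= (get(m.counts[c], w, 0) + 1) / total
        end
        scores[c] = p  # equal class priors assumed, so no prior term
    end
    z = sum(values(scores))
    return Dict(c => s / z for (c, s) in scores)  # normalize to sum to 1
end

m = TinyNB([:ham, :spam])
train!(m, "this is ham", :ham)
train!(m, "this is spam", :spam)
p = classify(m, "is this spam?")
```

With these two training sentences each class has a smoothed total of 7 counts, so the query scores (2/7)(2/7)(1/7) for `:ham` and (2/7)(2/7)(2/7) for `:spam`, which normalize to 1/3 and 2/3.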

zgornel commented 5 years ago

Hi, do you think it would make sense to create another package, e.g. TextAnalysisModels.jl, where this particular model and other standard NLP/ML models could reside? The sentiment analysis model and LSA/LDA would fit there as well.

In a way, such a package would provide a link between pure text processing and representation, embeddings APIs and ML packages. Hopefully, that would also spur a bit more work and research into providing more specific processing support for text modelling.

TextAnalysis already seems to be a large package (in terms of scope), and a tightly coupled modeling package would (in my opinion) be a welcome addition, even if just to keep individual package complexity at a manageable level for all users.

MikeInnes commented 5 years ago

What would be the difference between TextAnalysis and TextAnalysisModels? How does one decide (or figure out, as a user) what lives where?

I don't really see that this adds any real complexity; things can always be split out later if there's a compelling reason to, but until then it just seems like fragmentation that users have to deal with. Much better to have a central point where available functionality is clearly visible.

zgornel commented 5 years ago

So no. Thanks.

aviks commented 5 years ago

So on the question of splitting out the models, I've sorta changed my mind and decided to do that: see #111

aviks commented 5 years ago

I've renamed this to NaiveBayesClassifier, and added a naive test. Could do with some documentation.