dfdx / NaiveBayes.jl

Naive Bayes classifier
MIT License

Use MultinomialNB for iris classification (Int -> Float) #41

Open ayush1999 opened 5 years ago

ayush1999 commented 5 years ago

I'm trying to use MultinomialNB for iris classification, but it looks like fit for MultinomialNB expects X to be a Matrix{Int64}, which is not the case for my dataset (and similarly for the predict function). Wouldn't it be better to accept Float64, since that would be more generic?
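For context, roughly what I'm doing is the following (a minimal sketch; the constructor arguments are from memory and may not match the current API exactly):

```julia
using NaiveBayes, RDatasets

iris = dataset("datasets", "iris")
X = permutedims(Matrix(iris[:, 1:4]))   # 4 × 150 matrix of Float64 features
y = String.(iris.Species)

m = MultinomialNB(unique(y), 4)         # classes, number of features
fit(m, X, y)                            # fails: fit expects X::Matrix{Int64}
```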

dfdx commented 5 years ago

As far as I remember, my initial idea was that multinomial NB uses object counts as features, and counts can't be non-integer. But now I realize that might be too limiting, so I'll try to update the code and see whether the tests still pass.

ablaom commented 5 years ago

Perhaps I'm missing something, but I don't believe it makes sense to apply MultinomialNB to the iris data. As @dfdx implies, the input data for a MultinomialNB is categorical, not continuous.

dfdx commented 5 years ago

I tried Float64/Real for feature "counts" in the multinomial-nb-with-real-counts branch. Technically it works, but I'm still somewhat suspicious of the concept; at the very least I need to read through the code once more and check what real-valued counts may lead to.

ablaom commented 5 years ago

A related issue: it would be useful if the "regularisation" parameter alpha in the multinomial case were allowed to be continuous (Lidstone smoothing). Currently it must be discrete, I think. I understand that 0 < alpha < 1 is a common choice.
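For reference (my notation, not taken from the package), Lidstone smoothing estimates the per-class feature probabilities as

```latex
\hat{\theta}_{wc} = \frac{N_{wc} + \alpha}{N_c + \alpha V}
```

where N_{wc} is the total count of feature w in class c, N_c = \sum_w N_{wc}, and V is the number of features; alpha = 1 recovers Laplace smoothing, while 0 < alpha < 1 is Lidstone smoothing proper.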

dfdx commented 5 years ago

After several iterations of updates I now think it might be easier to rewrite the code entirely for Julia 1.x (the current version is essentially written for Julia 0.4 with compatibility adapters). This leads to another question: is keeping a separate repo even reasonable? If I were to re-implement the current algorithms in MLJModels, how would you estimate the effort for someone new to the MLJ infrastructure?

ablaom commented 5 years ago

Well, that would be awesome.

> If I were to re-implement the current algorithms in MLJModels, how would you estimate the effort for someone new to the MLJ infrastructure?

Well, pretty easy, I expect. We've already written the wrapping code for Multinomial and Gauss here. So, if you kept your API the same, this would be a copy-and-paste operation. There are some simplifications that could be made if you were working from scratch, but I'd be happy to forgo these. And in any case, I'd be happy to provide guidance.

(There is some dancing around in our wrap because the target is categorical and MLJ is fussy about preserving unseen classes in the pool of its (probabilistic) predictions. That is, the training target "election_winner" may not have "Green Party" among its training values, but if "Green Party" is in the target pool, then the predicted distribution must have "Green Party" in its support, with probability zero.)
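To illustrate the pool behaviour (a minimal sketch using CategoricalArrays; the variable is made up):

```julia
using CategoricalArrays

# The target pool declares a level that never appears in the training data:
election_winner = categorical(["Labour", "Tory", "Labour"],
                              levels=["Labour", "Tory", "Green Party"])

unique(election_winner)    # only "Labour" and "Tory" are observed
levels(election_winner)    # ["Labour", "Tory", "Green Party"]

# A probabilistic MLJ prediction must therefore assign a probability
# (possibly zero) to "Green Party", because it belongs to the pool.
```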

> This leads to another question: is keeping a separate repo even reasonable?

From our point of view, native implementations of the MLJ interface are preferred. One reason is to better isolate testing; another is to distribute responsibility for maintenance and documentation. Native implementation is simple: your NaiveBayes.jl module imports MLJBase, implements the API, and then you raise an issue at MLJRegistry requesting that we make your package accessible to all MLJ users (you don't need to do this to use NB with MLJ, only to make your models findable and loadable by MLJ users who don't know about your package). See the end of the guide for details.
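For concreteness, a native implementation boils down to something like the sketch below (field names, defaults, and the function bodies are placeholders, not actual NaiveBayes.jl code):

```julia
import MLJBase

# Hypothetical MLJ-facing model struct wrapping the multinomial NB.
mutable struct MultinomialNBClassifier <: MLJBase.Probabilistic
    alpha::Float64                        # smoothing parameter (illustrative)
end
MultinomialNBClassifier(; alpha=1.0) = MultinomialNBClassifier(alpha)

function MLJBase.fit(model::MultinomialNBClassifier, verbosity::Int, X, y)
    # ... train the underlying NaiveBayes.jl model on (X, y) here ...
    fitresult = nothing                   # placeholder for the trained model
    cache, report = nothing, NamedTuple()
    return fitresult, cache, report
end

function MLJBase.predict(model::MultinomialNBClassifier, fitresult, Xnew)
    # ... return a vector of UnivariateFinite distributions over the
    # target pool, one per observation in Xnew ...
end
```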

Have you given any thought to how to handle sparse data? (For the sake of expediency, our wrap just converts the generic tabular input provided by the MLJ user into a matrix and feeds that to NB.) If I remember correctly, you presently allow dictionary input, but I guess there are better alternatives by now?
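For what it's worth, the conversion in our wrap amounts to something like this (assuming MLJBase is loaded; Xtable stands for whatever table the user supplies):

```julia
import MLJBase

# Convert the user's generic table into a dense matrix,
# with rows as observations and columns as features:
Xmatrix = MLJBase.matrix(Xtable)

# If the NB fit expects features × observations, transpose first:
Xnb = permutedims(Xmatrix)
```

So anything sparse gets densified before it reaches NB.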