naive bayes with mixed distributions

rrichardson commented 8 years ago

I am pretty much a statistics and probability newb, so if there is a better way to solve this problem that doesn't involve this approach, I am all ears.

I have a fairly large data set (> 500k rows) of user demographic data, which I've decomposed into binary features (2 columns for gender, 50 columns for states**, etc.). In fact, the only feature that could be considered continuous is the age. I'm trying to predict a category of "hobby" for these people, which is one of 46 categories.

Doing a bit of research, it seems that I might be able to improve my results if I use a Bernoulli distribution for the binary features and a Gaussian (or perhaps another) distribution for the age. Would it be possible to set distributions by column in the naive Bayes trainer?

** Using states as features is not optimal; I need to decompose them into regions by behavior, but that's a different science project altogether (probably knn on lat/long :) )
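For readers following along, the per-column idea amounts to summing a different log-likelihood for each kind of feature. Below is a minimal sketch in plain Rust, not rusty-machine's API. The `ClassParams` struct and all function names are hypothetical, and it assumes the per-class parameters were already estimated from the training data (with smoothing, so every Bernoulli probability lies strictly between 0 and 1).

```rust
/// Per-class parameters for a mixed naive Bayes model.
/// Hypothetical layout, assumed to be estimated beforehand.
struct ClassParams {
    log_prior: f64,        // ln P(class)
    bernoulli_p: Vec<f64>, // P(feature_j = 1 | class), one per binary column
    age_mean: f64,         // mean age within the class
    age_var: f64,          // age variance within the class (must be > 0)
}

/// Log-probability of a binary feature under a Bernoulli(p) model.
fn bernoulli_log_pmf(x: bool, p: f64) -> f64 {
    if x { p.ln() } else { (1.0 - p).ln() }
}

/// Log-density of a continuous feature under a Gaussian model.
fn gaussian_log_pdf(x: f64, mean: f64, var: f64) -> f64 {
    -0.5 * ((2.0 * std::f64::consts::PI * var).ln() + (x - mean).powi(2) / var)
}

/// ln P(class) + sum of per-feature log-likelihoods (the "naive" independence assumption).
fn log_joint(c: &ClassParams, binary: &[bool], age: f64) -> f64 {
    let bernoulli_part: f64 = binary
        .iter()
        .zip(&c.bernoulli_p)
        .map(|(&x, &p)| bernoulli_log_pmf(x, p))
        .sum();
    c.log_prior + bernoulli_part + gaussian_log_pdf(age, c.age_mean, c.age_var)
}

/// Index of the class with the highest log joint score.
fn predict(classes: &[ClassParams], binary: &[bool], age: f64) -> usize {
    let mut best = 0;
    let mut best_score = f64::NEG_INFINITY;
    for (i, c) in classes.iter().enumerate() {
        let score = log_joint(c, binary, age);
        if score > best_score {
            best_score = score;
            best = i;
        }
    }
    best
}
```

Working in log space avoids underflow when multiplying dozens of small probabilities, which matters with 50+ binary columns.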

AtheMathmo commented 8 years ago

Hey! Thanks for this issue - it sounds like a really interesting project. Before I jump into a discussion, I should give a disclaimer. rusty-machine is an early-stage, experimental library. Though I've taken care to build what I think is a cool system, there are bound to be some growing pains, and you can expect sub-par performance in comparison to more mature, state-of-the-art libraries. The reason I give that disclaimer is that if this project is serious, you may want to look elsewhere (although I would love to try and support it with rusty-machine!).

Ok, onto the real stuff! Right now rusty-machine doesn't support what you're describing (as you could probably guess), and it would take quite a bit of work to get it there. I'm not sure that work would necessarily be the best idea, as (from my admittedly limited knowledge) I don't think many people use naive Bayes in this way - and it would complicate the existing model. For example, scikit-learn's naive Bayes implementation does not support this either.

You do have some options though:

- Adapt existing implementations to fit what you want (or find one that supports it).
- Filter your ages into categories, e.g. 0-20, 20-40, 40-60, 60+ etc., and use a column for each.
- Use a different model, e.g. logistic regression with One vs. Rest (note this isn't directly supported in rusty-machine).
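The age-binning option can be sketched as a small one-hot encoder, so that age becomes a handful of binary columns like the rest of the features. This is only an illustration: the `AGE_BIN_EDGES` constant and the `age_to_one_hot` function are hypothetical names, and the half-open bins [0, 20), [20, 40), [40, 60), [60, ∞) are one assumed way to resolve the shared boundaries in the ranges listed above.

```rust
/// Hypothetical bin edges giving four bins: [0,20), [20,40), [40,60), [60, inf).
const AGE_BIN_EDGES: [f64; 3] = [20.0, 40.0, 60.0];

/// Encode an age as a one-hot row with one column per bin.
/// An age equal to an edge falls into the upper bin; NaN falls into the first bin.
fn age_to_one_hot(age: f64) -> [f64; 4] {
    // The bin index is the number of edges the age has reached or passed.
    let idx = AGE_BIN_EDGES.iter().filter(|&&edge| age >= edge).count();
    let mut row = [0.0; 4];
    row[idx] = 1.0;
    row
}
```

For instance, `age_to_one_hot(35.0)` sets only the second column. The resulting columns can then be appended to the other binary features before training a Bernoulli model.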

And probably several other options.
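As for the One vs. Rest idea mentioned above, here is a rough sketch of how it could be put together by hand: train one binary logistic model per class (that class against all others) and predict the class whose model scores highest. This is plain Rust with batch gradient descent, not rusty-machine's API; `BinaryModel`, `train_binary`, `train_one_vs_rest` and `predict_one_vs_rest` are hypothetical names, and the learning rate and epoch count are left to the caller.

```rust
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

/// One binary logistic model: a weight per feature plus a bias.
struct BinaryModel {
    weights: Vec<f64>,
    bias: f64,
}

impl BinaryModel {
    /// Linear score (the logit); higher means "more likely this class".
    fn score(&self, x: &[f64]) -> f64 {
        self.bias + self.weights.iter().zip(x).map(|(w, xi)| w * xi).sum::<f64>()
    }
}

/// Fit one logistic model with batch gradient descent on the log-loss.
/// `ys` holds 1.0 for the positive class and 0.0 for everything else.
fn train_binary(xs: &[Vec<f64>], ys: &[f64], lr: f64, epochs: usize) -> BinaryModel {
    let n_features = xs[0].len();
    let n = xs.len() as f64;
    let mut model = BinaryModel { weights: vec![0.0; n_features], bias: 0.0 };
    for _ in 0..epochs {
        let mut grad_w = vec![0.0; n_features];
        let mut grad_b = 0.0;
        for (x, &y) in xs.iter().zip(ys) {
            let err = sigmoid(model.score(x)) - y;
            for (g, xi) in grad_w.iter_mut().zip(x) {
                *g += err * xi;
            }
            grad_b += err;
        }
        for (w, g) in model.weights.iter_mut().zip(&grad_w) {
            *w -= lr * g / n;
        }
        model.bias -= lr * grad_b / n;
    }
    model
}

/// Train one "class k vs. everything else" model per class.
fn train_one_vs_rest(
    xs: &[Vec<f64>],
    labels: &[usize],
    n_classes: usize,
    lr: f64,
    epochs: usize,
) -> Vec<BinaryModel> {
    (0..n_classes)
        .map(|k| {
            let ys: Vec<f64> = labels.iter().map(|&l| if l == k { 1.0 } else { 0.0 }).collect();
            train_binary(xs, &ys, lr, epochs)
        })
        .collect()
}

/// Predict the class whose binary model gives the highest score.
fn predict_one_vs_rest(models: &[BinaryModel], x: &[f64]) -> usize {
    let mut best = 0;
    let mut best_score = f64::NEG_INFINITY;
    for (k, m) in models.iter().enumerate() {
        let s = m.score(x);
        if s > best_score {
            best_score = s;
            best = k;
        }
    }
    best
}
```

With 46 classes this means 46 independent binary fits, which is simple to reason about. A raw age column would likely need scaling before being fed to gradient descent alongside 0/1 features.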

I'm still very happy to try and support what you're doing in rusty-machine. I'd also be happy to try and support you with adapting the existing implementation to your needs - though it would take some convincing for me to place it in the main project. If there's anything else I can do please let me know!

AtheMathmo commented 8 years ago

@rrichardson am I ok to close this issue? I am happy to continue the discussion here if you want, though it may be easier on Gitter.

rrichardson commented 8 years ago

kill it with fire.

AtheMathmo / rusty-machine

naive bayes with mixed distributions #62