Closed rrichardson closed 8 years ago
Hey! Thanks for this issue - it sounds like a really interesting project. Before I jump into a discussion I should give a disclaimer. Rusty-machine is an early stage, experimental library. Though I've taken care to build what I think is a cool system there are bound to be some growing pains and you can expect sub-par performance in comparison to more mature, state of the art libraries. The reason I give that disclaimer is that if this project is serious you may want to look elsewhere (although I would love to try and support it with rusty-machine!).
Ok, onto the real stuff! Right now rusty-machine doesn't support what you're describing (as you could probably guess) and it would take quite a bit of work to get it there. I'm not sure that work would necessarily be the best idea as (from my admittedly limited knowledge) I don't think many people use naive bayes in this way - and it would complicate the existing model. For example looking at scikit learn's implementation they also do not support this.
You do have some options though:
And probably several other options.
I'm still very happy to try and support what you're doing in rusty-machine. I'd also be happy to try and support you with adapting the existing implementation to your needs - though it would take some convincing for me to place it in the main project. If there's anything else I can do please let me know!
@rrichardson am I ok to close this issue? I am happy to continue the discussion here if you want. Though it may be easier on gitter.
kill it with fire.
I am pretty much a statistics and probability newb, so If there is a better way to solve this problem which doesn't involve this approach, I am all ears.
I have a fairly large data set ( > 500k rows) which have user demographic data which I've decomposed into binary features (2 columns for gender, 50 columns for states**, etc), in fact, the only feature that could be considered continuous would be the age. I'm trying to predict a category of "hobby" for these people, which is one of 46 categories.
Doing a bit of research, it seems that I might be able to optimize my results if I use a Bernoulli distribution for the binary features, and gaussian (or perhaps other) distribution for the age. Would it be possible to set distributions by column in the bayes trainer?
\ using states as features is not optimal, I need to decompose them into region by behavior, but that's a different science project altogether (probably knn on lat/long :) )