Ahmedkoptan / Categorical-Naive-Bayes-Classifier-for-Mushroom-Dataset

Naive Bayes Classifier for Mushroom Dataset with Laplacian smoothing to detect whether a mushroom is edible or poisonous

Can you please clarify what you mean by feature distributions? #1

Open Sandy4321 opened 4 years ago

Sandy4321 commented 4 years ago

Can you please clarify what you mean by feature distributions?

For example, discrete features that are categorically distributed: the categories of each feature are drawn from a categorical distribution. https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html#sklearn.naive_bayes.CategoricalNB

It seems you mean that the distributions are estimated from the data? That is, for a given categorical feature, the probabilities and the conditional probabilities (of feature values given the target) are calculated from the data?

As mentioned in https://datascience.stackexchange.com/questions/58720/naive-bayes-for-categorical-features-non-binary: "Some people recommend using MultinomialNB, which in my opinion doesn't make sense because it considers feature values to be frequency counts."

Ahmedkoptan commented 4 years ago

@Sandy4321

Yes, my implementation of Naive Bayes produces a similar result to sklearn's Categorical Naive Bayes, but here I implemented it from scratch.

However, to be able to use sklearn's Categorical Naive Bayes, you would need to use Ordinal Encoder to convert the values in a given feature to numbers (where the range of the numbers is the arity of the respective feature). In my implementation, I didn't need an Ordinal Encoder, and I used the raw feature values as is.
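For reference, a minimal sketch of that sklearn route, using toy data with made-up feature values (not the actual mushroom dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# Toy categorical features standing in for mushroom attributes (hypothetical values)
X = np.array([["convex", "brown"],
              ["flat",   "white"],
              ["convex", "white"],
              ["flat",   "brown"]])
y = np.array(["edible", "poisonous", "edible", "poisonous"])

# CategoricalNB expects non-negative integer codes, so encode the strings first;
# each column's codes then range over 0..arity-1 of that feature
enc = OrdinalEncoder()
X_enc = enc.fit_transform(X)

clf = CategoricalNB(alpha=1.0)  # alpha=1.0 is Laplace smoothing
clf.fit(X_enc, y)
print(clf.predict(enc.transform([["convex", "brown"]])))  # -> ['edible']
```

The encoding step is exactly the extra work a from-scratch implementation on raw string values avoids.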

Moreover, to calculate the conditional probabilities (P(Xij=x|Yi=yk)), I basically followed the Maximum Likelihood Estimate of the Multinomial Distribution, where you count how often x and yk occurred together and divide by the count of yk. However, this is not the same as sklearn's Multinomial NB, since that one, as you described, considers feature values to be frequency counts.
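That count-and-divide estimate (with Laplace smoothing) can be sketched as follows; the arrays and the helper name are made up for illustration:

```python
import numpy as np

# Hypothetical feature column and binary labels, for illustration only
feature = np.array(["a", "b", "a", "a", "b"])
labels = np.array([1, 0, 1, 0, 0])

def cond_prob(x, yk, feature, labels, alpha=1.0, arity=2):
    """Smoothed MLE of P(X=x | Y=yk): count of (x, yk) occurring together,
    divided by the count of yk, with Laplace smoothing alpha."""
    joint = np.sum((feature == x) & (labels == yk))
    class_count = np.sum(labels == yk)
    return (joint + alpha) / (class_count + alpha * arity)

print(cond_prob("a", 1, feature, labels))  # (2 + 1) / (2 + 2) = 0.75
```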

I hope this answers your questions.

Sandy4321 commented 4 years ago

Great, thanks for the answer. Does your code also make it possible to see the likelihoods for each value of each feature, as scikit-learn does? In scikit-learn they did not code a way to see which likelihood belongs to which class, and they are not planning to soon. They just give the likelihoods, but how do you find which class each one relates to? Hopefully you have that?

Ahmedkoptan commented 4 years ago

In sklearn's Categorical NB, you can use the 'predict_proba' method, which returns a matrix of probabilities where entry (i, c) is the probability of sample i belonging to class c. You can read more about that in the link you posted earlier and in the sklearn User Guide.
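A quick sketch of 'predict_proba' on toy integer-coded data (not the mushroom dataset):

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy data: two integer-coded categorical features, two classes
X = np.array([[0, 1], [1, 0], [0, 0], [1, 1]])
y = np.array([0, 1, 0, 1])

clf = CategoricalNB().fit(X, y)
proba = clf.predict_proba(X)  # shape (n_samples, n_classes); rows sum to 1
print(clf.classes_)           # column order of proba
print(proba.shape)
```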

In my implementation, I only had 2 classes to deal with (a mushroom is edible or not), so you can simply print out 'mypreds', which gives you a vector of the probability of each instance being edible.

Sandy4321 commented 4 years ago

No, I meant the likelihood for each value of each feature.

Ahmedkoptan commented 4 years ago

@Sandy4321 If you are referring to (P(Xij=x|Yi=yk)), then you would need to store all of those conditional probabilities in a 3D tensor, with dimensions (for example) [instances x features x classes].

My implementation doesn't store that, but adding a print statement for 'px_y' inside the nested for loop of the Test split / Cross validation block will probably show you what you seek.
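If it helps, sklearn's CategoricalNB does expose these per-value, per-class likelihoods on the fitted estimator via its `feature_log_prob_` attribute: a list with one array per feature, each of shape (n_classes, n_categories). A small sketch on toy data:

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Toy integer-coded data, for illustration only
X = np.array([[0, 1], [1, 0], [0, 0], [1, 1]])
y = np.array([0, 1, 0, 1])
clf = CategoricalNB().fit(X, y)

# feature_log_prob_[j][c, v] is log P(X_j = v | Y = c)
for j, logp in enumerate(clf.feature_log_prob_):
    print(f"feature {j}: rows correspond to classes {clf.classes_}")
    print(np.exp(logp))  # back to probabilities; each row sums to 1
```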

Sandy4321 commented 4 years ago

I see, thanks.