Sandy4321 opened 4 years ago
For details please see https://github.com/scikit-learn/scikit-learn/issues/10856
about CategoricalNB and GeneralNB, and https://github.com/remykarem/mixed-naive-bayes
Also https://datascience.stackexchange.com/questions/58720/naive-bayes-for-categorical-features-non-binary : some people recommend using MultinomialNB, which in my opinion doesn't make sense because it treats feature values as frequency counts.
Can you please comment on this, since 8 months have already passed since that question was created? Even just yes or no?
The best way to use the results of the paper on this problem is to use Bernoulli naive Bayes. A categorical feature can be converted into a binary feature vector; for example, if the first feature has 3 categorical values {1, 2, 3}, it can be converted into a new three-dimensional feature where [1 0 0], [0 1 0], and [0 0 1] represent the categorical values 1, 2, and 3 respectively.
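For example, something along these lines (a rough sketch; the toy data is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import BernoulliNB

# toy data: one categorical feature taking values {1, 2, 3}
X_cat = np.array([[1], [2], [3], [2], [1], [3]])
y = np.array([0, 0, 1, 1, 0, 1])

# one-hot encode: 1 -> [1 0 0], 2 -> [0 1 0], 3 -> [0 0 1]
X_bin = OneHotEncoder().fit_transform(X_cat).toarray()

# the binary matrix can now be used with Bernoulli naive Bayes
clf = BernoulliNB().fit(X_bin, y)
print(clf.predict(X_bin[:2]))
```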
Another approach is to work through the derivation in the paper but use categorical distributions instead of the Bernoulli or multinomial conditional probability distributions.
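For the classification side (this is not the feature-selection code in this repo, just an illustration), scikit-learn's CategoricalNB already models each feature with a categorical distribution; a rough sketch, assuming the features are ordinal-encoded integers:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

# toy string-valued categorical data, purely illustrative
X_raw = np.array([["red", "small"], ["blue", "large"],
                  ["red", "large"], ["green", "small"]])
y = np.array([0, 1, 1, 0])

# CategoricalNB expects each column encoded as integers 0..n_categories-1
X_enc = OrdinalEncoder().fit_transform(X_raw)

clf = CategoricalNB().fit(X_enc, y)
print(clf.predict(X_enc))
```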
Great, thanks for answering. But regarding "Another approach is to work through the derivation in the paper but use categorical distributions instead of the Bernoulli or multinomial conditional probability distributions":
then it would not be a sparse data matrix, since categorical data is dense: for each matrix cell we have some non-zero categorical value?
1. Can categorical features be used as input to nfs.fit_transform?

```python
nfs = NaiveFeatureSelection(k=kv)
# Use fit_transform to extract selected features
X_new = nfs.fit_transform(X_train, y_train)
```
If not, should the data be integer feature counts, per "The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work"? Can it then be like the example in https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html, so that your code is used as nfs.fit_transform(X, y) with such data?
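That is, something roughly like the example from the linked MultinomialNB docs (non-negative integer count features):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(1)
# non-negative integer "count" features, as in the MultinomialNB docs example
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB().fit(X, y)
print(clf.predict(X[2:3]))
```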
2. Can the code, as it is today, be used as nfs.fit_transform(X, y) where the data looks, for example, like this:

```python
rng.randint(2, size=(6, 10))
array([[0, 1, 0, 1, 1, 0, 1, 0, 0, 1],
       [0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 1, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 0, 0]])
```
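If both answers are yes via one-hot encoding, a possible end-to-end sketch would be to encode the categorical columns first and feed the resulting binary matrix to the selector (the import path, the toy data, and the value of k are my assumptions here):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# import path is an assumption -- use however NaiveFeatureSelection is exposed in this repo
from naive_feature_selection import NaiveFeatureSelection

# toy categorical data, purely illustrative
X_cat = np.array([["a", "x"], ["b", "y"], ["c", "x"],
                  ["b", "z"], ["a", "y"], ["c", "z"]])
y = np.array([0, 1, 1, 0, 0, 1])

# one-hot encode to a binary (0/1) matrix, like the rng.randint(2, ...) example above
X_bin = OneHotEncoder().fit_transform(X_cat).toarray()

nfs = NaiveFeatureSelection(k=3)  # k chosen arbitrarily for the sketch
X_new = nfs.fit_transform(X_bin, y)
print(X_new.shape)
```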
3. As we can see in https://scikit-learn.org/stable/modules/naive_bayes.html#multinomial-naive-bayes,
the decision rule for Bernoulli naive Bayes is based on

P(x_i | y) = P(i | y) x_i + (1 - P(i | y)) (1 - x_i)

which differs from multinomial NB's rule in that it explicitly penalizes the non-occurrence of a feature i that is an indicator for class y, where the multinomial variant would simply ignore a non-occurring feature.
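To make that difference concrete, a tiny numeric illustration (the value 0.8 for P(i | y) is arbitrary):

```python
p_iy = 0.8  # P(i | y): probability that feature i occurs in class y (arbitrary value)

# Bernoulli NB per-feature likelihood term
def bernoulli_term(x_i, p_iy):
    return p_iy * x_i + (1 - p_iy) * (1 - x_i)

print(bernoulli_term(1, p_iy))  # 0.8 -> the feature occurs
print(bernoulli_term(0, p_iy))  # 0.2 -> non-occurrence is explicitly penalized

# In multinomial NB the likelihood contains P(i | y) ** x_i,
# so a count of x_i = 0 contributes a factor of 1 and is simply ignored.
print(p_iy ** 0)  # 1.0
```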
The important question is: does your current implementation explicitly penalize the non-occurrence of a feature, or not? And if LabelEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) is used for transforming categorical features to integers,
for example as described in https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn:
```python
# Import LabelEncoder
from sklearn import preprocessing

# creating labelEncoder
le = preprocessing.LabelEncoder()

# Converting string labels into numbers
wheather_encoded = le.fit_transform(wheather)
temp_encoded = le.fit_transform(temp)
label = le.fit_transform(play)
```
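For reference, LabelEncoder produces 0-based integers; a quick check with made-up values standing in for the tutorial's weather column:

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# placeholder values, not the tutorial's actual data
wheather = ["Sunny", "Overcast", "Rainy", "Sunny"]
print(le.fit_transform(wheather))  # [2 0 1 2] -- classes sorted alphabetically, encoding starts at 0
```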
Should 1 be added to the encoded values?
Or would you suggest another way to transform categorical features?
Thanks...