Closed andreassot10 closed 3 years ago
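No need for all of this. First, a multinomial NB model with one-hot encoded categorical variables is equivalent to a categorical NB model (up to how the additive smoothing term is normalized). Second, a multinomial NB fitted on one-hot encoded categorical data plus TF-IDF data is equivalent to the normalized product of a multinomial NB fitted on the one-hot encoded categorical data and a multinomial NB fitted on the TF-IDF data (again up to per-class normalization constants and the class prior, which the product counts twice).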
########################################################################################
# Multinomial NB with one-hot encoded categorical data is equivalent to Categorical NB #
########################################################################################
# https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
enc.fit(X)
Xd = pd.DataFrame(enc.transform(X).toarray())
y = np.array([1, 2, 3, 4, 5, 6])
clf_cat = CategoricalNB()
clf_cat.fit(X, y)
clf_ber = BernoulliNB()
clf_ber.fit(Xd, y)
clf_mul = MultinomialNB()
clf_mul.fit(Xd, y)
clf_com = ComplementNB()
clf_com.fit(Xd, y)
clf_comnorm = ComplementNB(norm=True)
clf_comnorm.fit(Xd, y)
probs_cat = clf_cat.predict_proba(X)
probs_ber = clf_ber.predict_proba(Xd)
probs_mul = clf_mul.predict_proba(Xd)
probs_com = clf_com.predict_proba(Xd)
probs_comnorm = clf_comnorm.predict_proba(Xd)
for i in range(0, 5):
    aux = pd.DataFrame([probs_cat[i], probs_mul[i], probs_com[i], probs_comnorm[i], probs_ber[i]],
                       index=['cat', 'mul', 'com', 'comnorm', 'ber'])
    print(aux)
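A programmatic version of the comparison above, as a minimal sketch: it refits only the categorical and multinomial models on the same kind of data and checks the posteriors numerically instead of printing them. It assumes one sample per class, as in the example, so the smoothing denominators are identical across classes; with unbalanced classes and features that have different numbers of levels, the two models can differ slightly. The names `X_onehot`, `max_diff` and `same` are illustrative only.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, MultinomialNB
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))   # 6 samples, 100 categorical features
y = np.array([1, 2, 3, 4, 5, 6])    # one sample per class (assumption)

# One-hot encode the categorical features for the multinomial model.
X_onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(X).toarray()

probs_cat = CategoricalNB().fit(X, y).predict_proba(X)
probs_mul = MultinomialNB().fit(X_onehot, y).predict_proba(X_onehot)

# Largest absolute difference between the two posterior matrices.
max_diff = np.abs(probs_cat - probs_mul).max()
same = np.allclose(probs_cat, probs_mul)
print(max_diff, same)
```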
########################################################################################
# Multinomial NB with one-hot encoded categorical data and TF-IDF data is equivalent to
# the product (normalized) of Multinomial NB with one-hot encoded categorical data and
# Multinomial NB with TF-IDF data #
########################################################################################
# https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf
import numpy as np
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
enc = OneHotEncoder(handle_unknown='ignore')
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
corpus_tfidf = tfidf.fit_transform(corpus)
rng = np.random.RandomState(1)
X = rng.randint(5, size=(4, 100))
enc.fit(X)
Xd = pd.DataFrame(enc.transform(X).toarray())
y = np.array([1, 2, 3, 4])
clf_mul_tfidf = MultinomialNB()
clf_mul_tfidf.fit(corpus_tfidf, y)
probs_mul_tfidf = clf_mul_tfidf.predict_proba(corpus_tfidf)
clf_mul_onehot = MultinomialNB()
clf_mul_onehot.fit(Xd, y)
probs_mul_onehot = clf_mul_onehot.predict_proba(Xd)
both = pd.concat([pd.DataFrame(corpus_tfidf.toarray()), Xd], axis=1)
clf_mul_both = MultinomialNB()
clf_mul_both.fit(both, y)
probs_mul_both = clf_mul_both.predict_proba(both)
probs_product = []
for i in range(0, len(y)):
    probs_product.append(probs_mul_onehot[i] * probs_mul_tfidf[i] / sum(probs_mul_onehot[i] * probs_mul_tfidf[i]))
print(pd.DataFrame(probs_product))
print(probs_mul_both)
Also, any continuous values will be binned and then one-hot encoded to satisfy #22.
Bottom line: nothing needs to be done. Use the standard multinomial/complement NB models.
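For the binning of continuous values mentioned above, something along these lines should work. This is only a sketch: the data is made up, and the choice of `KBinsDiscretizer` with 4 uniform-width bins is an assumption, not something decided in this issue or in #22.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(1)
X_cont = rng.normal(size=(40, 3))   # hypothetical continuous features
y = np.arange(40) % 2               # hypothetical binary labels

# Bin each continuous column into 4 equal-width bins and one-hot encode
# the bin index (bin count and strategy are assumptions for illustration).
binner = KBinsDiscretizer(n_bins=4, encode='onehot-dense', strategy='uniform')
X_binned = binner.fit_transform(X_cont)

# The binned, one-hot encoded columns can go straight into a multinomial NB.
clf = MultinomialNB().fit(X_binned, y)
probs = clf.predict_proba(X_binned)
print(X_binned.shape, probs.shape)
```

Each row of `X_binned` has exactly one active column per original feature, so it can be concatenated with the other one-hot encoded and TF-IDF columns before fitting.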
Create a Naive Bayes classifier that can handle TF-IDF/count data (multinomial) and categorical data. More likely, a combination of sklearn.naive_bayes.MultinomialNB or sklearn.naive_bayes.ComplementNB with sklearn.naive_bayes.CategoricalNB. A "mixed" NB model is available here, although it combines Gaussian and Categorical.
Seems like it is easy to combine the scikit-learn APIs to make our own model. It is just a matter of multiplying the probabilities that each model (Multinomial/Complement & Categorical) predicts separately and then normalizing: https://github.com/remykarem/mixed-naive-bayes/blob/94034528f7169a84c46977d66ea34ca4dd86dc2f/mixed_naive_bayes/mixed_naive_bayes.py#L268-L278
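A rough sketch of what that multiplication could look like with the scikit-learn estimators. It is not the code from the linked repository. It works in log space to avoid underflow, and it subtracts one copy of the class log-prior, since each model's posterior already contains the prior; that correction is an addition here, not part of the description above, and with equal class frequencies it changes nothing. The toy data and variable names are made up.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import CategoricalNB, MultinomialNB

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
rng = np.random.RandomState(1)
X_cat = rng.randint(5, size=(4, 10))   # hypothetical integer-coded categoricals
y = np.array([1, 2, 3, 4])

X_text = TfidfVectorizer().fit_transform(corpus)

clf_text = MultinomialNB().fit(X_text, y)
clf_cat = CategoricalNB().fit(X_cat, y)

# Adding log-posteriors is the same as multiplying the probabilities.
log_joint = clf_text.predict_log_proba(X_text) + clf_cat.predict_log_proba(X_cat)
# Both posteriors include the class prior, so remove one copy (assumption).
log_joint -= clf_text.class_log_prior_
# Normalize each row so the class probabilities sum to 1.
probs_mixed = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))
print(probs_mixed)
```

Swapping `MultinomialNB` for `ComplementNB` should only require changing the estimator, since both expose `predict_log_proba`.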
An alternative approach would be to consider the predicted probabilities of the two models as features in a third NB model. At the time of writing, @ChrisBeeley and I are not entirely sure what the benefit over multiplication would be. It is probably best to rely on the simple multiplication with normalization.
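For comparison, the alternative could be sketched roughly as follows. The issue does not say which NB variant the third model would be, so `GaussianNB` is an assumption, as is the synthetic data. The meta-model is fitted on in-sample probabilities purely for brevity; a real version would presumably use out-of-fold predictions.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB, MultinomialNB

rng = np.random.RandomState(1)
n = 60
y = np.arange(n) % 3                            # hypothetical 3-class labels
X_counts = rng.poisson(lam=2.0, size=(n, 20))   # hypothetical count features
X_cat = rng.randint(5, size=(n, 10))            # hypothetical categorical features

clf_counts = MultinomialNB().fit(X_counts, y)
clf_cat = CategoricalNB().fit(X_cat, y)

# Stack the two posterior matrices side by side as meta-features.
meta = np.hstack([clf_counts.predict_proba(X_counts),
                  clf_cat.predict_proba(X_cat)])

# Third NB model on top of the stacked probabilities (GaussianNB is an assumption).
clf_meta = GaussianNB().fit(meta, y)
probs_stacked = clf_meta.predict_proba(meta)
print(meta.shape, probs_stacked.shape)
```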