The-Strategy-Unit / pxtextmining

Text classification of NHS patient feedback.
https://the-strategy-unit.github.io/pxtextmining/
MIT License

Mixed Naive Bayes model #21

Closed: andreassot10 closed this issue 3 years ago

andreassot10 commented 3 years ago

Create a Naive Bayes classifier that can handle both TF-IDF/count data (multinomial) and categorical data. Most likely, this will be a combination of sklearn.naive_bayes.MultinomialNB or sklearn.naive_bayes.ComplementNB with sklearn.naive_bayes.CategoricalNB.

A "mixed" NB model is available here, although it combines Gaussian and Categorical.

It seems easy to combine the scikit-learn APIs to build our own model: it is just a matter of multiplying the class probabilities that each model (Multinomial/Complement and Categorical) predicts separately and then normalizing, as done here:

https://github.com/remykarem/mixed-naive-bayes/blob/94034528f7169a84c46977d66ea34ca4dd86dc2f/mixed_naive_bayes/mixed_naive_bayes.py#L268-L278

An alternative approach would be to treat the predicted probabilities of the two models as features in a third NB model. At the time of writing, @ChrisBeeley and I are not entirely sure what the benefit over multiplication would be, so it is probably best to rely on simple multiplication with normalization, as in the sketch below.
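
A minimal sketch of that combination, assuming clf_text (a MultinomialNB or ComplementNB fitted on the TF-IDF features) and clf_cat (a CategoricalNB fitted on the categorical features) were trained on the same y so that their classes_ line up; the function name is made up:

import numpy as np

def mixed_nb_proba(clf_text, clf_cat, X_text, X_cat):
    # Multiply the two models' class probabilities and renormalize,
    # working in log space to avoid numerical underflow
    log_p = (clf_text.predict_log_proba(X_text)
             + clf_cat.predict_log_proba(X_cat))
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exp
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)    # rows sum to 1

Note that each model's posterior already contains the class prior, so the prior is counted twice in the product; with balanced classes this cancels in the normalization.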

andreassot10 commented 3 years ago

No need for any of this. First, a multinomial NB model with one-hot encoded categorical variables is equivalent to a categorical NB model. Second, a multinomial NB fitted on one-hot encoded categorical data together with TF-IDF data is equivalent to the (normalized) product of a multinomial NB on the one-hot encoded categorical data and a multinomial NB on the TF-IDF data. Both points are demonstrated below.

########################################################################################
# Multinomial NB with one-hot encoded categorical data is equivalent to Categorical NB #
########################################################################################

# https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf
import numpy as np
import pandas as pd
from sklearn.naive_bayes import (
    BernoulliNB, CategoricalNB, ComplementNB, MultinomialNB)
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
rng = np.random.RandomState(1)

# Six samples of 100 categorical features, each taking values in 0-4
X = rng.randint(5, size=(6, 100))
enc.fit(X)
Xd = pd.DataFrame(enc.transform(X).toarray())  # one-hot encoded copy of X

# One class per sample, so the class priors are uniform
y = np.array([1, 2, 3, 4, 5, 6])

# Categorical NB on the raw categorical data
clf_cat = CategoricalNB()
clf_cat.fit(X, y)

# Bernoulli, multinomial and complement NB on the one-hot encoded data
clf_ber = BernoulliNB()
clf_ber.fit(Xd, y)

clf_mul = MultinomialNB()
clf_mul.fit(Xd, y)

clf_com = ComplementNB()
clf_com.fit(Xd, y)

clf_comnorm = ComplementNB(norm=True)  # complement NB with weight normalization
clf_comnorm.fit(Xd, y)

# Predicted class probabilities from each model
probs_cat = clf_cat.predict_proba(X)
probs_ber = clf_ber.predict_proba(Xd)
probs_mul = clf_mul.predict_proba(Xd)
probs_com = clf_com.predict_proba(Xd)
probs_comnorm = clf_comnorm.predict_proba(Xd)

# Compare the predicted probabilities for every sample
for i in range(len(y)):
    aux = pd.DataFrame(
        [probs_cat[i], probs_mul[i], probs_com[i], probs_comnorm[i], probs_ber[i]],
        index=['cat', 'mul', 'com', 'comnorm', 'ber'])
    print(aux)
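
The 'cat' and 'mul' rows should coincide; a one-line check of the match using the arrays above:

print(np.allclose(probs_cat, probs_mul))  # expected True: the two posteriors coincide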

########################################################################################
# Multinomial NB with one-hot encoded categorical data and TF-IDF data is equivalent
# to the (normalized) product of Multinomial NB with one-hot encoded categorical data
# and Multinomial NB with TF-IDF data
########################################################################################
# https://towardsdatascience.com/why-how-to-use-the-naive-bayes-algorithms-in-a-regulated-industry-with-sklearn-python-code-dbd8304ab2cf
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
enc = OneHotEncoder(handle_unknown='ignore')

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
corpus_tfidf = tfidf.fit_transform(corpus)  # TF-IDF features for the text

rng = np.random.RandomState(1)
# Four samples of 100 categorical features, each taking values in 0-4
X = rng.randint(5, size=(4, 100))
enc.fit(X)
Xd = pd.DataFrame(enc.transform(X).toarray())  # one-hot encoded copy of X

# One class per sample, as before
y = np.array([1, 2, 3, 4])

# Model 1: multinomial NB on the TF-IDF features only
clf_mul_tfidf = MultinomialNB()
clf_mul_tfidf.fit(corpus_tfidf, y)
probs_mul_tfidf = clf_mul_tfidf.predict_proba(corpus_tfidf)

# Model 2: multinomial NB on the one-hot encoded categorical features only
clf_mul_onehot = MultinomialNB()
clf_mul_onehot.fit(Xd, y)
probs_mul_onehot = clf_mul_onehot.predict_proba(Xd)

# Model 3: multinomial NB on both feature sets concatenated
both = pd.concat([pd.DataFrame(corpus_tfidf.toarray()), Xd], axis=1)
clf_mul_both = MultinomialNB()
clf_mul_both.fit(both, y)
probs_mul_both = clf_mul_both.predict_proba(both)

# Element-wise product of the two separate models' probabilities,
# renormalized so each row sums to 1
probs_product = []
for i in range(len(y)):
    p = probs_mul_onehot[i] * probs_mul_tfidf[i]
    probs_product.append(p / p.sum())

print(pd.DataFrame(probs_product))  # normalized product of the two models
print(probs_mul_both)               # single model on the concatenated features
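
To quantify how close the two tables are, one extra line using the arrays above:

# Largest absolute difference between the normalized product and the single model
print(np.abs(np.array(probs_product) - probs_mul_both).max())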

Also, any continuous features will be binned and then one-hot encoded to satisfy #22, along the lines of the sketch below.
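
scikit-learn's KBinsDiscretizer does both steps in one transformer; a minimal sketch (the bin count and strategy are illustrative, not decisions made here):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(1)
X_cont = rng.normal(size=(6, 3))  # toy continuous features

# Cut each feature into 5 equal-width bins and one-hot encode the bin ids
binner = KBinsDiscretizer(n_bins=5, encode='onehot-dense', strategy='uniform')
X_binned = binner.fit_transform(X_cont)
print(X_binned.shape)  # (6, 15): 3 features x 5 bins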

Bottom line: nothing needs to be done; we can use standard multinomial/complement NB models on the concatenated features.
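
For completeness, one way to wire this up with standard scikit-learn pieces; a sketch only, with made-up column names ('comment' for the free text, 'organisation' for a categorical variable):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'comment': ['Great care', 'Long wait', 'Very kind staff', 'Poor parking'],
    'organisation': ['trust_a', 'trust_b', 'trust_a', 'trust_c'],
})
y = [1, 0, 1, 0]

# TF-IDF the text column, one-hot the categorical column, and feed the
# concatenated features to a single multinomial NB, as argued above
features = ColumnTransformer([
    ('text', TfidfVectorizer(), 'comment'),
    ('cats', OneHotEncoder(handle_unknown='ignore'), ['organisation']),
])
model = Pipeline([('features', features), ('clf', MultinomialNB())])
model.fit(df, y)
print(model.predict(df))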