elixir-nx / scholar

Traditional machine learning on top of Nx
Apache License 2.0

Multinomial Naive Bayes bug #242

Closed mattnenterprise closed 5 months ago

mattnenterprise commented 6 months ago

I have been doing some text classification in Python and have been trying to get similar results in Elixir using Scholar, but I am running into issues. The Elixir multinomial NB implementation appears to swap the class counts and class log priors for the two classes. I have attached two zip files containing .npy files to help reproduce the issue.

x-train-data.zip y-train-data.zip

Here is an example python script to load the unzipped files and fit the multinomial NB classifier.

from pprint import pprint
import numpy as np
from sklearn.naive_bayes import MultinomialNB

x_train = np.load('<path to x-train-data.npy>')
y_train = np.load('<path to y-train-data.npy>')

classifier = MultinomialNB()
classifier.fit(x_train, y_train)
pprint(vars(classifier))

Here is an Elixir example to load these files and fit the multinomial NB scholar classifier.

Mix.install([
  {:scholar, "~> 0.2.1"},
  {:exla, ">= 0.0.0"}
])

Nx.global_default_backend(EXLA.Backend)

x_train =
  "<path to x-train-data.npy>"
  |> File.read!()
  |> Nx.load_numpy!()

y_train =
  "<path to y-train-data.npy>"
  |> File.read!()
  |> Nx.load_numpy!()

model = Scholar.NaiveBayes.Multinomial.fit(x_train, y_train, num_classes: 2)

If you log the Elixir multinomial NB model, it shows class 0 occurring in the data 596 times and class 1 occurring 604 times. These numbers are backwards: class 0 actually occurs 604 times and class 1 occurs 596 times.

The logged Python classifier, by contrast, shows the correct counts.

josevalim commented 6 months ago

I may be wrong here, but does the order of the classes matter? Isn't it only the counts that matter, while the indexes assigned to the classes are irrelevant?

krstopro commented 6 months ago

> I may be wrong here, but does the order of the classes matter? Isn't it only the counts that matter, while the indexes assigned to the classes are irrelevant?

@josevalim Every observation in the dataset has a corresponding label y, in this case 0 or 1. I would say the order does matter, though I could be wrong as well. :) Having a look at this now.

krstopro commented 6 months ago

Alright, I think I see where the problem is. The function Scholar.Preprocessing.ordinal_encode is used to encode the labels: https://github.com/elixir-nx/scholar/blob/c5614a9d9f811494cafb0b27e34b50b8f8e8c50f/lib/scholar/naive_bayes/multinomial.ex#L313-L318

In this case 0 is mapped to 1 and vice versa. However, the classes themselves are set to classes_encoded = Nx.iota({num_classes}) and the encoding itself is not stored as a property, so there is no way to tell which class got mapped to which label. scikit-learn uses LabelBinarizer and stores the classes as an attribute of the fitted object: https://github.com/scikit-learn/scikit-learn/blob/5c4aa5d0d90ba66247d675d4c3fc2fdfba3c39ff/sklearn/naive_bayes.py#L735-L737 This way we know that class labelbin.classes_[i] is mapped to label i.
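For reference, here is a quick illustration (with made-up labels, not the data from this issue) of how scikit-learn exposes that mapping on the fitted binarizer:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

y = np.array([4, 3, 5, 3])
lb = LabelBinarizer().fit(y)

# classes_ holds the original labels in sorted order;
# class lb.classes_[i] corresponds to one-hot column / encoded label i
print(lb.classes_)           # [3 4 5]
print(lb.transform([4, 3]))  # [[0 1 0], [1 0 0]]
```

Because classes_ is kept on the fitted object, any prediction expressed in encoded labels can be mapped back to the original class values.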

krstopro commented 6 months ago

Currently, there is no way to recover the original classes from model predictions. For example:

x = Nx.iota({4, 3})
y = Nx.tensor([4, 3, 5, 3])
model = Scholar.NaiveBayes.Multinomial.fit(x, y, num_classes: 3)
Scholar.NaiveBayes.Multinomial.predict(model, Nx.tensor([[6, 2, 4], [8, 5, 9]]))

gives

#Nx.Tensor<
  s64[2]
  EXLA.Backend<host:0, 0.4265522339.3080323092.221915>
  [0, 0]
>

but we don't know if 0 corresponds to class 3, 4, or 5.

I would say this is a bug and suggest the following: remove Scholar.Preprocessing.ordinal_encode from Scholar.NaiveBayes.Multinomial.fit and assume that y consists of integers between 0 and num_classes - 1 (and maybe perform a check). In other words, leave the encoding to the user.
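A sketch of that scheme (in Python/NumPy, with the same made-up labels as above): the caller encodes y to 0..num_classes - 1 up front, keeps the mapping, and decodes predictions afterwards.

```python
import numpy as np

y = np.array([4, 3, 5, 3])

# classes holds the original labels; index = encoded label
# y_encoded is what would be passed to fit
classes, y_encoded = np.unique(y, return_inverse=True)
print(classes)    # [3 4 5]
print(y_encoded)  # [1 0 2 0]

# predictions come back as encoded labels and are decoded by indexing
preds = np.array([0, 0])
print(classes[preds])  # [3 3]
```

The same round trip could be done on the Elixir side with any ordinal encoder, as long as the class-to-index mapping is kept by the caller.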

josevalim commented 6 months ago

@krstopro I think that's a good first improvement. We could check by returning -1 or something; I believe we do checks like that in a few places in Scholar, but we can only raise if called outside of a defn.

krstopro commented 6 months ago

@josevalim We can validate the data inside deftransform, right? I think this is done almost everywhere in Scholar inside fit and predict.

josevalim commented 6 months ago

deftransform allows us to validate shapes and types, but not the tensor values.

krstopro commented 6 months ago

> deftransform allows us to validate shapes and types, but not the tensor values.

@josevalim Right, but does it make sense to check the min and max values inside it? I guess that would involve some copying of the data.

josevalim commented 6 months ago

No, it is not possible to do so. At best you have to return that as part of the result and let the caller check.

krstopro commented 6 months ago

> No, it is not possible to do so. At best you have to return that as part of the result and let the caller check.

Ah, correct. Sorry, I didn't think about it.

msluszniak commented 6 months ago

Yeah, the only way to keep labels consistent between Python and Scholar is to use a label encoder from scikit-learn before passing the data to Naive Bayes.
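For example (again with hypothetical labels), encoding on the Python side with scikit-learn's LabelEncoder before handing the encoded labels to Scholar:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(np.array([4, 3, 5, 3]))
print(le.classes_)  # [3 4 5]
print(y_encoded)    # [1 0 2 0] -- values in 0..num_classes-1, as Scholar expects

# predictions (encoded labels) can be mapped back to the original labels
print(le.inverse_transform(np.array([0, 0])))  # [3 3]
```

Since both LabelEncoder and Scholar would then agree that labels run from 0 to num_classes - 1, the class counts and priors line up between the two implementations.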

krstopro commented 5 months ago

Fixed with #248. Thanks for reporting the bug, @mattnenterprise!