Closed: mattnenterprise closed this issue 5 months ago
I may be wrong here, but does the order of the classes matter? What matters is the number of occurrences of each class, while the indexes assigned to the classes are not relevant, right?
@josevalim Every observation in the dataset should have a corresponding label y
, in this case 0 or 1. I would say the order does matter. I could be wrong as well. :)
Having a look at this now.
Alright, I think I see where the problem might be. Function Scholar.Preprocessing.ordinal_encode
is used to encode the labels.
https://github.com/elixir-nx/scholar/blob/c5614a9d9f811494cafb0b27e34b50b8f8e8c50f/lib/scholar/naive_bayes/multinomial.ex#L313-L318
In this case 0 is mapped to 1 and vice-versa.
However, the classes themselves are set to classes_encoded = Nx.iota({num_classes})
and the encoding itself is not stored as a property. There is no way to tell which class got mapped to which label.
In scikit-learn
they use LabelBinarizer
and they set classes to the attribute of the fitted object.
https://github.com/scikit-learn/scikit-learn/blob/5c4aa5d0d90ba66247d675d4c3fc2fdfba3c39ff/sklearn/naive_bayes.py#L735-L737
Like this we know that class labelbin.classes_[i]
is mapped to label i
.
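For illustration, the idea behind storing the class mapping can be sketched in plain Python (this is a hedged sketch of the concept, not Scholar or scikit-learn code): the encoder sorts the unique labels and keeps them, so that index i can always be mapped back to classes[i].

```python
def ordinal_encode(y):
    # Sort the unique labels and remember them: label classes[i]
    # is encoded as i, so the mapping can be inverted later.
    classes = sorted(set(y))
    to_index = {c: i for i, c in enumerate(classes)}
    return classes, [to_index[label] for label in y]

classes, encoded = ordinal_encode([4, 3, 5, 3])
print(classes)   # [3, 4, 5]
print(encoded)   # [1, 0, 2, 0]
```

Keeping classes alongside the encoded labels is exactly what lets the caller decode a predicted index i back to the original class classes[i].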
Currently, there is no way to retrieve original classes from model predictions. For example
x = Nx.iota({4, 3})
y = Nx.tensor([4, 3, 5, 3])
model = Scholar.NaiveBayes.Multinomial.fit(x, y, num_classes: 3)
Scholar.NaiveBayes.Multinomial.predict(model, Nx.tensor([[6, 2, 4], [8, 5, 9]]))
gives
#Nx.Tensor<
s64[2]
EXLA.Backend<host:0, 0.4265522339.3080323092.221915>
[0, 0]
>
but we don't know if 0 corresponds to class 3, 4, or 5.
I would say this is a bug and suggest the following. Remove Scholar.Preprocessing.ordinal_encode
from Scholar.NaiveBayes.Multinomial.fit
and assume that y
consists of numbers between 0 and num_classes - 1
(maybe perform a check). In other words, leave the encoding to the user.
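In plain Python, the proposed check could look something like the following (a sketch of the idea only; Scholar would have to implement it differently given the defn/deftransform constraints discussed below):

```python
def validate_labels(y, num_classes):
    # The proposal: y must already be encoded, i.e. every label is an
    # integer between 0 and num_classes - 1; anything else is an error.
    for label in y:
        if not (isinstance(label, int) and 0 <= label < num_classes):
            raise ValueError(
                f"labels must be integers in 0..{num_classes - 1}, got {label!r}"
            )

validate_labels([0, 1, 2, 1], num_classes=3)   # passes
# validate_labels([4, 3, 5, 3], num_classes=3) would raise ValueError
```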
@krstopro I think that's a good first improvement. We could check by returning a -1 or something; I believe we do this in some places in Scholar, but we can only raise if called outside of a defn
.
@josevalim We can validate the data inside deftransform
, right? I think this is done almost everywhere in Scholar inside fit
and predict
.
deftransform allows us to validate shapes and types, but not the tensor values.
@josevalim Right, but does it make sense to check for min and max values inside it? I guess there will be some copying of the data.
No, it is not possible to do so. At best you have to return that as part of the result and let the caller check.
Ah, correct. Sorry, I didn't think about it.
Yeah, the only way to keep labels consistent between Python and Scholar is to use a label encoder from Scikit-Learn before passing data to naive bayes.
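For example, on the Python side this could look like the sketch below, using scikit-learn's LabelEncoder (the actual model fitting, in Scholar or scikit-learn, is elided):

```python
from sklearn.preprocessing import LabelEncoder

y = [4, 3, 5, 3]
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)   # values in 0..num_classes-1
print(list(y_encoded))                 # [1, 0, 2, 0]

# Fit the model on y_encoded, then map predicted indices back to the
# original labels via the encoder's stored classes:
predictions = [0, 0]                   # e.g. output of predict
print(list(encoder.inverse_transform(predictions)))  # [3, 3]
```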
Fixed with #248. Thanks for reporting the bug, @mattnenterprise!
I have been doing some text classification in Python and have been trying to get similar results in Elixir using Scholar, but I am having issues. The multinomial NB implementation in Elixir appears to swap the class counts and class log priors for the two classes. I have attached two zip files containing .npy files to help replicate the issue.
x-train-data.zip y-train-data.zip
Here is an example python script to load the unzipped files and fit the multinomial NB classifier.
Here is an Elixir example to load these files and fit the multinomial NB scholar classifier.
If you log the Elixir multinomial NB model, it shows class 0 occurring 596 times and class 1 occurring 604 times. These counts are actually backwards: class 0 occurs 604 times and class 1 occurs 596 times in the data.
If you look at the logged python classifier it shows the correct counts.