Closed · ablaom closed this 1 year ago
MLJLM uses base data types (a vector of reals), so it doesn't get the levels information and will just see a vector with a single unique element. It could be made to take an optional `levels` argument to guarantee that this line gets `c=2` in this case, as opposed to `c=1` (via `maximum(y)` here). That would mean calling `fit` with an additional argument, but I think that would be fine.
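To make the failure mode concrete, here is a small Python sketch (illustrative only, not the package's Julia code) of what inferring the class count from the data alone, à la `maximum(y)`, does when a subsample exhibits only one class:

```python
def infer_nclasses(y):
    """Guess the number of classes from an integer-encoded target.

    Mimics inferring c from the data alone: with labels encoded 1..c,
    the guess is max(y). If a class is absent from the sample, the
    guess is too small.
    """
    return max(y)

full = [1, 2, 2, 1, 2]  # both classes present -> infers c = 2
sub = [1, 1, 1]         # subsample exhibits one class -> infers c = 1 (wrong)

print(infer_nclasses(full))  # 2
print(infer_nclasses(sub))   # 1, although the pool really has two levels
```

An explicit `levels`-style argument would let the caller pin `c = 2` regardless of which classes the sample happens to contain.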
This here should get the right number of classes via the interface, though. We would need to figure out why `nclasses` is not passed properly (or not computed properly).
There's definitely a bug here somewhere.
It has something to do with the encoding, which is weird: in the encoding, the classes are numbered starting at -1, not 0 or 1.
@tlienart The problem is that, for some reason I don't understand at all, the encoding of `y` is special-cased if the pool of `y` has only two classes. If we subsample and only see one of the two classes, then the encoded `y` looks like `[-1, -1, ..., -1]`. In view of this, I think the definition of `getc` at this line is incorrect. The problem for me is that I really can't figure out what this `getc` is computing, as I'm not familiar enough with the code. What does the `scratch` function do, for example? I think you're the only one who can safely make a fix here.
For what it's worth, a better and safer design would probably be to remove all this binary special-casing altogether, if that makes sense here. But maybe you have your reasons...
The distinction binary/multiclass is in the internal representation of the vector. For binary it's more convenient to have `-1, 1`, as it allows computations with a single column as opposed to one column per class. I'm not quite ready to say that this is hugely essential and warrants the code style, but when initially working on this I was keen to try things like this to squeeze out a bit more performance. Same with the scratch space, which initialises a bunch of arrays in which computations can be done in place, so that you only need to allocate once.
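For readers unfamiliar with the trick: here is a hedged Python sketch (illustrative names, not the package's code) of why a `-1/1` encoding lets the binary case get away with a single coefficient column, whereas a generic multiclass treatment carries one column per class:

```python
import math

X = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
y_pm = [1, -1, 1]  # binary target encoded -1/1

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Binary: one coefficient column. The sign of y folds both classes into
# a single logistic loss term log(1 + exp(-y * <x, theta>)).
theta = [0.1, -0.2]
binary_loss = sum(math.log1p(math.exp(-yi * dot(xi, theta)))
                  for xi, yi in zip(X, y_pm))

# Multiclass with c classes: c coefficient columns (here c = 2), which is
# redundant for a binary problem -- twice the parameters and arithmetic.
c = 2
Theta = [[0.1, -0.1], [-0.2, 0.2]]  # rows = features, one column per class
y_int = [0 if yi == -1 else 1 for yi in y_pm]
multi_loss = 0.0
for xi, yi in zip(X, y_int):
    scores = [dot(xi, [row[k] for row in Theta]) for k in range(c)]
    multi_loss += math.log(sum(math.exp(s) for s in scores)) - scores[yi]

print(len(theta), sum(len(row) for row in Theta))  # 2 vs 4 parameters
```

This is the performance argument for the special-casing: the binary path halves parameter storage and the per-iteration matrix work, at the cost of a second code path.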
Anyway, here the problem is that I was trying to do a bit too much for the user: with `nclasses = 0` it would use a fallback where it tries to guess the number of classes from the data, which is clearly undesirable here. I've removed this by only recoding to -1/1 if the user explicitly specifies `Binary`; otherwise a `Multinomial` case with redundant computations is used. A test case with your example is added. Maybe not the way you'd have preferred, but I can't do much more at the moment.
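The shape of that fix can be sketched in Python (hypothetical function and names, not the package's actual API): encode to -1/1 only when the user explicitly declares the problem binary, and otherwise take the class count from the declared pool of levels rather than from the data:

```python
def encode_target(y, levels, binary=False):
    """Encode labels for the solver.

    binary=True  -> -1/1 encoding (single-column internal representation)
    binary=False -> integer codes 1..c, with c taken from `levels`,
                    never guessed from the data
    """
    if binary:
        if len(levels) != 2:
            raise ValueError("Binary declared but pool has != 2 levels")
        return [1 if v == levels[1] else -1 for v in y], 2
    code = {lv: i + 1 for i, lv in enumerate(levels)}
    return [code[v] for v in y], len(levels)

# A fold that only exhibits one of the two pooled classes:
y_sub, levels = ["a", "a", "a"], ["a", "b"]
enc, c = encode_target(y_sub, levels)
print(enc, c)  # [1, 1, 1] 2 -- the class count comes from the pool
```

The key point is that the degenerate fold no longer changes `c`, because `c` is supplied by the caller (or the interface) instead of being inferred from whichever labels happen to be present.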
In the training target below we have `length(levels(y)) == 2`, but `y` itself only exhibits one class. This is crashing `fit`. Occasionally, especially in smaller data sets, a large class may be "hidden" when we restrict to a particular fold, so this is an issue.
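How a fold hides a class can be shown with a tiny Python sketch (illustrative, not MLJ resampling code): the pool carries two levels, but a naive split leaves one training fold with a single class.

```python
y = ["a"] * 6 + ["b"] * 2  # small minority class "b"
levels = sorted(set(y))    # pool of levels: ["a", "b"]

# A naive (non-stratified) 2-fold split of the indices:
folds = [y[0:4], y[4:8]]
for k, fold in enumerate(folds):
    print(k, sorted(set(fold)), "levels:", levels)
# Fold 0 exhibits only "a", yet the pool still has 2 levels -- exactly
# the situation that crashed fit above. Stratified resampling reduces
# (but does not eliminate) the chance of this happening.
```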