JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading
MIT License

Enhance treatment of missing values in one-hot encoder #458

Open ablaom opened 2 years ago

ablaom commented 2 years ago

There is now missing value handling in OneHotEncoder but this simply propagates the missing values. I guess it might be nice to offer some other popular options for handling missing values which might be complicated to handle in a post-processing step. See also the discussion here.

@Chandu-4444 @Frank-III @OlivierLabayle

Chandu-4444 commented 2 years ago

The current implementation corresponds to the all-missing case, which I'd say is the easiest and most straightforward one. The other cases, such as all-zero and category, can also be implemented, and I can probably reuse part of my previous commit (link) for these. A simple modification to that, combined with the current implementation for handling missing values in OneHotEncoder, should enable all of the above-mentioned methods.

Any other ideas would be most welcomed.

ablaom commented 2 years ago

all-zero looks like the simplest. One question for category is how to handle missing values that appear, at transform time, for a feature that did not have missing values in training (fit). Here's a proposal for this:

We introduce a new hyper-parameter features_with_missing which can be: (i) a vector of feature names; (ii) the symbol :all; or (iii) the symbol :auto. When specified as a vector, the listed features always get the extra missing category, regardless of whether missing values appear in the input to transform. If features_with_missing == :auto, then the actual list used is inferred from the training data: a feature is on the list if missing appears for that feature in the training data. If features_with_missing == :all, then every feature gets the extra missing category.

In transform, if missing appears for a feature not on the list, then an informative error is thrown, explaining that the problem can be corrected by retraining with features_with_missing explicitly specified.

The default could be :all or :auto. Maybe :auto is okay. It might surprise a user who never reads documentation, but the error message explains what to do.
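The three-way hyper-parameter could be resolved at fit time along these lines. This is a plain-Julia sketch under stated assumptions: the function name `resolve_features_with_missing` and the table-as-NamedTuple representation are illustrative, not the actual MLJModels API.

```julia
# Hypothetical sketch of how `fit` might resolve the proposed
# `features_with_missing` hyper-parameter (names are assumptions):
function resolve_features_with_missing(spec, X::NamedTuple)
    spec isa AbstractVector && return collect(spec)   # explicit list of names
    spec === :all  && return collect(keys(X))         # every feature
    spec === :auto && return [name for (name, col) in pairs(X)
                              if any(ismissing, col)] # inferred from training data
    error("Unsupported features_with_missing value: $spec")
end

X = (a = [1, missing, 3], b = ["x", "y", "z"])
resolve_features_with_missing(:auto, X)   # → [:a]
resolve_features_with_missing(:all, X)    # → [:a, :b]
```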

We will also need a hyper-parameter to specify the kind of missing-value handling: :propagate, :all_zero or :category. Name suggestion: handle_missing (for consistency with sk-learn); missing_handling is another option. Default: :propagate. If handle_missing is not :category, and features_with_missing is not its default value, then clean! should issue a warning that features_with_missing is being ignored. Alternatively, we could combine the two new hyper-parameters into one somehow, although I'm not sure how to do this without creating cognitive dissonance.
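The proposed clean! interplay might look roughly like this. A hypothetical sketch only: the hyper-parameter names and the assumption that :auto is the default for features_with_missing come from the discussion above, not from existing MLJModels code.

```julia
# Hypothetical sketch of the proposed clean! logic. `handle_missing` and
# `features_with_missing` are the suggested hyper-parameters; :auto is
# assumed to be the default for features_with_missing.
function clean_message(handle_missing::Symbol, features_with_missing)
    handle_missing in (:propagate, :all_zero, :category) ||
        return "Unsupported handle_missing value; resetting to :propagate. "
    if handle_missing !== :category && features_with_missing !== :auto
        return "features_with_missing is ignored unless " *
               "handle_missing == :category. "
    end
    return ""
end

clean_message(:all_zero, [:name])  # non-empty warning message
clean_message(:category, [:name])  # → ""
```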


I wonder how this is handled elsewhere. Of course, one-hot encoding is often implemented as a "static" transformer (no separate training step), in which case this issue doesn't come up. That is not, however, an argument for making it static, in my view. I think it is preferable to have a consistent number of spawned features in the output each time transform is called. That is, by training just once, you can arrange that the number of spawned features does not depend on whether there are, or are not, missing values in a particular field to be transformed. Otherwise, downstream operations expecting a certain number of features might fail unexpectedly.
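The point about a fixed number of spawned features can be illustrated with a plain-Julia toy (a stand-in, not the actual OneHotEncoder code; `fitted_levels` and `onehot` are invented names):

```julia
# Toy illustration: columns are determined by the levels learned at fit
# time, so transform always spawns the same number of features, whatever
# levels happen to appear in the new data.
fitted_levels = ["a", "b", "c"]   # learned once, during fit

onehot(v) = (; [Symbol("x__", l) => Float64.(v .== l) for l in fitted_levels]...)

keys(onehot(["a", "c"]))       # (:x__a, :x__b, :x__c)
keys(onehot(["b", "b", "b"]))  # (:x__a, :x__b, :x__c) -- still three columns
```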

Anyone have a different suggestion?


Probably good to introduce the two options in separate PRs, starting with the easiest, the all-zero case.

Chandu-4444 commented 2 years ago

This page may help relate a few of the things said by @ablaom.

Chandu-4444 commented 2 years ago

Is this how the output should be for the minimal all-zero case?

```julia
julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(missing_handling = "all-zero")

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
 name__missing = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

ablaom commented 2 years ago

No, rather it's the same as the current behaviour, except that instead of missings we use zeros. You don't need to spawn an extra column in this case:

```julia
julia> X = (name = categorical(["a", "b", "c", "a", "b", missing]),)

julia> enc = OneHotEncoder(handle_missing = :all_zero)

# After some steps ...

(name__a = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
 name__b = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
 name__c = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
```

However, note that this means we cannot have drop_last=true in this case, because then we can't distinguish missing from the last class. So clean! needs to check this: I suggest that if handle_missing == :all_zero and drop_last is true, then clean! changes drop_last to false and issues a warning.
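The ambiguity is easy to see in a plain-Julia toy (`levels_kept` and `encode` are illustrative names, not MLJModels code):

```julia
# With drop_last=true the last level "c" has no column and is encoded as
# the all-zero row, so an :all_zero encoding of missing collides with it:
levels_kept = ["a", "b"]    # "c" dropped by drop_last=true

encode(x) = ismissing(x) ? zeros(length(levels_kept)) :
            Float64.(levels_kept .== x)

encode("c")      # [0.0, 0.0]
encode(missing)  # [0.0, 0.0] -- indistinguishable from "c"
```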


olivierlabayle commented 2 years ago

Thank you for adding the support for propagating missing values! I think I have identified a bug if the first value in a vector is missing:

```julia
using MLJModels, CategoricalArrays, MLJBase
X = (x = categorical([missing, 1, 2, 1]),)
t = OneHotEncoder(drop_last = true)
f, _, report = MLJBase.fit(t, 1, X)
```

This is due to this line. I think replacing it with classes(col) should work?
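In spirit, this is the classic first-element trap. A plain-Julia analogue (not the actual MLJModels internals, where the proposed fix is to call classes(col) on the whole column; the helper names below are invented):

```julia
# Plain-Julia analogue of the reported bug: deriving the level pool from
# the first element breaks when that element is missing, whereas deriving
# it from the whole column (as classes(col) would) is robust.
col = [missing, 1, 2, 1]

pool_from_element(x) =
    ismissing(x) ? error("cannot derive a level pool from missing") : [x]
pool_from_column(v) = sort(unique(skipmissing(v)))

pool_from_column(col)        # → [1, 2]
# pool_from_element(col[1])  # throws, mirroring the reported failure
```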

ablaom commented 2 years ago

Yes, great catch, that's a bug: https://github.com/JuliaAI/MLJModels.jl/issues/467

Are you willing and able to make a PR with a test?

olivierlabayle commented 2 years ago

I can give it a try, if it's as easy as my suggestion. Can you grant me access to the repo?

ablaom commented 2 years ago

Done. You have an invitation to accept.

olivierlabayle commented 2 years ago

https://github.com/JuliaAI/MLJModels.jl/pull/468