FluxML / MLJFlux.jl

Wrapping deep learning models from the package Flux.jl for use in the MLJ.jl toolbox
http://fluxml.ai/MLJFlux.jl/
MIT License

Ordinal encoding not working as expected #275

Open ablaom opened 1 week ago

ablaom commented 1 week ago

In stepping through fit for NeuralNetworkRegressor, using the data at the top of the test file regressors.jl, I am getting some unexpected behaviour.

Here is a minimal version of that data giving the same behaviour:

using MLJBase, MLJFlux, Tables

X = (;
    Column2 = categorical(repeat(['a', 'b', 'c'], 10)),
    Column3 = categorical(repeat(["b", "c", "d"], 10), ordered = true),
)
y = rand(Float32, 30)

schema(X)
# ┌─────────┬──────────────────┬──────────────────────────────────┐
# │ names   │ scitypes         │ types                            │
# ├─────────┼──────────────────┼──────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{Char, UInt32}   │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{String, UInt32} │
# └─────────┴──────────────────┴──────────────────────────────────┘

And the model:

model = NeuralNetworkRegressor()

Okay, now the following lines are copied from fit, as given in "src/mlj_model_interface.jl" on the dev branch:

# Get input properties
shape = MLJFlux.shape(model, X, y)
cat_inds = MLJFlux.get_cat_inds(X)
pure_continuous_input = isempty(cat_inds)

# Decide whether to enable entity embeddings (e.g., ImageClassifier won't)
enable_entity_embs = MLJFlux.is_embedding_enabled(model) && !pure_continuous_input

# Prepare entity embeddings inputs and encode X if entity embeddings enabled
featnames = []
if enable_entity_embs
    X = MLJFlux.convert_to_table(X)
    featnames = Tables.schema(X).names
end

# entityprops is (index = cat_inds[i], levels = num_levels[i], newdim = newdims[i])
# for each categorical feature
default_embedding_dims = enable_entity_embs ? model.embedding_dims : Dict{Symbol, Real}()
entityprops, entityemb_output_dim =
    MLJFlux.prepare_entityembs(X, featnames, cat_inds, default_embedding_dims)
X, ordinal_mappings = MLJFlux.ordinal_encoder_fit_transform(X; featinds = cat_inds)

At this point I expect X to have only Continuous scitypes (no more categoricals). However:

schema(X)
# ┌─────────┬──────────────────┬─────────────────────────────────────────┐
# │ names   │ scitypes         │ types                                   │
# ├─────────┼──────────────────┼─────────────────────────────────────────┤
# │ Column2 │ Multiclass{3}    │ CategoricalValue{AbstractFloat, UInt32} │
# │ Column3 │ OrderedFactor{3} │ CategoricalValue{AbstractFloat, UInt32} │
# └─────────┴──────────────────┴─────────────────────────────────────────┘

The raw element type is Float32, but the columns are getting wrapped as categorical vectors:

typeof(X.Column2)
CategoricalVector{AbstractFloat, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}} (alias for CategoricalArray{AbstractFloat, 1, UInt32, AbstractFloat, CategoricalValue{AbstractFloat, UInt32}, Union{}})
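
A quick check confirms the values really are plain floats underneath the wrapper (a sketch assuming CategoricalArrays is loaded; unwrap and elscitype are the standard accessors):

using CategoricalArrays

unwrap.(X.Column2)   # plain floats underneath, e.g. 1.0, 2.0, 3.0, ...
elscitype(X.Column2) # Multiclass{3}, not Continuous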
EssamWisam commented 1 week ago

Thanks for catching this. It's very likely because I use recode from CategoricalArrays during transform. That was intended to preserve the categorical type (which is useful in MLJTransforms, where the indices are kept as integers rather than floats).
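
For reference, a minimal sketch of the suspected behaviour (only CategoricalArrays is assumed): recode applied to a CategoricalArray returns another CategoricalArray, even when all the replacement values are floats.

using CategoricalArrays

v = categorical(['a', 'b', 'c', 'a'])
encoded = recode(v, 'a' => 1.0f0, 'b' => 2.0f0, 'c' => 3.0f0)
encoded isa CategoricalArray  # true: the categorical wrapper is preserved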

I will confirm this and try to implement ordinal_encoder_transform differently to fix it, but maybe not today.
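
One possible direction (a sketch only, using a hypothetical helper, not the package's actual code): map each level to its integer code and return a plain Vector{Float32}, so the categorical wrapper is dropped.

using CategoricalArrays

ordinal_encode(v::CategoricalVector) = Float32.(levelcode.(v))  # hypothetical helper

v = categorical(['a', 'b', 'c', 'a'])
ordinal_encode(v)  # 4-element Vector{Float32}: [1.0, 2.0, 3.0, 1.0]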