alteryx / evalml

EvalML is an AutoML library written in Python.
https://evalml.alteryx.com
BSD 3-Clause "New" or "Revised" License

Oversampler: Add nullable type handling for nullable y #3974

Closed tamargrey closed 1 year ago

tamargrey commented 1 year ago

The following code raises ValueError: Unknown label type: 'unknown'.

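    import woodwork as ww
    from evalml.pipelines.components import Oversampler

    # X_y_binary: binary classification (X, y) fixture from evalml's test suite
    X, y = X_y_binary
    # Give the target a nullable boolean logical type
    y = ww.init_series(y, logical_type="BooleanNullable")
    sn = Oversampler()
    _ = sn.fit_transform(X, y)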

This currently won't surface in AutoML search because of the replace nullable types component, but we should consider adding nullable type handling to the component class itself so that it can support nullable types independently.
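To make the idea of component-level handling concrete, here is a minimal sketch using only pandas. The helper name downcast_nullable_target is hypothetical and is not part of evalml; it only assumes that a nullable target with no missing values can be safely converted to its numpy-backed equivalent before being handed to the sampler.

```python
import pandas as pd


def downcast_nullable_target(y: pd.Series) -> pd.Series:
    """Hypothetical helper (not evalml API): convert a pandas nullable
    boolean or integer target to a numpy-backed dtype."""
    # A target that really contains missing values cannot be downcast safely.
    if y.isna().any():
        raise ValueError("Target contains missing values; cannot downcast.")
    # Nullable boolean ("boolean") -> numpy bool
    if isinstance(y.dtype, pd.BooleanDtype):
        return y.astype(bool)
    # Nullable integers ("Int64", "Int32", ...) -> numpy int64
    is_extension = isinstance(y.dtype, pd.api.extensions.ExtensionDtype)
    if is_extension and pd.api.types.is_integer_dtype(y.dtype):
        return y.astype("int64")
    # Anything else is returned unchanged.
    return y


y_nullable = pd.Series([True, False, True, False], dtype="boolean")
y_ready = downcast_nullable_target(y_nullable)
print(y_ready.dtype)
```

A component could run a conversion like this at the start of fit and transform, so it no longer depends on an upstream component having replaced the nullable types.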

Note: this is likely related to https://github.com/alteryx/evalml/issues/3923, https://github.com/alteryx/evalml/issues/3922, and https://github.com/alteryx/evalml/issues/3910, which all stem from the inability of sklearn's type_of_target to assign a proper type to nullable data.
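The type_of_target behavior can be checked in isolation, without evalml or woodwork. The sketch below compares a numpy-backed boolean target with its pandas nullable counterpart. The describe_target wrapper is a hypothetical helper added only so the script runs on any sklearn version; the result for the nullable series depends on the installed sklearn, so no output is claimed for it.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target


def describe_target(y):
    """Hypothetical helper: return sklearn's inferred target type, or the
    error message if inference raises on this sklearn version."""
    try:
        return type_of_target(y)
    except Exception as exc:
        return f"error: {exc}"


y_plain = pd.Series([True, False, True, False])  # numpy bool dtype
y_nullable = y_plain.astype("boolean")           # pandas nullable BooleanDtype

print(describe_target(y_plain))     # binary
print(describe_target(y_nullable))  # depends on the installed sklearn version
```

If the second call reports 'unknown', any downstream check that requires a classification target type will reject the data, which matches the error message reported above.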

tamargrey commented 1 year ago

Not fixed by updating to sklearn 1.2.1

tamargrey commented 1 year ago

As part of implementing component-specific handling for the Oversampler, we need to remove the nullable type logic in the BaseSampler's _prepare_data.

Also worth noting: that logic wasn't even maintaining the woodwork types, which caused us to rerun type inference. That is unnecessary computation, and it could cause a bug if we lost column types that were influencing the type of sampler we chose.

tamargrey commented 1 year ago

The Oversampler's nullable type incompatibility is fixed by upgrading to sklearn 1.2.2, but we should still remove the nullable type logic in _prepare_data, which is now doubly unnecessary.