automl / Auto-PyTorch

Automatic architecture search and hyperparameter optimization for PyTorch
Apache License 2.0

Unexpected behavior in tabular feature transformation #294

Closed: nabenabe0928 closed this issue 2 years ago

nabenabe0928 commented 3 years ago

There are two unexpected behaviors in tabular_feature_validator.py:

  1. 0 in a categorical column is treated as NaN when the column contains NaN
  2. A TypeError caused by a bug in sklearn

For both behaviors, I used the following test function:

from autoPyTorch.data.tabular_feature_validator import TabularFeatureValidator
import pandas as pd
import numpy as np

def test(rows):
    # Build a categorical DataFrame, fit the validator on it,
    # and print the result of transforming the same data.
    df = pd.DataFrame(rows, dtype='category')
    validator = TabularFeatureValidator()
    validator.fit(df)
    transformed_df = validator.transform(df)
    print(transformed_df)

The first issue is reproduced by:

rows = [
    {'A': np.nan, 'B': 1},
    {'A': np.nan, 'B': 0},
    {'A': np.nan, 'B': np.nan}
]
test(rows)

### Out ###
[[0. 1.]
 [1. 0.]
 [1. 0.]]

### Expected ###
[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]
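
A minimal standalone sketch (plain scikit-learn, not Auto-PyTorch's actual preprocessing) showing one way to keep 0 and NaN as separate categories: impute a sentinel value before one-hot encoding. The sentinel -1 and the use of OneHotEncoder are illustrative assumptions, and only column B is used here; the all-NaN column A is omitted.

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'B': [1, 0, np.nan]})

# Replace NaN with a sentinel so it becomes its own category,
# then one-hot encode; 0 is no longer collapsed into NaN.
encoder = OneHotEncoder(handle_unknown='ignore')
print(encoder.fit_transform(df.fillna(-1)).toarray())
# One column each for -1 (missing), 0 and 1:
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]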

The second issue is reproduced by:

rows = [
    {'A': np.nan, 'B': np.nan},
    {'A': 1, 'B': 1},
    {'A': np.nan, 'B': np.nan}
]
test(rows)

### Out ###
TypeError

### Expected ###
[[1. 0. 1. 0.]
 [0. 1. 0. 1.]
 [1. 0. 1. 0.]]
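
Applying the same sentinel-impute-then-encode sketch to this example (again outside of TabularFeatureValidator, with the sentinel -1 as an assumption) produces the expected four-column matrix without raising an error:

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([
    {'A': np.nan, 'B': np.nan},
    {'A': 1, 'B': 1},
    {'A': np.nan, 'B': np.nan},
])

# Each column gets two categories (-1 for missing and 1), giving four
# one-hot columns in total, matching the expected output above.
encoder = OneHotEncoder(handle_unknown='ignore')
print(encoder.fit_transform(df.fillna(-1)).toarray())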

ravinkohli commented 2 years ago

For the record, this issue occurs on the refactor_development_regularization_cocktails branch.

ravinkohli commented 2 years ago

Hey, due to recent changes in the preprocessing logic, this issue is no longer relevant. autoPyTorch now detects all-NaN columns (all_nan_columns) and converts them to numerical so that they are handled later in the pipeline, and the encoding has been shifted back to ordinal.
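
A rough sketch of the idea described here, not the actual Auto-PyTorch implementation: all-NaN columns are flagged so they can be treated as numerical later in the pipeline, and the remaining categorical columns are ordinally encoded. The variable names and the NaN sentinel below are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame([
    {'A': np.nan, 'B': 1},
    {'A': np.nan, 'B': 0},
    {'A': np.nan, 'B': np.nan},
])

# Columns containing only NaN carry no categorical information;
# flag them to be handled as numerical features later on.
all_nan_columns = [col for col in df.columns if df[col].isna().all()]
categorical_columns = [col for col in df.columns if col not in all_nan_columns]

# Ordinally encode the remaining categorical columns, with NaN mapped
# to a sentinel value before encoding.
encoder = OrdinalEncoder()
print(all_nan_columns)  # ['A']
print(encoder.fit_transform(df[categorical_columns].fillna(-1)))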