One hot encoding without categorical data

simonprovost commented 2 years ago

Describe the bug

My autoML workflow exports that he is performing one-hot encoding on my dataset, whereas we ensure that it is not categorical in before passing to the AutoML pipeline any dataset. So that we did binarise / one-hot encode it ourselves. How is this conceivable? Have you any hints? Perhaps our reading of the AutoML log is incorrect?

To Reproduce

Reproducing the behaviour consists of the following steps:

If you require the dataset urgently, I will have to take the time to completely anonymize it and retry the AutoML process if the error persists, before releasing the dataset to you here. That is personal information. I could not say much more than that, but it is a health-related issue.

Expected behavior

I would not like the AutoML workflow to use a single hot-encoding approach to my data, as each feature is binary.

Actual behavior, stacktrace or logfile

What motivates us to seek assistance is as follows:

data preprocessor:feature type:categorical transformer:categorical encoding: choice__': 'one hot encoding

Is that correct to claim that the workflow first applied a single hot technique to the data? If this is the case, we have a problem.

Environment and installation:

Please give details about your installation:

OS: Ubuntu 18 stable version
Is your installation in a virtual environment or conda environment?: No.
Python version: 3.8.12
Auto-sklearn version: 0.14.6

eddiebergman commented 2 years ago

Hi @simonprovost,

This is weird and something I'll investigate, I would have expected it to not perform any one hot encoding.

The dataset itself isn't neccessary but could I ask whether you are using a pandas dataframe or a numpy array with feat_types when passing them in to auto-sklearn?

Best, Eddie

simonprovost commented 2 years ago

Hi @eddiebergman ,

I appreciate your rapid response. Given your request, the following is my response:

I have printed out all of the feature types in my dataset, which indeed is a pandas.core.frame.Dataframe; all feature columns are of type float64; and the dtype is object.

I hope this helps. Kindly notify me if you want any other information. Cheers.

eddiebergman commented 2 years ago

Okay, that seems correct, I'll have to look into it but thanks for the report :)

simonprovost commented 2 years ago

That is excellent that it aided. I ran another analysis on a comparable dataset (same feature) but not on the same class variable, and lo and behold, I am having back the one hot encoded issue.

The best model's log:

{'model_number': 1422, 'loss': 0.25152998776009783, 'time(s)': 2.2693469524383545, 'experiment_time': 18000, 'params': {'balancing:strategy': 'none', 'classifier:__choice__': 'qda', 'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'fast_ica', 'classifier:qda:reg_param': 0.10022398190386521, 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'none', 'feature_preprocessor:fast_ica:algorithm': 'deflation', 'feature_preprocessor:fast_ica:fun': 'logcosh', 'feature_preprocessor:fast_ica:whiten': 'False', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.00598848554666527}, 'recall_score': 0.325, 'precision': 0.5, 'f1_score': 0.393939393939394, 'auroc': 0.7480769230769231, 'accuracy': 0.7959183673469388, 'error_rate': 20.408163265306122}

Would it be helpful if I am sharing with you the python type loop check I am performing on my data's features to ensure they are all binary typed correctly?

Cheers.

simonprovost commented 2 years ago

@eddiebergman I was curious as to if you discovered anything, by any luck? Were you be able to duplicate the issue? What may I possibly do to assist you with this?

I recently re-ran the analysis and, unfortunately, it occurred again; some were composed using a One Hot encoding method for the pre-processing phase.

Have a wonderful evening, Simon.

simonprovost commented 2 years ago

@eddiebergman I just verified that all of my datasets features types are well based float64, and they are. I am referring to datasets here in which the AutoSklearn system does not do one-hot encoding and others in which it does. Their type are exactly all the same. I feel missing something..

Cheers ✅

eddiebergman commented 2 years ago

Hi @simonprovost,

Sorry yes I did look into but I didn't reply so that's my bad. Essentially as far as and end user should be concerned, no one hot encoding happens as there are no categorical for it to encode. You data is not effected by any categorical transformers that appear, they are never applied.

The more detailed answer is that the search space is not reduced based on the types of columns introduced. As seen here, we basically define which kind of pipeline to apply to each type of column. These pipelines are then optimized for any hyperparameters they may have.

To find the hyperparameters for datapreprocessing, we query them to see if they have any.

To give some more complete information, here's the NumericalPreprocessingPipeline that handles numerical preprocessing (i.e. filling NaN's). You can scroll down to it's _get_pipeline_steps to see the steps involved and further look at them to see what hyperparameters the may have (example imputation which has only one hyperparameter.)

Now, if you're interested in how this categorical preprocessing pipeline is then effecting the search space, it's much the same process. Here are the steps for the CategoricalPreprocessingPipeline pipeline.

Out of those steps, only "category_coalescence" and "categorical_encoding" have hyperparameters. For OHEChoice), i traced it down to it having three hyperparameters, which are essentially the three components here, [encoding, no_encoding, one_hot_encoding]. So following the same pattern, I think "category_coalescene" follows the same pattern and has two hyperparameters [minorty_coalescence, no_coalescence].

Two points, OHEChoice is a bad name for the class as it could technically do OrdinalEncoding. Second, the optimization over head is relatively small. We use SMAC which is intelligent enough to pick up (given enough time) that these hyperparameters have little/no change associated with them. However it still is some overhead for the optimizer to learn that.

Sorry for the big info dump, it also serves as a future reference for when we have time to go back and fix it :) Hope it was informative.

I'll keep this open and labelled as a bug as it is a bug and has some potential performance implications.

Best, Eddie

simonprovost commented 2 years ago

Hi @eddiebergman ,

That is a fairly lengthy response. We appreciate it, and we wanted to thank you ✅

Does your response indicate that if I am certain that my input data does not contain category columns, and an OHE Choice with the component "One Hot Encoding" is passed, I should ignore it for the time being until the bug is fixed? Which also ensures that my data has not been subjected to any one hot encoding methods, is this correct? to double-check that we are on the same page.

Additionally, cheers I went through the components and their hyper parameters and was perplexed by the having "encoding" results sometimes which is simply nothing more than ordinal encoding. Additionally, may I request that I disregard any "categorical" change in the output of AutoSklearn if I am certain that the data provided does not contain categorical feature values?

Once again, we could not have asked for much better so thank you very much for the assistance !!

Best wishes, Simon.

eddiebergman commented 2 years ago

Yes, if you have no categorical data in your input then no categorical pre-processing will be applied. Even if it says it chose a categorical pre-processor, that choice means nothing as it can't apply it to anything.

The way to ensure your data is interpreted correctly is that:

If using np.ndarray data, then you have to manually specify with feat_types params to specify a categorical, otherwise we try use the dtype of the array which is almost certainly numeric.
If using a pandas dataframe, you can use the df.dtypes to check. We will treat "object", "string", "category" and "categorical" as categorical data.

I will note for any other readers in the future, we have some preliminary string processing so that note about "string" and "object" will change, however that's not relevant for this discussion.

Glad you found it helpful :)

Best, Eddie

automl / auto-sklearn