simonprovost closed this issue 2 years ago
Hi @simonprovost,
This is weird and something I'll investigate; I would have expected it not to perform any one-hot encoding. The dataset itself isn't necessary, but could I ask whether you are passing a pandas DataFrame or a numpy array with `feat_types` into auto-sklearn?
Best, Eddie
Hi @eddiebergman ,
I appreciate your rapid response. To answer your question: I have printed out all of the feature types in my dataset, which is indeed a `pandas.core.frame.DataFrame`; all feature columns are of type `float64`, and the `dtypes` Series itself has dtype `object`.
I hope this helps. Let me know if you need any other information. Cheers.
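For reference, the check I run is roughly along these lines (the column names here are just illustrative, not my real features):

```python
import pandas as pd

# Illustrative frame with two already-binarised features
df = pd.DataFrame({"feat_a": [0.0, 1.0, 1.0], "feat_b": [1.0, 0.0, 1.0]})

print(type(df))  # <class 'pandas.core.frame.DataFrame'>
for col in df.columns:
    # every feature column should be float64
    assert df[col].dtype == "float64", f"{col} is {df[col].dtype}"
print(df.dtypes.dtype)  # object -- the dtypes Series itself has dtype object
```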
Okay, that seems correct, I'll have to look into it but thanks for the report :)
Glad that helped. I ran another analysis on a comparable dataset (same features) but with a different class variable, and lo and behold, the one-hot encoding issue is back.
The best model's log:

```
{'model_number': 1422, 'loss': 0.25152998776009783, 'time(s)': 2.2693469524383545, 'experiment_time': 18000, 'params': {'balancing:strategy': 'none', 'classifier:__choice__': 'qda', 'data_preprocessor:__choice__': 'feature_type', 'feature_preprocessor:__choice__': 'fast_ica', 'classifier:qda:reg_param': 0.10022398190386521, 'data_preprocessor:feature_type:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'mean', 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'none', 'feature_preprocessor:fast_ica:algorithm': 'deflation', 'feature_preprocessor:fast_ica:fun': 'logcosh', 'feature_preprocessor:fast_ica:whiten': 'False', 'data_preprocessor:feature_type:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.00598848554666527}, 'recall_score': 0.325, 'precision': 0.5, 'f1_score': 0.393939393939394, 'auroc': 0.7480769230769231, 'accuracy': 0.7959183673469388, 'error_rate': 20.408163265306122}
```
Would it be helpful if I shared the Python type-check loop I perform on my data's features to ensure they are all correctly binary-typed?
Cheers.
@eddiebergman I was curious whether you have discovered anything, by any luck? Were you able to reproduce the issue? What can I possibly do to assist you with this?
I recently re-ran the analysis and, unfortunately, it occurred again; some models used a one-hot encoding method in the pre-processing phase.
Have a wonderful evening, Simon.
@eddiebergman I just verified that all of my datasets' feature types are indeed `float64`, and they are. I am referring both to datasets for which AutoSklearn does not report one-hot encoding and to those for which it does; their types are exactly the same. I feel I am missing something...
Cheers ✅
Hi @simonprovost,
Sorry, yes, I did look into it but didn't reply, so that's my bad. Essentially, as far as an end user should be concerned, no one-hot encoding happens, as there are no categoricals for it to encode. Your data is not affected by any categorical transformers that appear; they are never applied.
The more detailed answer is that the search space is not reduced based on the types of columns introduced. As seen here, we basically define which kind of pipeline to apply to each type of column. These pipelines are then optimized for any hyperparameters they may have.
To find the hyperparameters for datapreprocessing, we query them to see if they have any.
To give some more complete information, here's the `NumericalPreprocessingPipeline` that handles numerical preprocessing (i.e. filling NaNs). You can scroll down to its `_get_pipeline_steps` to see the steps involved, and further look at those to see what hyperparameters they may have (for example imputation, which has only one hyperparameter).
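As a rough sketch of what that numerical side amounts to (plain scikit-learn here, not auto-sklearn's actual classes; `strategy` is that one hyperparameter the optimizer sees for imputation):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Imputation: fill NaNs with the per-column mean (strategy="mean")
X = np.array([[0.0, 1.0],
              [np.nan, 0.0],
              [1.0, 1.0]])
X_filled = SimpleImputer(strategy="mean").fit_transform(X)
print(X_filled)  # the NaN becomes 0.5, the mean of column 0
```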
Now, if you're interested in how this categorical preprocessing pipeline affects the search space, it's much the same process. Here are the steps for the `CategoricalPreprocessingPipeline`.
Out of those steps, only `"category_coalescence"` and `"categorical_encoding"` have hyperparameters. For `OHEChoice`, I traced it down to having three hyperparameters, which are essentially the three components here: `[encoding, no_encoding, one_hot_encoding]`. Following the same pattern, I think `"category_coalescence"` likewise has two hyperparameters: `[minority_coalescer, no_coalescence]`.
Two points: first, `OHEChoice` is a bad name for the class, as it could technically do ordinal encoding. Second, the optimization overhead is relatively small. We use SMAC, which is intelligent enough to pick up (given enough time) that these hyperparameters have little/no effect on performance. However, it is still some overhead for the optimizer to learn that.
Sorry for the big info dump, it also serves as a future reference for when we have time to go back and fix it :) Hope it was informative.
I'll keep this open and labelled as a bug as it is a bug and has some potential performance implications.
Best, Eddie
Hi @eddiebergman ,
That is a fairly lengthy response; we appreciate it and wanted to thank you ✅
Does your response indicate that if I am certain my input data contains no categorical columns, and an `OHEChoice` with the component "one_hot_encoding" is reported, I should simply ignore it until the bug is fixed? And does that also mean my data has not been subjected to any one-hot encoding? Just to double-check that we are on the same page.
Additionally, I went through the components and their hyperparameters and was puzzled to sometimes see "encoding" in the results, which is simply nothing more than ordinal encoding. May I likewise disregard any "categorical" entry in AutoSklearn's output if I am certain that the data provided contains no categorical feature values?
Once again, we could not have asked for better, so thank you very much for the assistance!!
Best wishes, Simon.
Yes, if you have no categorical data in your input then no categorical pre-processing will be applied. Even if it says it chose a categorical pre-processor, that choice means nothing as it can't apply it to anything.
The way to ensure your data is interpreted correctly:

- If you pass `np.ndarray` data, you have to manually specify categoricals with the `feat_types` parameter; otherwise we use the `dtype` of the array, which is almost certainly numeric.
- If you pass a pandas DataFrame, use `df.dtypes` to check. We will treat `"object"`, `"string"`, `"category"` and `"categorical"` dtypes as categorical data.

I will note for any other readers in the future: we have some preliminary string processing, so the note about `"string"` and `"object"` will change; however, that's not relevant for this discussion.
Glad you found it helpful :)
Best, Eddie
Describe the bug
My AutoML workflow reports that it is performing one-hot encoding on my dataset, whereas we ensure that any dataset is not categorical before passing it to the AutoML pipeline: we binarised / one-hot encoded it ourselves beforehand. How is this possible? Do you have any hints? Perhaps our reading of the AutoML log is incorrect?
To Reproduce
Reproducing the behaviour consists of the following steps:
Expected behavior
I would not like the AutoML workflow to apply a one-hot encoding approach to my data, as each feature is already binary.
Actual behavior, stacktrace or logfile
What motivates us to seek assistance is the following: is it correct to conclude that the workflow first applied a one-hot encoding technique to the data? If this is the case, we have a problem.
Environment and installation:
Please give details about your installation: