Unexpected behaviour during imputations with all NA features

david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)

https://isotree.readthedocs.io

BSD 2-Clause "Simplified" License

186 stars, 38 forks

Unexpected behaviour during imputations with all NA features #52

Closed. tufanbt closed this issue 1 year ago

tufanbt commented 1 year ago

While using IsolationForest for imputation, I have a feature that is all NA in the training data, so no imputation should be possible for it. The test dataset does not overlap with the training data and is also all NA for that feature. Yet after transform, the imputed test dataset contains mostly zeros (~93%) and some NA values for that feature. I could not replicate the issue with a smaller dataset, but maybe this description can help detect the problem. For reference, my training and test datasets have shape (400000, 1000), and there are 3 categorical features with 10 to 40 levels. To sum up, IsolationForest's transform method introduces zeros into "un-imputable" features.
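One way to quantify the symptom described above is to list the columns that are entirely NA in the training frame and measure what the imputed output contains in those columns. The following is a minimal sketch using only pandas; the helper name `summarize_unimputable` and the toy frames are hypothetical stand-ins for the real (400000, 1000) data, and the "imputed" frame is hand-written to resemble the report rather than produced by isotree.

```python
import numpy as np
import pandas as pd


def summarize_unimputable(train: pd.DataFrame, imputed: pd.DataFrame) -> pd.DataFrame:
    """For columns that are entirely NA in `train`, report what `imputed` holds."""
    all_na_cols = train.columns[train.isna().all(axis=0)]
    rows = []
    for col in all_na_cols:
        values = imputed[col]
        rows.append({
            "column": col,
            "frac_na": float(values.isna().mean()),       # share still missing
            "frac_zero": float((values == 0).mean()),     # share filled with 0
        })
    return pd.DataFrame(rows, columns=["column", "frac_na", "frac_zero"])


# Toy training data: column "b" is all NA, so nothing can be learned for it.
train = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [np.nan, np.nan, np.nan]})

# Hypothetical imputed output that mimics the reported symptom (zeros in "b").
imputed_test = pd.DataFrame({"a": [1.0, 1.5, 2.0], "b": [0.0, 0.0, np.nan]})

report = summarize_unimputable(train, imputed_test)
print(report)
```

For a correctly behaving imputer, one would expect `frac_na` to be 1.0 and `frac_zero` to be 0.0 for every column listed.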

david-cortes commented 1 year ago

Thanks for the bug report. A couple of questions:

In which kind of columns (dense, sparse, categorical) are you seeing these zeros? Are you passing the data as data frames or as matrices?
Are you by any chance using a cross-validation scheme that would leave some fold in which every column has either only missing values or only one unique non-missing value?
Are you using the R or the Python package?
What kind of hyperparameters are you passing? (particularly for ndim).
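The second question above, about columns that have only missing values or a single unique non-missing value, can be checked directly on a data frame. This is a rough sketch with pandas; `degenerate_columns` and the example `fold` frame are illustrative names, not part of isotree.

```python
import numpy as np
import pandas as pd


def degenerate_columns(df: pd.DataFrame) -> dict:
    """Split out columns with zero or exactly one distinct non-missing value."""
    n_unique = df.nunique(dropna=True)  # per-column count, ignoring NA
    return {
        "all_missing": list(n_unique.index[n_unique == 0]),
        "single_value": list(n_unique.index[n_unique == 1]),
    }


# Example fold: "x" is all NA, "y" has one distinct value, "z" is ordinary.
fold = pd.DataFrame({
    "x": [np.nan, np.nan, np.nan],
    "y": [3.0, 3.0, np.nan],
    "z": [1.0, 2.0, 3.0],
})

result = degenerate_columns(fold)
print(result)
```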

tufanbt commented 1 year ago

My input data is pandas DataFrame.
All my columns are dense in the sense that I do not use sparse matrices etc., but the columns where I see zeros are all NA in both the train and test datasets.
I do not use any cross-validation scheme, and there are no all-NA observations or anything similar, since I know some features do not contain any NAs.
I am using the Python package.
Here are the parameters: build_imputer=True, min_imp_obs=1, max_depth=None, min_gain=0.25, sample_size=0.5, ntrees=100, ndim=2, prob_pick_pooled_gain=1, ntry=10
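For anyone trying to reproduce this, the setup described above might look roughly like the following. The hyperparameter values are copied from the list in this comment; the wrapper `fit_and_impute` is a hypothetical helper, and it assumes the usual fit-then-transform workflow on pandas DataFrames that the report refers to.

```python
# Hyperparameters exactly as listed in the comment above.
PARAMS = dict(
    build_imputer=True,
    min_imp_obs=1,
    max_depth=None,
    min_gain=0.25,
    sample_size=0.5,
    ntrees=100,
    ndim=2,
    prob_pick_pooled_gain=1,
    ntry=10,
)


def fit_and_impute(train, test, params=PARAMS):
    """Fit on `train`, then impute missing values in `test` (needs isotree)."""
    # Imported lazily so that PARAMS can be inspected without isotree installed.
    from isotree import IsolationForest

    model = IsolationForest(**params)
    model.fit(train)
    return model.transform(test)
```

Calling `fit_and_impute(train_df, test_df)` on frames that share an all-NA column would then be the step where the zeros were observed.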

david-cortes commented 1 year ago

Thanks for the information.

Quick questions:

Does this still happen with the latest version of this library (0.5.20.post3)?
If so, do you at any point observe some message like in your other bug report about double-free, memory corruption, stack invalidation, etc.? (if you are using jupyter-notebook, these might appear in the terminal that launched jupyter-notebook rather than in the output cells).
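To answer the version question, the installed version can be read from the package metadata without importing the library itself. A small sketch using only the standard library; `installed_version` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str = "isotree"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


current = installed_version()
print(current)  # compare against the version mentioned above, 0.5.20.post3
```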

david-cortes commented 1 year ago

Yet another question: does your data by some chance have duplicated column names?
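Duplicated column names are quick to rule out with pandas. A minimal sketch, where `duplicated_column_names` and the example frame are hypothetical:

```python
import pandas as pd


def duplicated_column_names(df: pd.DataFrame) -> list:
    """Return the sorted set of column names that occur more than once."""
    mask = df.columns.duplicated(keep=False)  # True for every repeated name
    return sorted(set(df.columns[mask]))


# Example frame in which the name "f1" appears twice.
df = pd.DataFrame([[1, 2, 3]], columns=["f1", "f2", "f1"])

dupes = duplicated_column_names(df)
print(dupes)
```

An empty list means every column name is unique.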

david-cortes commented 1 year ago

Actually, it turns out there was indeed an issue with categorical columns being imputed with zeros in some cases when they are all missing. I've pushed a small update that should fix it. Could you give it a try and see if you still experience the same issue?

pip install -U git+https://github.com/david-cortes/isotree.git

david-cortes commented 1 year ago

I've also pushed another fix, for numerical columns being imputed with zeros when they are all-missing.

tufanbt commented 1 year ago

Thanks for all your efforts! Your fixes worked: I now see all NA values for the features that were all NA in the training set. I am closing this issue.