david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Unexpected behaviour during imputations with all NA features #52

Closed. tufanbt closed this issue 1 year ago

tufanbt commented 1 year ago

While using IsolationForest for imputation, I get unexpected values for a feature that is all NA in the training data (so no imputation can be done for it). The transformed dataset (an imputed test dataset that does not overlap with the training data and is also all NA for that feature) contains mostly zeros (~93%) and some NA values for that feature. I could not replicate the issue with a smaller dataset, but maybe this description can help detect the problem. For reference, my training and test datasets have shape (400000, 1000), and there are 3 categorical features with 10 to 40 levels. To sum up, IsolationForest's transform method introduces zeros into "un-imputable" features.
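For anyone trying to confirm the same symptom on their own data, the check described above can be sketched without any dependencies. This is only an illustrative helper, not part of isotree: the names `is_missing`, `all_missing_columns`, and `summarize_imputed` are hypothetical, and the sketch assumes the data can be iterated as rows of values (nested lists, or a 2-D array) with missing entries stored as NaN or None.

```python
import math


def is_missing(value):
    """Return True for None or NaN; anything non-numeric counts as present."""
    if value is None:
        return True
    try:
        return math.isnan(value)
    except TypeError:
        return False  # e.g. a string category level


def all_missing_columns(rows):
    """Indices of the columns in which every value is missing."""
    n_cols = len(rows[0])
    return [j for j in range(n_cols) if all(is_missing(r[j]) for r in rows)]


def summarize_imputed(train_rows, imputed_rows):
    """For each column that is all-missing in training, report the fraction
    of missing values and of zeros found in the imputed output."""
    report = {}
    n_rows = len(imputed_rows)
    for j in all_missing_columns(train_rows):
        column = [r[j] for r in imputed_rows]
        n_missing = sum(is_missing(v) for v in column)
        n_zero = sum((not is_missing(v)) and v == 0 for v in column)
        report[j] = {"missing": n_missing / n_rows, "zero": n_zero / n_rows}
    return report


# Toy data (made up for illustration): column 1 is all-missing in training.
nan = float("nan")
train = [[1.0, nan], [2.0, nan]]
imputed = [[1.0, 0.0], [3.0, nan], [2.0, 0.0], [1.5, 0.0]]
report = summarize_imputed(train, imputed)
```

Here `imputed` stands in for whatever the transform step returned on the test set. A column that could not be imputed would be expected to show a missing fraction of 1.0 and a zero fraction of 0.0; a nonzero zero fraction for such a column is the symptom reported in this issue.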

david-cortes commented 1 year ago

Thanks for the bug report. Couple questions:

tufanbt commented 1 year ago

david-cortes commented 1 year ago

Thanks for the information.

Quick questions:

david-cortes commented 1 year ago

Yet another question: does your data by some chance have duplicated column names?

david-cortes commented 1 year ago

Actually, it turns out there was indeed an issue where categorical columns were imputed with zeros in some cases when they are all missing. I've pushed a small update that should fix it. Could you give it a try and see if you still experience the same issue?

pip install -U git+https://github.com/david-cortes/isotree.git

david-cortes commented 1 year ago

I've also added another fix for numerical columns being imputed with zeros when they are all missing.

tufanbt commented 1 year ago

Thanks for all your efforts! Your fixes worked: I now see all NA values for features that were all NA in the training set. I am closing this issue.