david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

sample_size in isolation.forest #64

Closed franci2312 closed 1 month ago

franci2312 commented 1 month ago

Hello,

Unfortunately, when using isolation.forest, the following warning is always returned:

'sample_size' is set to the maximum when producing scores while fitting a model

This occurs both when I specify sample_size explicitly and when I leave it unset with a dataset that has significantly more than 10000 rows.

Thank you for your attention to this matter.

david-cortes commented 1 month ago

Thanks for raising this issue.

Just to be sure: is this about the R function isolation.forest? If so, are you able to provide an example where this happens?

franci2312 commented 1 month ago

Yes, it is about the R function isolation.forest. For instance, both isotree::isolation.forest(df, sample_size = 2048, ntrees = 200, output_score = T) and isotree::isolation.forest(df, ntrees = 200, output_score = T) produce the warning message, and nrow(df) is used as the sample size, even though my dataset contains 60000 rows.

This does not happen when output_score = F.

david-cortes commented 1 month ago

That is the expected behavior according to the docs: [image]

If you want to get predictions from a sub-sampled model, you'll first need to build the model with output_score=FALSE and then call predict on it.
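
A minimal sketch of that two-step workflow, assuming a synthetic data frame in place of the reporter's data (the contents of df and the names model and scores are illustrative, not taken from the issue):

```r
library(isotree)

set.seed(1)
# Hypothetical stand-in for the 60000-row dataset mentioned above
df <- data.frame(x = rnorm(60000), y = rnorm(60000))

# Step 1: fit with sub-sampling. With output_score = FALSE the requested
# sample_size is kept rather than being raised to nrow(df).
model <- isolation.forest(df, sample_size = 2048, ntrees = 200,
                          output_score = FALSE)

# Step 2: score the same data with a separate predict call
scores <- predict(model, df)
```

Here scores should hold one outlier score per row of df, computed from trees that were each built on 2048 sampled rows.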