david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

memory issues with build_imputer=True #51

Closed · tufanbt closed this issue 1 year ago

tufanbt commented 1 year ago

I use IsolationForest for imputation in a for loop, with a sliding window over time series data. Here is a small code example:

import gc
from isotree import IsolationForest

for quarter_start in quarter_starts:
    # some code here
    imputer = IsolationForest(build_imputer=True, min_imp_obs=1, max_depth=None, min_gain=0.25,
                              sample_size=0.5, ntrees=100, ndim=2, prob_pick_pooled_gain=1, ntry=10)
    imputer.fit(subset_train[subset_train.columns[2:]])
    subset_imputed = imputer.transform(subset_test[subset_test.columns[2:]])
    # some more code here
    gc.collect()

My problem is that although I overwrite the imputer object in the for loop, memory usage keeps adding up with each iteration. Outside the for loop,

del imputer
gc.collect()

also does not free any memory. I am talking about ~400 GB of memory used for a single iteration, so I cannot think of any other source of the usage besides the imputer object. Is there any other way to make sure the memory is released?

Python version: 3.9.16
OS: Ubuntu 22.04.2 LTS
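
For reference, one way to confirm the per-iteration growth from inside the loop is to print the process's resident set size (RSS) at the end of each iteration; a minimal sketch, assuming psutil is installed:

import psutil

proc = psutil.Process()  # handle to the current process
# ... at the end of each loop iteration, after gc.collect():
print("RSS (GiB):", proc.memory_info().rss / 2**30)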

david-cortes commented 1 year ago

Thanks for the bug report. If you were using the version of this library that was uploaded yesterday to PyPI, there was a bad bug that would make code that builds more than one model either crash or produce near-random results. Could you confirm whether the issue is still present with the latest version of this library, uploaded today (0.5.20.post3)?

pip install isotree==0.5.20.post3

tufanbt commented 1 year ago

I found a workaround: a for loop in a bash script that invokes the Python script once for each iteration, instead of the for loop inside the Python script. As in the commit you referenced above, I got some interesting error messages at the end of each script run, although the code worked fine and printed and saved outputs as expected. The error messages mentioned things like a "double free" and a "segmentation fault (core dumped)". Each script built just one model; should I worry about the quality of its output (i.e., could those be near-random results)?

david-cortes commented 1 year ago

To clarify: are you still seeing these error messages with version 0.5.20.post3 (as opposed to 0.5.20.post2)?

Regarding the reliability of the models - if at any point you see a message about a double-free or memory corruption in any library, then yes, there is some chance that whatever the model outputs is effectively random noise.

tufanbt commented 1 year ago

For now, there do not seem to be any such error messages anymore. With the for loop, my memory usage swings between 580 and 620 GB during the iterations, whereas it was at ~35 GB just before starting the loop. So I am assuming the imputer part is eating up all that memory after training, given the size of my data and some categorical features. So here is the final question: how can I reset the memory usage of the IsolationForest instance without killing the kernel (in Jupyter) or ending the process (when running .py scripts)?

david-cortes commented 1 year ago

I don't know. This library uses Cython, which internally should call a method __dealloc__ at some point when the object is garbage collected. I think that is likely to happen when you delete the object and then call gc.collect(), but I am not sure whether it is guaranteed to happen every time you call the GC manually, whether it follows some other heuristic, or whether Cython itself holds off the call to __dealloc__ until later.
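
A quick way to check whether the Python-side object is actually being collected is to hold a weak reference to it; a minimal sketch, assuming the IsolationForest wrapper accepts weak references (plain Python classes do by default):

import gc
import weakref
from isotree import IsolationForest

model = IsolationForest(ntrees=10, build_imputer=True)
ref = weakref.ref(model)  # observes the object's lifetime without keeping it alive
del model
gc.collect()
print(ref() is None)  # True => the wrapper was collected, so any Cython __dealloc__ should have run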

tufanbt commented 1 year ago

Well then, my problem has nothing to do with this library, but maybe with its dependencies (as far as I understand). I am closing the issue.

tufanbt commented 1 year ago

Here is a more tangible problem: imputer.drop_imputer() kills the JupyterLab kernel without any message in the notebook or in the terminal running Jupyter. That may be related to my problem, as this is seemingly what the documentation suggests for clearing the imputer's memory usage. Reopening the issue.
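
For context, the sequence that triggers the crash is roughly the one below; drop_imputer() is the method the documentation points to for dropping the fitted imputer data once transform has been called:

imputer.fit(subset_train[subset_train.columns[2:]])
subset_imputed = imputer.transform(subset_test[subset_test.columns[2:]])
imputer.drop_imputer()  # intended to free the imputer sub-object; here it crashed the kernel instead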

david-cortes commented 1 year ago

Yes, thanks for pointing this out. Should be solved now:

pip install -U git+https://github.com/david-cortes/isotree.git

Also, if I understood your problem correctly, you are monitoring memory consumption through some external tool after fitting the model. In that case, even dropping the imputer might not show a large difference, since memory consumed and freed by a process is not released back to the OS in its entirety unless you are on a system like FreeBSD, or unless you are LD_PRELOAD'ing libjemalloc.so or similar. But in that case, the memory consumed by the process should not keep increasing as more objects are created, since it will reuse what it had previously requested.
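
For example, on Ubuntu the jemalloc route mentioned above would look roughly like the line below; the library path is an assumption (on Ubuntu 22.04 it typically comes from the libjemalloc2 package), and your_script.py stands for whatever script runs the loop:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python your_script.py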

tufanbt commented 1 year ago

I am using htop on a Linux machine. That's fine if it will reuse the memory as you suggested, and my recent experience validates your suggestion. Closing the issue (for good I hope 😄).