david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Pickling an extended model uses 30 GB of RAM #15

Closed: tmontana closed this issue 3 years ago

tmontana commented 3 years ago

Hi. Thanks for sharing this great library.

When I pickle.dump a trained model with 200 trees, memory usage explodes. For an extended forest model (trained on 1 million rows and 350 columns), pickling increases memory usage by over 30 GB, and the resulting file takes 5 GB on disk. Note that I am using pickle protocol 5 (with Python 3.8.5). Using an earlier protocol crashed my machine (90 GB of RAM) due to memory usage, so I was not able to pickle at all.

By contrast, pickling a SCiForest model does not increase memory usage significantly and takes up only 10% of the disk space.

Is that the expected behavior? Is there anything I can do to reduce the memory used? Many thanks.
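For anyone trying to reproduce this kind of report, here is a minimal sketch of how the peak memory of a pickle call could be compared across protocols using only the standard library. The helper name `peak_pickle_memory` and the stand-in object are made up for illustration. `tracemalloc` only sees allocations made through Python's allocator, so memory allocated natively inside a C++ extension would not show up here.

```python
import pickle
import tracemalloc


def peak_pickle_memory(obj, protocol):
    """Return (peak_bytes, payload_bytes) for pickle.dumps(obj, protocol).

    Hypothetical helper: only Python-level allocations made while
    pickling are counted, not memory the object already occupies.
    """
    tracemalloc.start()
    try:
        payload = pickle.dumps(obj, protocol=protocol)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, len(payload)


if __name__ == "__main__":
    # Stand-in object; swap in the fitted model to measure the real case.
    obj = [list(range(1000)) for _ in range(200)]
    for protocol in range(2, pickle.HIGHEST_PROTOCOL + 1):
        peak, size = peak_pickle_memory(obj, protocol)
        print(f"protocol {protocol}: peak {peak / 1e6:.1f} MB, "
              f"payload {size / 1e6:.1f} MB")
```

A peak far above the payload size would suggest that intermediate Python objects are being built during serialization.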

david-cortes commented 3 years ago

It uses Cython's auto-pickle functionality. I don't know the exact details of how it works, and am not sure whether that's expected. Nevertheless, there's also the option of using the package's own serialization functionality (export_model / import_model) with use_cpp=True, which should definitely not increase memory usage by that much.
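A rough sketch of what that round trip might look like follows. It assumes the model object has an `export_model(path)` method and that its class has an `import_model(path)` that returns a loaded model, which is how recent isotree releases are understood to behave. The `use_cpp` argument mentioned above belongs to the release current at the time of this thread and may not exist in yours, so check the signatures of your installed version. The helper `native_roundtrip` and the `demo` function are hypothetical names.

```python
import os
import tempfile


def native_roundtrip(model, path):
    """Save `model` with its own serializer and load a copy back.

    Assumption: `model.export_model(path)` writes the file and
    `type(model).import_model(path)` returns a loaded model.
    """
    model.export_model(path)
    size_bytes = os.path.getsize(path)
    restored = type(model).import_model(path)
    return restored, size_bytes


def demo():
    # Needs numpy and isotree; imported here so the helper above
    # stays usable without them.
    import numpy as np
    from isotree import IsolationForest

    X = np.random.default_rng(0).normal(size=(1000, 10))
    # ndim > 1 selects the extended model variant.
    model = IsolationForest(ndim=2, ntrees=20).fit(X)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.bin")
        restored, size_bytes = native_roundtrip(model, path)
        print(f"file size: {size_bytes / 1e6:.2f} MB")
        print("same scores:", np.allclose(model.predict(X), restored.predict(X)))
```

Calling `demo()` trains a small model, so it only illustrates the calls and says nothing about memory use at the scale described in this issue.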

tmontana commented 3 years ago

Indeed! Problem solved. It's also a lot faster. Thank you.