david-cortes / isotree

(Python, R, C/C++) Isolation Forest and variations such as SCiForest and EIF, with some additions (outlier detection + similarity + NA imputation)
https://isotree.readthedocs.io
BSD 2-Clause "Simplified" License

Not able to run a loop in parallel using joblib #31

Closed by vsahil 3 years ago

vsahil commented 3 years ago

Thank you for creating such a fantastic repository, which is so easily accessible and has such great documentation. I have a question. I have a for loop in which I do predictions using the trained isolation forest. When I was using a different anomaly detection approach, I was able to run that loop in parallel using joblib, but since I switched to isotree, the parallelization doesn't happen. When I execute it with n_jobs > 1, it just hangs and does nothing; it only runs when n_jobs = 1.

Any clue why this is happening and how can we parallelize that loop when using isotree?

david-cortes commented 3 years ago

Joblib has different parallelization strategies. I think the default one uses pickle to serialize the objects passed between processes, and if the models are big, that step will take a long time. Try using a different parallelization strategy, ideally one involving process forking if you are on an OS other than Windows.

Besides that, isotree already parallelizes the predictions, so you will only slow it down by using joblib. If you really want to use it with joblib, you'd also have to set nthreads=1 to avoid nested parallelism.
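As a rough illustration of the advice above, here is a minimal sketch that runs a prediction loop through joblib without pickling the model. It swaps in joblib's thread-based strategy (`prefer="threads"`) instead of process forking, since threads share the fitted model in memory. The helper name `predict_chunks`, the `demo` function, and the random data are hypothetical, made up for this example. Whether threads actually speed things up depends on the model's `predict` releasing the GIL, which is an assumption not confirmed in this thread.

```python
import numpy as np
from joblib import Parallel, delayed


def predict_chunks(model, chunks, n_jobs=4):
    """Call model.predict on each chunk, one chunk per joblib task.

    Hypothetical helper: works with any object exposing predict().
    """
    # prefer="threads" asks joblib for a thread-based backend, so the
    # fitted model is shared in memory rather than pickled per worker.
    return Parallel(n_jobs=n_jobs, prefer="threads")(
        delayed(model.predict)(chunk) for chunk in chunks
    )


def demo():
    # isotree is imported here so the helper above works without it.
    from isotree import IsolationForest

    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 5))  # made-up data for illustration

    # nthreads=1 disables isotree's own threading, so that the joblib
    # tasks do not create nested parallelism.
    model = IsolationForest(ntrees=100, nthreads=1).fit(X)

    chunks = np.array_split(X, 8)
    # Results come back in the same order as the input chunks.
    return np.concatenate(predict_chunks(model, chunks))
```

Calling `demo()` returns one outlier score per row of `X`. As noted above, a single `model.predict(X)` call with the default `nthreads` may well be faster than this, so it is worth timing both before committing to the joblib version.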

vsahil commented 3 years ago

Okay, that makes a lot of sense. Thank you. Closing this issue.