TutteInstitute / fast_hdbscan

A fast multi-core implementation of HDBSCAN for low dimensional Euclidean spaces
BSD 2-Clause "Simplified" License

Process gets killed over 50M points #18

Open mrp3anut opened 6 months ago

mrp3anut commented 6 months ago

I am facing an issue where my kernel/Python process dies if I try to cluster more than 50M points. Weirdly, it doesn't seem to be a memory issue: I monitored memory usage throughout and it never exceeded 50%. Because the process is killed, I also don't get a stack trace I can provide. Any ideas?

lmcinnes commented 6 months ago

That is weird, and I'm not sure what the best way to debug that is -- potentially it is dying inside of the compiled numba code and that's why it just fails out. In an ideal world you might be able to try running it with numba off (set the environment variable NUMBA_DISABLE_JIT to 1), but potentially it will be so slow at that point that you won't get to your crash. It might be worth a try anyway, just in case.
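A minimal sketch of that suggestion, in case it helps: numba reads NUMBA_DISABLE_JIT when it is first imported, so the simplest approach is to launch the clustering script in a child process with the variable already set. The helper name `run_without_jit` and the script name in the usage note are hypothetical, not part of fast_hdbscan or numba.

```python
import os
import subprocess
import sys


def run_without_jit(args):
    """Run a Python command line in a child process with numba's JIT disabled.

    `args` are the arguments passed to the interpreter, e.g. a script path.
    """
    env = dict(os.environ)
    # numba picks this up at import time, so it must be set before the
    # child process imports numba (or anything that imports numba).
    env["NUMBA_DISABLE_JIT"] = "1"
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=True,
        text=True,
    )
```

Something like `run_without_jit(["cluster_script.py"])` would then run the (hypothetical) script in pure Python mode, and any Python-level exception should show up in the returned `stderr`.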

mrp3anut commented 6 months ago

So I ran the code as a script and got exit code 139. That is a segmentation fault (128 + 11, i.e. SIGSEGV), which is probably related to the numba part, as you mentioned. Maybe it is due to some fixed variable types or limits?
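For reference, this is how the exit code can be decoded. The sketch below assumes a POSIX system, where a shell reports death by signal N as 128 + N and Python's `subprocess` reports it as -N. The function name `describe_exit_code` is just for illustration.

```python
import signal


def describe_exit_code(code):
    """Return the signal name behind an exit status, or None if it is not one.

    Shells report "killed by signal N" as 128 + N; subprocess.run reports -N.
    """
    if code < 0:
        signum = -code
    elif code > 128:
        signum = code - 128
    else:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        # Not a signal number known on this platform.
        return None
```

Running the script with `python -X faulthandler` (the standard-library faulthandler module) should also print the Python-level traceback at the moment of the segfault, which can help narrow down which call triggers it.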

mrp3anut commented 6 months ago

So I managed to track down where the error occurs: the eom_recursion function in cluster_trees.py. My guess is that it is hitting a recursion limit. Switching to cluster_selection_method='leaf' seems to work, and increasing the recursion depth might also solve the issue for eom.
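A rough sketch of what "increasing recursion depth" could look like. As far as I understand, `sys.setrecursionlimit` only governs Python-level frames, while compiled code is bounded by the native stack of the thread it runs on, so this raises both by running the call in a worker thread with a larger stack. Whether this actually helps with eom_recursion is an untested assumption, and `run_with_big_stack` is a hypothetical helper, not something fast_hdbscan provides.

```python
import sys
import threading


def run_with_big_stack(func, *args, stack_mb=256, recursion_limit=1_000_000):
    """Call func(*args) in a thread with a larger stack and recursion limit."""
    result = {}

    def target():
        try:
            result["value"] = func(*args)
        except BaseException as exc:  # re-raised in the calling thread below
            result["error"] = exc

    old_limit = sys.getrecursionlimit()
    # threading.stack_size returns the previous setting (0 means default)
    # and applies the new size to threads created afterwards.
    old_stack = threading.stack_size(stack_mb * 1024 * 1024)
    sys.setrecursionlimit(recursion_limit)
    try:
        worker = threading.Thread(target=target)
        worker.start()
        worker.join()
    finally:
        sys.setrecursionlimit(old_limit)
        threading.stack_size(old_stack)

    if "error" in result:
        raise result["error"]
    return result["value"]
```

Usage would be along the lines of `labels = run_with_big_stack(clusterer.fit_predict, data)`, where `clusterer` and `data` are whatever is already in the script. For the main thread, the equivalent knob on Linux is the shell's `ulimit -s` setting.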

lmcinnes commented 6 months ago

Good catch -- yes, it is certainly possible that with that many points you are hitting a recursion limit. It might be internal to numba, however, which would make it harder to remedy. If raising the Python recursion limit isn't enough, I would suggest reaching out to the numba team (they are usually pretty responsive on Gitter) to see if they have ideas.

You can also try increasing min_cluster_size, which would simplify the tree a little.