I suspect this is a limitation of the HDBSCAN code when there is a large number of samples and multiprocessing is used, based on reading this: https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647
Was this with the default 10^5 subsamples? Perhaps DBSCAN can also give a good fit with fewer samples than this?
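If fewer samples do turn out to be enough, a minimal sketch of what that could look like (here `X` is assumed to be the array of distances being fitted, and the subsample size is an arbitrary choice, not the actual fitting code):

```python
import numpy as np
import hdbscan

# Hypothetical subsampling before the fit: drawing fewer than the
# default 10^5 points keeps the pickled payload smaller and may avoid
# the struct/pickle size limit hit under multiprocessing.
rng = np.random.default_rng(1)
n_subsample = 50000  # assumed value, below the default 10^5
idx = rng.choice(X.shape[0], size=min(n_subsample, X.shape[0]), replace=False)

clusterer = hdbscan.HDBSCAN(min_cluster_size=100)  # min_cluster_size is an assumed value
labels = clusterer.fit_predict(X[idx, :])
```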
It might be the case that `core_dist_n_jobs` doesn't give much of a speedup (I'm not sure how much of the algorithm's time is spent on this step), in which case we should just remove it.
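A quick way to check would be something like the following rough timing loop (`core_dist_n_jobs` is a real HDBSCAN parameter; `X` is again assumed to be the subsampled distance array):

```python
import time
import hdbscan

# Rough check of whether core_dist_n_jobs gives any speedup on this
# data by timing single-threaded against 4 threads.
for n_jobs in (1, 4):
    start = time.time()
    hdbscan.HDBSCAN(core_dist_n_jobs=n_jobs).fit(X)
    print(f"core_dist_n_jobs={n_jobs}: {time.time() - start:.1f}s")
```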
Yes, this was run with the default number of samples. The issue you point out seems pretty fundamental - we could keep the caching for single-threaded runs and turn it off for multithreaded ones. It might be worth profiling whether multiple threads give any benefit at all in that case.
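As a sketch of that split (the function and argument names here are assumptions, not the existing code; HDBSCAN's `memory` parameter does accept a joblib `Memory` object):

```python
import hdbscan
from joblib import Memory

def fit_dbscan(X, n_jobs=1, cache_dir="./hdbscan_cache"):
    # Keep the joblib cache only for single-threaded fits; skip it
    # when multithreaded to avoid the pickling size issue above.
    if n_jobs == 1:
        clusterer = hdbscan.HDBSCAN(memory=Memory(cache_dir, verbose=0),
                                    core_dist_n_jobs=1)
    else:
        clusterer = hdbscan.HDBSCAN(core_dist_n_jobs=n_jobs)
    return clusterer.fit(X)
```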
Using the E. coli dataset (1509 strains, fitting 10^5 points) took 6 min 29 s on either 1 or 4 threads, so I would suggest just removing the multithreading. Or was this an issue specifically when the cache is used and multiple iterations of HDBSCAN are run?
Also, this made me spot an issue with ref clustering not printing, which I just fixed in 3c625463a5c35d289b604139f7265237c7392e31 - worth pulling.
Great, I've started a run with a manual parameter set. It would be worth checking that a start_s parameter has been provided there - some idiot might provide only the coordinates of the means and get confused by the error messages...
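Something along these lines, say (a hypothetical sanity check - the function and the mean0/mean1 names are assumptions, only start_s is from the discussion above):

```python
def check_manual_start(mean0, mean1, start_s):
    # Fail early with a clear message if only the mean coordinates
    # were supplied, rather than letting a cryptic error surface later.
    if start_s is None:
        raise ValueError("A manual fit needs start_s as well as the "
                         "coordinates of the two means (mean0, mean1)")
```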
There seems to be an issue with DBSCAN fitting that only arises when multithreaded - it looks to be a problem with the interface between joblib and the multiprocessing package: