bacpop / PopPUNK

PopPUNK 👨‍🎤 (POPulation Partitioning Using Nucleotide Kmers)
https://www.bacpop.org/poppunk
Apache License 2.0

DBSCAN caching with multiple threads #19

Closed · nickjcroucher closed this 6 years ago

nickjcroucher commented 6 years ago

There seems to be an issue with DBSCAN fitting that only arises when multithreaded. It looks to be a problem at the interface between joblib and the multiprocessing package:

```
PopPUNK (POPulation Partitioning Using Nucleotide Kmers)
Mode: Fitting model to reference database

Traceback (most recent call last):
  File "./PopPUNK/poppunk-runner.py", line 9, in <module>
    main()
  File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/__main__.py", line 248, in main
    assignments = model.fit(distMat, args.D, args.min_cluster_prop, args.threads)
  File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/models.py", line 316, in fit
    self.hdb, self.labels, self.n_clusters = fitDbScan(self.subsampled_X, self.outPrefix, min_samples, min_cluster_size, cache_out, threads)
  File "/lustre/scratch118/infgen/team81/nc3/GPS/ST_core/PopPUNK/PopPUNK/dbscan.py", line 50, in fitDbScan
    ).fit(X)
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 851, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 546, in hdbscan
    core_dist_n_jobs, **kwargs)
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/hdbscan/hdbscan_.py", line 285, in _hdbscan_boruvka_balltree
    n_jobs=core_dist_n_jobs, **kwargs)
  File "hdbscan/_hdbscan_boruvka.pyx", line 984, in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 1015, in hdbscan._hdbscan_boruvka.BallTreeBoruvkaAlgorithm._compute_bounds
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 789, in __call__
    self.retrieve()
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 699, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/lustre/scratch118/infgen/team81/nc3/nc3/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[(array([[0.00000000e+00, 4.03246668e-04, 4.05466814e-04, ..., 3.45438025e-02, 3.45439840e-02, 3.45456626e-02],
       [0.00000000e+00, 1.74505436e-04, 2.04864122e-04, ..., 2.66212775e-02, 2.66223376e-02, 2.66228534e-02],
       [0.00000000e+00, 2.33849996e-04, 2.44379060e-04, ..., 3.66543389e-02, 3.66559647e-02, 3.66573385e-02],
       ...,
       [0.00000000e+00, 7.88384237e-05, 1.32052839e-04, ..., 4.68294336e-02, 4.68303438e-02, 4.68309294e-02],
       [0.00000000e+00, 1.04485943e-04, 2.06512190e-04, ..., 2.64343423e-02, 2.64372834e-02, 2.64386719e-02],
       [0.00000000e+00, 1.87643709e-04, 2.02630717e-04, ..., 2.65259452e-02, 2.65293704e-02, 2.65309182e-02]]), array([[    0, 21411, 61521, ..., 74665, 33889, 25600],
       [    1, 89127, 69051, ..., 21044, 27497, 84593],
       [    2, 41269, 85304, ...,  4793, 61086, 11021],
       ...,
       [24997,  7094, 57682, ..., 26199, 13061, 51331],
       [24998, 42754, 77802, ..., 96494,  7710,  7146],
       [24999, 77949, 18152, ..., 14254, 39465, 95775]]))]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647",)'
```
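For reference, the failing call amounts to something like the sketch below. This is a hypothetical reconstruction from the traceback, not the actual `fitDbScan` body; the function name, algorithm choice, and return values are assumptions:

```python
import hdbscan

def fit_dbscan_sketch(X, cache_dir, min_samples, min_cluster_size, threads):
    """Rough stand-in for the fitDbScan call seen in the traceback."""
    hdb = hdbscan.HDBSCAN(
        algorithm='boruvka_balltree',  # the Boruvka ball-tree path in the trace
        min_samples=min_samples,
        min_cluster_size=min_cluster_size,
        memory=cache_dir,              # joblib caching of intermediate results
        core_dist_n_jobs=threads,      # parallel core distances: the failing step
    ).fit(X)
    return hdb, hdb.labels_, hdb.labels_.max() + 1
```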

johnlees commented 6 years ago

I suspect this is a limitation of the HDBSCAN code when there are a large number of samples and multiprocessing is used, based on reading this: https://stackoverflow.com/questions/47776486/python-struct-error-i-format-requires-2147483648-number-2147483647
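The underlying limit is easy to demonstrate: on this Python version, multiprocessing frames each worker result with a 4-byte length header, so any result whose pickle exceeds 2^31 - 1 bytes cannot be sent back (the size below is illustrative):

```python
import struct

struct.pack('i', 2**31 - 1)    # OK: largest payload length the header can encode
struct.pack('i', 3 * 1024**3)  # a ~3 GiB pickled (distances, indices) result:
# struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```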

Was this with the default 10^5 subsamples? Perhaps DBSCAN can also give a good fit with fewer samples than that. It may also be that `core_dist_n_jobs` doesn't give much of a speedup (I'm not sure how much of the algorithm is spent on this step), in which case we should just remove it.
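One way to check whether the parallel core-distance step pays off, sketched on synthetic points; the sizes mirror the default subsample, but the data and parameter values here are placeholders, so timings will differ on real distances:

```python
import time
import numpy as np
import hdbscan

X = np.random.rand(100000, 2)  # stand-in for 10^5 subsampled core/accessory distances
for n_jobs in (1, 4):
    start = time.time()
    hdbscan.HDBSCAN(min_samples=10, min_cluster_size=100,
                    core_dist_n_jobs=n_jobs).fit(X)
    print("core_dist_n_jobs=%d: %.0fs" % (n_jobs, time.time() - start))
```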

nickjcroucher commented 6 years ago

Yes, this was run with the default number of subsamples. The issue you point out seems pretty fundamental. We could keep the caching for single-threaded runs and turn it off when multithreading; it might be worth profiling whether multiple threads give any benefit in the latter case.
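A minimal sketch of that compromise, assuming the `fitDbScan` variables from the traceback (`cache_out` as the cache directory, `threads` as the requested thread count) and the joblib API as vendored in sklearn at the time:

```python
import hdbscan
from sklearn.externals.joblib import Memory

# Cache only when single-threaded; Memory(cachedir=None) disables caching
memory = Memory(cachedir=cache_out if threads == 1 else None, verbose=0)
hdb = hdbscan.HDBSCAN(
    min_samples=min_samples,
    min_cluster_size=min_cluster_size,
    memory=memory,
    core_dist_n_jobs=threads,
).fit(X)
```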

johnlees commented 6 years ago

Using the E. coli dataset (1509 strains, fitting 10^5 points), the fit took 6 min 29 s on either 1 or 4 threads. Based on this I would suggest just removing the multithreading. Or was this an issue specifically when the cache is used and multiple iterations of HDBSCAN are run?

johnlees commented 6 years ago

Also, this made me spot an issue with the ref clustering not printing, which I've just fixed in 3c625463a5c35d289b604139f7265237c7392e31 - worth pulling.

nickjcroucher commented 6 years ago

Great, I've started a run with a manual parameter set. It would be worth checking that a start_s parameter has been provided there - some idiot might only provide the coordinates of the means and get confused by the error messages...
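Something like the guard below in the manual-fit path would make that failure explicit. This is a hypothetical check; the parameter name follows the comment above, not a confirmed file format:

```python
def check_manual_start(start_params):
    """Fail early if a manual parameter set only gives the means (hypothetical helper)."""
    if 'start_s' not in start_params:
        raise RuntimeError("Manual parameter set is missing start_s; "
                           "supplying only the mean coordinates is not enough")
```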