anuradhawick / LRBinner

LRBinner is a long-read binning tool published in WABI 2021 proceedings and AMB.
https://doi.org/10.4230/LIPIcs.WABI.2021.11
GNU General Public License v2.0

"Clustering using HDBSCAN running" step dows not complete #5

Open crastr opened 2 years ago

crastr commented 2 years ago

Hi @anuradhawick!

We managed to launch LRBinner in Docker, but the "Clustering using HDBSCAN running" step failed with the following error.

docker run --rm -it --gpus '"device=3"' -v $(pwd):$(pwd) -u $(id -u):$(id -g) anuradhawick/lrbinner contigs -r $PWD/c1.fq -c $PWD/c1.fasta --k-size 4 --cuda --output $PWD/result

Output:

2021-12-10 17:35:54,303 - INFO - Command /usr/LRBinner/LRBinner contigs -r /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/c1.fq -c /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/c1.fasta --k-size 4 -t 40 --cuda --output /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/result --resume
2021-12-10 17:35:57,360 - INFO - CUDA found in system
2021-12-10 17:35:57,362 - INFO - Resuming the program from previous checkpoints
2021-12-10 17:35:57,363 - INFO - Loading contig lengths
2021-12-10 17:35:57,485 - INFO - Loading marker genes from previous computations
2021-12-10 17:38:00,783 - INFO - Contigs already split
2021-12-10 17:38:00,783 - INFO - 15-mer counting already performed
2021-12-10 17:38:00,783 - INFO - K-mer vectors already computed
2021-12-10 17:38:00,783 - INFO - Coverage vectors already computed
2021-12-10 17:38:01,196 - INFO - Numpy arrays already computed
2021-12-10 17:38:01,196 - INFO - VAE already trained
2021-12-10 17:38:01,248 - INFO - Clustering using HDBSCAN running
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/opt/conda/lib/python3.9/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/LRBinner/LRBinner", line 197, in <module>
    main()
  File "/usr/LRBinner/LRBinner", line 179, in main
    pipelines.run_contig_binning(args)
  File "/usr/LRBinner/mbcclr_utils/pipelines.py", line 242, in run_contig_binning
    cluster_utils.perform_contig_binning_HDBSCAN(
  File "/usr/LRBinner/mbcclr_utils/cluster_utils.py", line 494, in perform_contig_binning_HDBSCAN
    labels = HDBSCAN(min_cluster_size=250).fit_predict(latent)
  File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
    self.fit(X)
  File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
    (single_linkage_tree, result_min_span_tree) = memory.cache(
  File "/opt/conda/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 275, in _hdbscan_boruvka_kdtree
    alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
  File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/opt/conda/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/opt/conda/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/opt/conda/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

The same command works perfectly on 10% of the contigs and 10% of the reads.

A quick search suggested that the problem may be related to the number of rows (as in https://githubmemory.com/repo/scikit-learn/scikit-learn/issues/21228).
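For anyone hitting the same traceback: the "buffer source array is read-only" error arises because joblib can hand large arrays to worker processes as read-only memory maps, and the Cython memoryview in scikit-learn's KD-tree then requests write access while unpickling. A minimal pure-Python analogue of that buffer-level failure (illustrative only, not LRBinner code):

```python
# Analogue of "ValueError: buffer source array is read-only": joblib ships
# large arrays to workers as read-only buffers, and the Cython memoryview
# inside scikit-learn's KD-tree asks for write access when deserializing.
def try_write(view: memoryview) -> str:
    """Attempt a write through the view; report whether it was allowed."""
    try:
        view[0] = 1
        return "writable"
    except TypeError:
        return "read-only"

print(try_write(memoryview(bytes(8))))       # read-only buffer, like the memmap -> "read-only"
print(try_write(memoryview(bytearray(8))))   # writable buffer succeeds -> "writable"
```

This is why the run succeeds on 10% of the data: joblib only memmaps arrays above a size threshold, so small inputs never trigger the read-only hand-off.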

Thanks in advance! Alexey

anuradhawick commented 2 years ago

Thanks for raising the issue. I will look into this. I may update the Dockerfile accordingly and reply to you in a few days' time.

acalchera commented 2 years ago

Hi @anuradhawick!

We are having the same problem with the HDBSCAN clustering step, but we are using conda instead of Docker. Have you had time to look into the issue yet?

Thank you very much in advance! Anjuli

2022-02-10 14:29:40,021 - INFO - Command /home/woo/tools/LRBinner/LRBinner contigs --reads-path reads.fasta --bin-count 10 --bin-size 32 --output microbiome_bins --k-size 3 --ae-dims 4 --ae-epochs 200 --threads 20 --contigs scaffolds.1Kb.fa --resume
2022-02-10 14:29:40,035 - INFO - Resuming the program from previous checkpoints
2022-02-10 14:29:40,044 - INFO - Loading contig lengths
2022-02-10 14:29:40,199 - INFO - Loading marker genes from previous computations
2022-02-10 14:31:11,071 - INFO - Contigs already split
2022-02-10 14:31:11,071 - INFO - 15-mer counting already performed
2022-02-10 14:31:11,072 - INFO - K-mer vectors already computed
2022-02-10 14:31:11,072 - INFO - Coverage vectors already computed
2022-02-10 14:31:13,264 - INFO - Numpy arrays already computed
2022-02-10 14:31:13,264 - INFO - VAE already trained
2022-02-10 14:31:14,670 - INFO - Clustering using HDBSCAN running
joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
  File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/woo/tools/LRBinner/LRBinner", line 197, in <module>
    main()
  File "/home/woo/tools/LRBinner/LRBinner", line 179, in main
    pipelines.run_contig_binning(args)
  File "/home/woo/tools/LRBinner/mbcclr_utils/pipelines.py", line 243, in run_contig_binning
    output, fragment_parent, separate, contigs, threads)
  File "/home/woo/tools/LRBinner/mbcclr_utils/cluster_utils.py", line 494, in perform_contig_binning_HDBSCAN
    labels = HDBSCAN(min_cluster_size=250).fit_predict(latent)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
    self.fit(X)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 919, in fit
    self._min_spanning_tree) = hdbscan(X, **kwargs)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 615, in hdbscan
    core_dist_n_jobs, **kwargs)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 278, in _hdbscan_boruvka_kdtree
    n_jobs=core_dist_n_jobs, **kwargs)
  File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
  File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

anuradhawick commented 2 years ago

Hi Anjuli,

thanks for the issue.

Can you tell me how you installed the packages using conda? I need to know the command you used to install hdbscan.

Thanks.

acalchera commented 2 years ago

Hi @anuradhawick!

Thank you for your reply! We used the following commands, as suggested in the README.md:

conda create -n lrbinner -y python=3.7 numpy scipy seaborn h5py tabulate pytorch hdbscan gcc openmp tqdm biopython

conda activate lrbinner

git clone https://github.com/anuradhawick/LRBinner.git
cd LRBinner/
python setup.py build

Thank you very much for looking into it. We are eager to use your tool on our data!

Cheers, Anjuli

anuradhawick commented 2 years ago

Hi @4njul1 and @crastr,

Could you please try installing HDBSCAN using the following command:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

There are known issues in the conda build, and it is not the latest version.
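After reinstalling, it may help to confirm which hdbscan build Python actually imports. A small sketch using the standard library's importlib.metadata (requires Python 3.8+; the helper name and fallback string are illustrative, not part of LRBinner):

```python
from importlib import metadata

def pkg_version(name: str) -> str:
    """Return the installed version of a distribution, or a fallback string."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"

# After the pip upgrade, check the active build, e.g.:
print(pkg_version("hdbscan"))
```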

Let me know if this helps,

~Anuradha

acalchera commented 2 years ago

Hi @anuradhawick!

Thanks a lot for looking into this. I upgraded HDBSCAN with the command you posted and it works!

Thank you very much for your help!

Best wishes, Anjuli

anuradhawick commented 2 years ago

@4njul1 fantastic.

Please let me know how the tool performs, along with any artefacts and feedback, when you have time.

Thanks, Anuradha.