crastr opened this issue 2 years ago
Thanks for the issue. I will look into this. I may update the Dockerfile accordingly and reply to you in a few days' time.
Hi @anuradhawick!
We are having the same problem clustering with HDBSCAN, but we are using conda instead of Docker. Have you had time to look into the issue yet?
Thank you very much in advance! Anjuli
2022-02-10 14:29:40,021 - INFO - Command /home/woo/tools/LRBinner/LRBinner contigs --reads-path reads.fasta --bin-count 10 --bin-size 32 --output microbiome_bins --k-size 3 --ae-dims 4 --ae-epochs 200 --threads 20 --contigs scaffolds.1Kb.fa --resume
2022-02-10 14:29:40,035 - INFO - Resuming the program from previous checkpoints
2022-02-10 14:29:40,044 - INFO - Loading contig lengths
2022-02-10 14:29:40,199 - INFO - Loading marker genes from previous computations
2022-02-10 14:31:11,071 - INFO - Contigs already split
2022-02-10 14:31:11,071 - INFO - 15-mer counting already performed
2022-02-10 14:31:11,072 - INFO - K-mer vectors already computed
2022-02-10 14:31:11,072 - INFO - Coverage vectors already computed
2022-02-10 14:31:13,264 - INFO - Numpy arrays already computed
2022-02-10 14:31:13,264 - INFO - VAE already trained
2022-02-10 14:31:14,670 - INFO - Clustering using HDBSCAN running
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/woo/tools/LRBinner/LRBinner", line 197, in <module>
main()
File "/home/woo/tools/LRBinner/LRBinner", line 179, in main
pipelines.run_contig_binning(args)
File "/home/woo/tools/LRBinner/mbcclr_utils/pipelines.py", line 243, in run_contig_binning
output, fragment_parent, separate, contigs, threads)
File "/home/woo/tools/LRBinner/mbcclr_utils/cluster_utils.py", line 494, in perform_contig_binning_HDBSCAN
labels = HDBSCAN(min_cluster_size=250).fit_predict(latent)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
self.fit(X)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 919, in fit
self._min_spanning_tree) = hdbscan(X, **kwargs)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 615, in hdbscan
core_dist_n_jobs, **kwargs)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/hdbscan/hdbscan_.py", line 278, in _hdbscan_boruvka_kdtree
n_jobs=core_dist_n_jobs, **kwargs)
File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__
File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/home/woo/miniconda3/envs/lrbinner/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
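For context on this trace: joblib's loky backend memory-maps large NumPy arrays to disk and passes workers read-only views, and the Cython memoryviews inside older hdbscan/scikit-learn builds require writable buffers, which is what raises "buffer source array is read-only". A minimal workaround sketch, not LRBinner's actual code (the latent.npy path and the single-job setting are assumptions for illustration):

```python
import numpy as np
import hdbscan

# Hypothetical input: the latent embedding that LRBinner clusters.
latent = np.load("latent.npy")

# Work around the read-only buffer: cluster a fresh writable copy and
# keep the core-distance computation in a single job so joblib does
# not memory-map the array for worker processes.
latent = np.array(latent, copy=True)
labels = hdbscan.HDBSCAN(min_cluster_size=250,
                         core_dist_n_jobs=1).fit_predict(latent)
```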
Hi Anjuli,
Thanks for the issue.
Can you tell me how you installed the packages using conda? I need to know the command you used to install hdbscan.
Thanks.
Hi @anuradhawick!
Thank you for your reply! We used the following commands, as suggested in the README.md:
conda create -n lrbinner -y python=3.7 numpy scipy seaborn h5py tabulate pytorch hdbscan gcc openmp tqdm biopython
conda activate lrbinner
git clone https://github.com/anuradhawick/LRBinner.git
cd LRBinner/
python setup.py build
Thank you very much for looking into it. We are eager to use your tool on our data!
Cheers, Anjuli
Hi @4njul1 and @crastr,
Could you please try to install HDBSCAN using the command,
pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan
There are issues in the conda version, and it is not the latest one.
Let me know if this helps,
~Anuradha
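A minimal way to confirm that the upgraded build is the one Python actually imports (nothing here is specific to LRBinner):

```python
import pkg_resources

# After the pip upgrade above, this should report the version built
# from the GitHub sources rather than the older conda package.
print(pkg_resources.get_distribution("hdbscan").version)
```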
Hi @anuradhawick!,
Thanks a lot for looking into this. I upgraded HDBSCAN with the command you posted, and it works!
Thank you very much for your help!
Best wishes, Anjuli
@4njul1, fantastic.
Please let me know how the tool performs, along with any artefacts and feedback, when you have time.
Thanks, Anuradha.
Hi @anuradhawick!
We managed to launch LRBinner in Docker, but the "Clustering using HDBSCAN running" step ended with the following error.
docker run --rm -it --gpus '"device=3"' -v `pwd`:`pwd` -u `id -u`:`id -g` anuradhawick/lrbinner contigs -r $PWD/c1.fq -c $PWD/c1.fasta --k-size 4 --cuda --output $PWD/result
Output:
2021-12-10 17:35:54,303 - INFO - Command /usr/LRBinner/LRBinner contigs -r /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/c1.fq -c /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/c1.fasta --k-size 4 -t 40 --cuda --output /mnt/40_tb_10/work/alex/other_labs/andronov/andornov_metag_2021/complete_polished/c1/result --resume
2021-12-10 17:35:57,360 - INFO - CUDA found in system
2021-12-10 17:35:57,362 - INFO - Resuming the program from previous checkpoints
2021-12-10 17:35:57,363 - INFO - Loading contig lengths
2021-12-10 17:35:57,485 - INFO - Loading marker genes from previous computations
2021-12-10 17:38:00,783 - INFO - Contigs already split
2021-12-10 17:38:00,783 - INFO - 15-mer counting already performed
2021-12-10 17:38:00,783 - INFO - K-mer vectors already computed
2021-12-10 17:38:00,783 - INFO - Coverage vectors already computed
2021-12-10 17:38:01,196 - INFO - Numpy arrays already computed
2021-12-10 17:38:01,196 - INFO - VAE already trained
2021-12-10 17:38:01,248 - INFO - Clustering using HDBSCAN running
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 407, in _process_worker
call_item = call_queue.get(block=True, timeout=timeout)
File "/opt/conda/lib/python3.9/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/LRBinner/LRBinner", line 197, in <module>
main()
File "/usr/LRBinner/LRBinner", line 179, in main
pipelines.run_contig_binning(args)
File "/usr/LRBinner/mbcclr_utils/pipelines.py", line 242, in run_contig_binning
cluster_utils.perform_contig_binning_HDBSCAN(
File "/usr/LRBinner/mbcclr_utils/cluster_utils.py", line 494, in perform_contig_binning_HDBSCAN
labels = HDBSCAN(min_cluster_size=250).fit_predict(latent)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 941, in fit_predict
self.fit(X)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 919, in fit
self._min_spanning_tree) = hdbscan(X, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan_.py", line 610, in hdbscan
(single_linkage_tree, result_min_span_tree) = memory.cache(
File "/opt/conda/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
return self.func(*args, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/hdbscan/hdbscan.py", line 275, in _hdbscan_boruvka_kdtree
alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
File "hdbscan/_hdbscan_boruvka.pyx", line 392, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.init
File "hdbscan/_hdbscan_boruvka.pyx", line 426, in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds
File "/opt/conda/lib/python3.9/site-packages/joblib/parallel.py", line 1056, in call
self.retrieve()
File "/opt/conda/lib/python3.9/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/opt/conda/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 445, in result
return self.__get_result()
File "/opt/conda/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
The same command works perfectly on 10% of the contigs and 10% of the reads.
A quick search suggested that the problem could be connected to the number of rows (as in https://githubmemory.com/repo/scikit-learn/scikit-learn/issues/21228); see the sketch below.
Thanks in advance! Alexey
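The row-count observation is consistent with joblib's documented behaviour: with the default max_nbytes of roughly 1 MB, the loky backend memory-maps larger input arrays to disk and hands workers read-only views, while smaller arrays are pickled normally and stay writable. A minimal sketch of that difference, illustrative only and not LRBinner code:

```python
import numpy as np
from joblib import Parallel, delayed

def try_write(a):
    a[0] = 1.0  # fails if the worker received a read-only memmap
    return a[0]

small = np.zeros(10_000)    # ~80 KB, below joblib's ~1 MB memmap threshold
big = np.zeros(1_000_000)   # ~8 MB, memory-mapped read-only for workers

Parallel(n_jobs=2)(delayed(try_write)(small) for _ in range(2))  # works
Parallel(n_jobs=2)(delayed(try_write)(big) for _ in range(2))    # ValueError
```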