a-h-b / binny

GNU General Public License v3.0
27 stars 6 forks

Binny step is slow and uses a lot of RAM on large datasets #44

Open kingtom2016 opened 1 year ago

kingtom2016 commented 1 year ago

I am running Binny on 18 metagenomes with an average sequencing depth of 20 Gbp, using 20 cores. This step has been running for 12 days and is still not complete. It also requires a lot of RAM (more than 300 GB). How can I speed this step up and reduce its memory use? For example, by keeping coassembly mode off?
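For reference, the setting I mean is coassembly_mode in the Binny config, which is 'auto' in my run. A sketch of the change I have in mind, assuming the key also accepts 'on'/'off' (I have not verified the allowed values):

    # config.yaml excerpt (hypothetical edit)
    coassembly_mode: 'off'   # was: auto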

ohickl commented 1 year ago

Hi,

could you post the log (path/to/outdir/logs/binning_binny.log) and your config file so we can have a look at your setup and the run's progress?

Best Oskar

kingtom2016 commented 1 year ago

Here is my log and config file content:

06/01/2023 12:56:50 PM - Starting Binny run for sample test.
06/01/2023 01:30:36 PM - Looking for single contig bins.
06/01/2023 01:42:15 PM - Found 0 single contig bins.
06/01/2023 01:42:15 PM - Calculating N90
06/01/2023 01:44:43 PM - N90 is 547, with scMAGs would be 547.
06/01/2023 02:06:23 PM - Masking potentially disruptive sequences from k-mer counting.
06/01/2023 02:09:43 PM - Calculating k-mer frequencies of sizes: 2, 3, 4.

NX_value: 90
bin_quality:
  min_completeness: 50
  purity: 90
  start_completeness: 92.5
clustering:
  hdbscan_epsilon_range: 0.250,0.000
  hdbscan_min_samples_range: 1,5,10
  include_depth_initial: 'False'
  include_depth_main: 'False'
coassembly_mode: auto
conda_source: ''
db_path: ''
distance_metric: manhattan
embedding:
  max_iterations: 50
extract_scmags: 'True'
kmers: 2,3,4
mantis_env: SemiBin
mask_disruptive_sequences: 'True'
max_cont_length_cutoff: 2250
max_cont_length_cutoff_marker: 2250
max_marker_lineage_depth_lvl: 2
max_n_contigs: 5.0e5
mem:
  big_mem_avail: 100
  big_mem_per_core_gb: 26
  normal_mem_per_core_gb: 16
min_cont_length_cutoff: 2250
min_cont_length_cutoff_marker: 2250
outputdir: tmp/binny_results
prokka_env: SemiBin
raws:
  assembly: ASSEMBLY/final_assembly.fasta
  contig_depth: ''
  metagenomics_alignment: ASSEMBLY/*sort.bam
sample: test
sessionName: TESTRUN_3919871335
snakemake_env: SemiBin
tmp_dir: tmp
write_contig_data: 'True'

ohickl commented 1 year ago

Thanks.

Something definitely went wrong; calculating the k-mer frequencies should not take that long. Can you check whether there are actually processes running? How many sequences are in the assembly? You could try sub-sampling it to a small number, e.g. 50k sequences, to see if it runs at all and whether it's something with the system.
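Something along these lines would do for the sub-sampling (just an untested sketch, assuming Biopython is available; the file names are placeholders):

    # subsample_assembly.py - sketch: randomly keep ~50k sequences from a FASTA
    # Assumes Biopython is installed; paths are placeholders.
    import random
    from Bio import SeqIO

    n_keep = 50_000
    records = list(SeqIO.parse("final_assembly.fasta", "fasta"))
    if len(records) > n_keep:
        records = random.sample(records, n_keep)
    SeqIO.write(records, "final_assembly.sub50k.fasta", "fasta")
    print(f"Wrote {len(records)} sequences")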

kingtom2016 commented 1 year ago

Actually, I have tested Binny on other, relatively small datasets and it works well. When running on the large dataset (each sample has roughly 1 Gbp of assembled sequences >500 bp, generated by MEGAHIT), the system also looks normal before the Binny step (the top command shows a Python process using CPU and RAM). I guess the large assembly size may lead to this RAM and speed problem?

ohickl commented 1 year ago

Could be. I am still puzzled as to why it would stall at the k-mer counting but not throw any error. I will try to do some tests to see if I can reproduce it. In the meantime, you could try limiting the assembly to e.g. 1*10^6 or 5*10^5 sequences, if it is much above that, to see if that helps.
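One way to cap the assembly size (again just a sketch, assuming Biopython; keeping the longest contigs is only one option, and the paths are placeholders):

    # keep_longest_contigs.py - sketch: keep only the n longest contigs
    # Assumes Biopython is installed; paths and n are placeholders.
    from Bio import SeqIO

    n_keep = 500_000  # e.g. 5*10^5 contigs
    records = sorted(SeqIO.parse("final_assembly.fasta", "fasta"),
                     key=lambda r: len(r.seq), reverse=True)
    SeqIO.write(records[:n_keep], "final_assembly.top500k.fasta", "fasta")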

kingtom2016 commented 1 year ago

I noticed this warning: '/mnt/g/stone_meta/software/binny/conda/631d6f5983d746bd3b67fe54d30e5f94/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py:702: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.' Is this a clue?
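In case it helps, I could log system memory use alongside the run to see whether workers are being killed under memory pressure. A rough monitoring sketch (assumes psutil is installed; the interval and output format are arbitrary):

    # mem_monitor.py - sketch: print system memory usage once a minute (assumes psutil)
    import time
    import psutil

    while True:
        vm = psutil.virtual_memory()
        print(f"{time.strftime('%H:%M:%S')} used={vm.used/1e9:.1f} GB "
              f"available={vm.available/1e9:.1f} GB ({vm.percent}% used)", flush=True)
        time.sleep(60)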