BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/

SemiBin2 fails to generate bins for large assemblies #166

Closed apcamargo closed 5 hours ago

apcamargo commented 3 weeks ago

I'm trying to bin a couple of assemblies with SemiBin2 (v2.1.0), using the single_easy_bin command. For some assemblies, the job exits early without generating any bins, and the log stops right after "Start binning.":

[2024-06-11 10:41:43,492] INFO: Setting number of CPUs to 64
[2024-06-11 10:41:43,492] INFO: Binning for short_read
[2024-06-11 10:41:43,495] INFO: SemiBin will run in self supervised mode
[2024-06-11 10:41:49,295] INFO: Did not detect GPU, using CPU.
[2024-06-11 10:42:01,482] INFO: Generating training data...
[2024-06-11 10:49:31,759] INFO: Calculating coverage for every sample.
[2024-06-11 11:21:08,692] INFO: Processed: mapping_binning/B_1.bam
[2024-06-11 11:21:08,694] INFO: Processed: mapping_binning/B_2.bam
[2024-06-11 11:21:57,939] INFO: Processed: mapping_binning/B_3.bam
[2024-06-11 11:22:34,630] INFO: Start training from a single sample.
[2024-06-11 11:22:42,504] INFO: Training model...
[2024-06-11 12:13:52,844] INFO: Training finished.
[2024-06-11 12:13:52,909] INFO: Start binning.

This seems to affect only large assemblies: the runs for small assemblies finished without an issue, while the runs for large assemblies failed. The contigs in our assemblies are ≥1 kb, and we mapped the reads with strobealign and sorted the alignments. Everything we are doing is standard, except that we are running SemiBin2 through Apptainer.

apptainer pull semibin.sif docker://quay.io/biocontainers/semibin:2.1.0--pyhdfd78af_0

apptainer exec semibin.sif SemiBin2 single_easy_bin \
    -i binning_assemblies/${SAMPLE}.fna.gz \
    -b mapping_binning/${SAMPLE}/*.bam \
    -o semibin2_output/${SAMPLE}

I suspect this is a memory issue caused by the amount of data. We will rerun with --min-len 2500, on the assumption that SemiBin2 will then ignore shorter contigs and generate network inputs only for the contigs longer than the threshold. I will update the issue if there are any developments.
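To get a rough idea of how much a higher --min-len would shrink the input, one can count the contigs at or above each threshold before rerunning. This is a minimal sketch: count_long_contigs is a hypothetical helper (not part of SemiBin), and it assumes a plain or gzipped FASTA file with Unix line endings. Whether SemiBin2 filters contigs in exactly this way is the same assumption made above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SemiBin): count the contigs in a FASTA file
# whose total sequence length is at or above a minimum length.
# Usage: count_long_contigs <fasta or fasta.gz> <min_len>
count_long_contigs() {
    local fasta=$1 min_len=$2
    # gzip -cdf decompresses gzipped input and passes plain text through unchanged.
    gzip -cdf "$fasta" | awk -v min="$min_len" '
        # Header line: close the previous record, then start a new one.
        /^>/ { if (seen && len >= min) n++; seen = 1; len = 0; next }
        # Sequence line: add its length to the current record.
             { len += length($0) }
        # End of file: close the last record and print the count.
        END  { if (seen && len >= min) n++; print n + 0 }'
}
```

For example, comparing `count_long_contigs binning_assemblies/${SAMPLE}.fna.gz 1000` with `count_long_contigs binning_assemblies/${SAMPLE}.fna.gz 2500` shows how many contigs would be dropped by the higher threshold.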

This might be related to https://github.com/BigDataBiology/SemiBin/issues/150. I decided to open another issue because my jobs finished without generating any bins, instead of hanging indefinitely.

apcamargo commented 3 weeks ago

Increasing --min-len fixed the issue. My guess is that the original failures were caused by running out of memory.
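One way to check the memory hypothesis on a failing run is to look at the exit status of the job, since the log itself ends without an error. The sketch below uses a hypothetical wrapper, run_and_report, which is not part of SemiBin or Apptainer. It relies on the shell convention that a process killed by SIGKILL is reported with status 137 (128 + 9), which is what the kernel's out-of-memory killer typically produces. It assumes that the exit status of SemiBin2 is passed through by Apptainer, and a status of 137 is only a hint, not proof of a memory problem.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of SemiBin): run a command, print a one-line
# summary of its exit status, and return that same status to the caller.
run_and_report() {
    "$@"
    local status=$?
    if [ "$status" -eq 0 ]; then
        echo "exit 0: finished normally"
    elif [ "$status" -eq 137 ]; then
        # 137 = 128 + 9 (SIGKILL); often, but not always, the OOM killer.
        echo "exit 137: killed by SIGKILL (possibly out of memory)"
    else
        echo "exit $status: failed"
    fi
    return "$status"
}
```

Usage with the command from this issue would look like `run_and_report apptainer exec semibin.sif SemiBin2 single_easy_bin -i binning_assemblies/${SAMPLE}.fna.gz -b mapping_binning/${SAMPLE}/*.bam -o semibin2_output/${SAMPLE} --min-len 2500`.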