BigDataBiology / SemiBin

SemiBin: metagenomics binning with self-supervised deep learning
https://semibin.rtfd.io/
KeyError when generating coverage #168

Closed Louis-MG closed 2 months ago

Louis-MG commented 2 months ago

Like other people, I get an error when trying multi-sample binning:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin/generate_coverage.py", line 100, in generate_cov
    contig_cov, must_link_contig_cov = calculate_coverage(
                                       ^^^^^^^^^^^^^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin/generate_coverage.py", line 45, in calculate_coverage
    cov_threshold = contig_threshold_dict[sample_name]
                    ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
KeyError: 'NODE_1_length_314220_cov_67.980583'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/guelou01/miniconda3/envs/SemiBin/bin/SemiBin2", line 10, in <module>
    sys.exit(main2())
             ^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin/main.py", line 1610, in main2
    multi_easy_binning(
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin/main.py", line 1313, in multi_easy_binning
    sample_list = generate_sequence_features_multi(logger, args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/site-packages/SemiBin/main.py", line 964, in generate_sequence_features_multi
    s = r.get()
        ^^^^^^^
  File "/home/guelou01/miniconda3/envs/SemiBin/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
KeyError: 'NODE_1_length_314220_cov_67.980583'

I tried the --separator : option in case the default separator was the problem, but that does not work either. The commands were:

SemiBin2 concatenate_fasta \
    --input-fasta "$scaffolds"/*/*_scaffolds.fasta \
    --output "$output"/

# multi-sample binning
SemiBin2 multi_easy_bin \
        -i "$output"/concatenated.fa.gz \
        -b "$alignments"/*mapped.sorted.bam \
        -o "$output"/ \
        -s :

And the concatenated.fa.gz looks like:

>DRR171461_scaffolds:NODE_1_length_314220_cov_67.980583
seq

Someone else apparently found that the problem was in the BAM or BAI files, but did not post the fix.

luispedro commented 2 months ago

My guess is that some of the alignment files ("$alignments"/*mapped.sorted.bam) were not produced by mapping reads against the concatenated.fa.gz file.
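One way to test this guess is to compare the reference names in each BAM header with the contig names in concatenated.fa.gz. The following is only a minimal sketch, not part of SemiBin: the function names are hypothetical, and it assumes the header text was obtained separately, for example with `samtools view -H sample.mapped.sorted.bam`.

```python
import gzip


def fasta_contig_names(path):
    """Return the set of contig names (first word of each header) in a FASTA file."""
    # Transparently handle gzip-compressed input such as concatenated.fa.gz.
    opener = gzip.open if str(path).endswith(".gz") else open
    names = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def sam_header_references(header_text):
    """Return the reference names found in the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; SN holds the name.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def unknown_references(header_text, fasta_names):
    """List BAM reference names that are absent from the concatenated FASTA."""
    return [
        name
        for name in sam_header_references(header_text)
        if name not in fasta_names
    ]
```

If `unknown_references` returns names such as NODE_1_length_314220_cov_67.980583 without the sample prefix, that BAM was most likely produced by mapping against the per-sample scaffolds rather than against the concatenated file.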

luispedro commented 2 months ago

The next version will at least print a more detailed error message, which should make the problem faster to diagnose.

Louis-MG commented 2 months ago

Okay, I read the documentation too fast. The mistake seems to be recurring, so maybe showing the bold title 'Generating bam' after 'Generating concatenated.fa' would help.

Sorry for the inconvenience.