genotoul-bioinfo / Binette

A fast and accurate binning refinement tool to construct high-quality MAGs from the output of multiple binning tools.
https://binette.readthedocs.io
MIT License

Binette error: bin length and N50 #11

Closed schmigle closed 1 month ago

schmigle commented 3 months ago

Looks like it's using some kind of length information for the dictionary keys.

binette --bin_dirs semibin/output_bins rosella_bins/bins metacoag/bins --contigs flye/assembly.fasta -v --threads 28

[2024-05-07 13:48:25] INFO - Program started
[2024-05-07 13:48:25] INFO - command line: /home/u30/moshesteyn/miniforge3/envs/binette/bin/binette --bin_dirs semibin/output_bins rosella_bins/bins metacoag/bins --contigs flye/assembly.fasta -v --threads 28
[2024-05-07 13:48:25] INFO - Parsing bin directories.
[2024-05-07 13:48:25] INFO - 3 bin sets processed:
[2024-05-07 13:48:25] INFO -  semibin/output_ - 2 bins
[2024-05-07 13:48:25] INFO -  rosella_bins/ - 91 bins
[2024-05-07 13:48:25] INFO -  metacoag/ - 2 bins
[2024-05-07 13:48:25] INFO - Parsing contig fasta file: flye/assembly.fasta
[2024-05-07 13:48:25] INFO - Predicting cds sequences with Pyrodigal using 28 threads.
[2024-05-07 13:48:40] INFO - Writing predicted protein sequences.
[2024-05-07 13:48:46] INFO - Running diamond
[2024-05-07 13:48:46] INFO - diamond blastp --outfmt 6 --max-target-seqs 1 --query results/temporary_files/assembly_proteins.faa -o results/temporary_files/diamond_result.tsv --threads 28 --db /groups/baltrus/moshesteyn/checkm2_db/CheckM2_database/uniref100.KO.1.dmnd --query-cover 80 --subject-cover 80 --id 30 --evalue 1e-05 --block-size 2 2> results/temporary_files/diamond_result.log
[2024-05-07 13:49:25] INFO - Finished Running DIAMOND
[2024-05-07 13:49:25] INFO - Parsing diamond results.
[2024-05-07 13:49:25] INFO - Compute cds metadata.
[2024-05-07 13:49:25] INFO - Collecting contig amino acid composition using 28 threads.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15149/15149 [00:01<00:00, 12787.91it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15149/15149 [00:00<00:00, 506081.24contig/s]
[2024-05-07 13:49:29] INFO - Calculating amino acid composition in parallel.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15149/15149 [00:00<00:00, 1186146.79contig/s]
[2024-05-07 13:49:29] INFO - Calculating total amino acid length in parallel.
[2024-05-07 13:49:29] INFO - Add size and assess quality of input bins
[2024-05-07 13:49:29] INFO - Assess bin length and N50
Traceback (most recent call last):
  File "/home/u30/moshesteyn/miniforge3/envs/binette/bin/binette", line 10, in <module>
    sys.exit(main())
  File "/home/u30/moshesteyn/miniforge3/envs/binette/lib/python3.8/site-packages/binette/main.py", line 343, in main
    bin_quality.add_bin_metrics(original_bins, contig_metadat, args.contamination_weight, args.threads)
  File "/home/u30/moshesteyn/miniforge3/envs/binette/lib/python3.8/site-packages/binette/bin_quality.py", line 159, in add_bin_metrics
    add_bin_size_and_N50(bins, contig_to_length)
  File "/home/u30/moshesteyn/miniforge3/envs/binette/lib/python3.8/site-packages/binette/bin_quality.py", line 131, in add_bin_size_and_N50
    lengths = [contig_to_size[c] for c in bin_obj.contigs]
  File "/home/u30/moshesteyn/miniforge3/envs/binette/lib/python3.8/site-packages/binette/bin_quality.py", line 131, in <listcomp>
    lengths = [contig_to_size[c] for c in bin_obj.contigs]
KeyError: 9229

JeanMainguy commented 3 months ago

Hello, yes, it does seem a bit odd.

Are you sure that all contigs in your different binning results are present in the assembly file, with exactly the same names? If the names do not match between the bins and the assembly, Binette will not be able to work. I am not sure I added an adequate error message for this case.
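
One way to check this outside of Binette is sketched below. It assumes the bins are uncompressed FASTA files and that a contig name is the first whitespace-delimited word of each header line; the function names `read_contig_names` and `find_missing_contigs` and the list of file extensions are hypothetical and not part of Binette.

```python
from pathlib import Path


def read_contig_names(fasta_path):
    """Return the set of sequence names (first word of each '>' header line)."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def find_missing_contigs(assembly_fasta, bin_dirs, extensions=(".fa", ".fasta", ".fna")):
    """Map each bin file to the contig names it contains that the assembly lacks.

    The extension filter is an assumption; compressed bins are skipped.
    """
    assembly_names = read_contig_names(assembly_fasta)
    missing = {}
    for bin_dir in bin_dirs:
        for bin_file in sorted(Path(bin_dir).iterdir()):
            if bin_file.suffix not in extensions:
                continue
            absent = read_contig_names(bin_file) - assembly_names
            if absent:
                missing[str(bin_file)] = absent
    return missing
```

For the run above, the call would be `find_missing_contigs("flye/assembly.fasta", ["semibin/output_bins", "rosella_bins/bins", "metacoag/bins"])`. An empty dictionary means every contig name in the bins was found in the assembly.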

schmigle commented 3 months ago

I think they are; DAS_tool runs fine on the dataset and relies on the same principle. However, I'll check and update later.

JeanMainguy commented 3 months ago

I was able to reproduce the error when a contig from a bin is not found in the provided contig file. The numerical ID 9229 you see in the error occurs because contig names are temporarily replaced by indexes in the code to save memory.
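
To illustrate the mechanism, here is a toy sketch, not Binette's actual code. It assumes names are mapped to integers in first-seen order and that lengths are only known for contigs in the contig file; all names in it (`build_index`, `bin_lengths`, `bin_lengths_checked`, the example contigs) are made up for the example.

```python
def build_index(names):
    """Assign each distinct name a small integer id, in first-seen order."""
    index = {}
    for name in names:
        index.setdefault(name, len(index))
    return index


# Lengths come from the contig file only.
assembly_lengths = {"contig_1": 5000, "contig_2": 3000}
# contig_9 appears in a bin but is absent from the contig file.
bin_contigs = ["contig_1", "contig_9"]

# Every name seen anywhere gets an index, but only assembly contigs get a length.
name_to_index = build_index(list(assembly_lengths) + bin_contigs)
index_to_length = {name_to_index[n]: length for n, length in assembly_lengths.items()}


def bin_lengths(contig_indices, index_to_length):
    """Unchecked lookup: a missing contig surfaces as KeyError on its integer id."""
    return [index_to_length[i] for i in contig_indices]


def bin_lengths_checked(contig_names, name_to_index, index_to_length):
    """Checked lookup: report missing contigs by name before indexing."""
    missing = [n for n in contig_names if name_to_index[n] not in index_to_length]
    if missing:
        raise ValueError(f"contigs missing from the contig file: {missing}")
    return [index_to_length[name_to_index[n]] for n in contig_names]
```

Calling `bin_lengths([name_to_index[n] for n in bin_contigs], index_to_length)` raises `KeyError: 2`, which shows the integer id rather than the contig name, while `bin_lengths_checked(bin_contigs, name_to_index, index_to_length)` raises a `ValueError` that names `contig_9`.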

Binette has some checks on contig consistency but somehow misses this particular scenario. I'll work on that and improve the error handling to make it clearer. Thanks for reporting this error.

JeanMainguy commented 1 month ago

Hello, I've improved the error handling in version 1.0.1. Binette should now clearly indicate that contigs are missing from the contig file if mismatches are found between the bin tables and the contig file. I'm closing this issue for now, but please feel free to reopen it if you think the contig name mismatch wasn't the root cause of the problem. Best