genotoul-bioinfo / Binette

A fast and accurate binning-refinement tool that constructs high-quality MAGs from the output of multiple binning tools.
https://binette.readthedocs.io
MIT License

Error 258 #2

Closed gabrieleghiotto closed 6 months ago

gabrieleghiotto commented 1 year ago

Hi, I was trying the software for the first time, but it is giving this error:

binette --extension fa --threads 10 --min_completeness 50 --outdir refined/ --contig2bin_tables genomes.stb --contigs bins/maxbin2.002.fa
100%|██████████| 449/449 [00:00<00:00, 4746.22it/s]
100%|██████████| 449/449 [00:00<00:00, 428789.28it/s]
multiprocessing.pool.RemoteTraceback:

Traceback (most recent call last):
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/site-packages/binette/bin_quality.py", line 125, in get_bin_size_and_N50
    lengths = [contig_to_size[c] for c in bin_obj.contigs]
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/site-packages/binette/bin_quality.py", line 125, in <listcomp>
    lengths = [contig_to_size[c] for c in bin_obj.contigs]
KeyError: 258

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/bioinfo/anaconda3/envs/binette/bin/binette", line 8, in <module>
    sys.exit(main())
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/site-packages/binette/binette.py", line 254, in main
    bin_quality.add_bin_metrics(original_bins, contig_info, contamination_weight, threads)
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/site-packages/binette/bin_quality.py", line 160, in add_bin_metrics
    pool.starmap(get_bin_size_and_N50, bin_and_contigsize_args)
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/bioinfo/anaconda3/envs/binette/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 258

JeanMainguy commented 1 year ago

Hello,

Could you try running the binette command with the verbose parameter -v? This will provide more detailed output, which can help in troubleshooting the problem.

For example, with contig2bin_tables like the following:

contig_1    binA
contig_8    binA
contig_15   binB
contig_9    binC

The binette command would be:

binette --outdir refined/ --contig2bin_tables bin_set1.tsv bin_set2.tsv --contigs assembly.fasta

The assembly.fasta file should contain at least the 5 contigs mentioned in the contig2bin_tables files: contig_1, contig_8, contig_15, contig_9, contig_10

Alternatively, instead of using the --contig2bin_tables argument, you have the option to specify the input bin sets by using the --bin_dirs argument. This argument requires at least two bin folders, each containing the individual bins in FASTA file format.
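For reference, the KeyError in the traceback typically arises when a contig ID listed in a contig2bin table has no matching entry in the contig-size mapping built from the assembly FASTA. A minimal sketch of the failing pattern (contig names and sizes here are hypothetical, not from the actual run):

```python
# Sizes parsed from the assembly FASTA (hypothetical values).
contig_to_size = {"contig_1": 4200, "contig_8": 1300}

# Contigs assigned to a bin via the contig2bin table; 258 is an ID
# that is absent from the assembly, mirroring the "KeyError: 258" above.
bin_contigs = ["contig_1", 258]

try:
    # Same lookup pattern as in bin_quality.get_bin_size_and_N50.
    lengths = [contig_to_size[c] for c in bin_contigs]
except KeyError as e:
    missing = e.args[0]
    print(f"KeyError: {missing}")  # the missing contig ID
```

So the fix is usually to make sure the FASTA passed via --contigs is the assembly containing every contig referenced by the contig2bin tables.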

gabrieleghiotto commented 1 year ago

Thanks, I managed to solve the problem

gabrieleghiotto commented 1 year ago

I was wondering whether, after refinement with binette, it is necessary to dereplicate the corrected MAGs with software like dRep or DAS_Tool?

JeanMainguy commented 1 year ago

Hello, Binette is a refinement tool, similar to DAS_Tool, so it is not necessary to use DAS_Tool after Binette. dRep dereplicates bins by selecting a single representative version of highly similar bins. This can be especially useful when working with multiple samples to generate a unified set of dereplicated bins common to all samples, but it really depends on what you want to obtain in the end.

gabrieleghiotto commented 1 year ago

Thanks. I would also like to ask you about the contamination weight parameter. Let's say I am interested in recovering high-quality MAGs, meaning completeness > 90 and contamination < 5. Did you perform any benchmarks, and if so, which value do you suggest? Thanks in advance.

JeanMainguy commented 12 months ago

Hello,

Sorry for the delay in responding. I've conducted some benchmarks on the contamination weight, although not extensively. From what I observed, the parameter doesn't have a huge impact on the number of high-quality bins. Its influence tends to vary with your dataset and your tolerance for some level of contamination in the resulting bins. You can reduce the contamination weight if you're comfortable with a certain amount of contamination in the final bins.

It's likely that the default contamination weight of 5 is set relatively high if you want to maximize the number of MAGs with completeness > 90 and contamination < 5, as it can penalize legitimate bins that have a minor degree of contamination. To illustrate, a bin with a completeness of 100 and contamination of 5 would score 75 (100 - 5*5), whereas a bin with a completeness of 85 and contamination of 1 would score 80 (85 - 5*1). The final bin selection is determined by these scores.
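The scoring in the example above can be sketched as a one-line function (score = completeness - weight * contamination, as described in this thread; the function name is just for illustration):

```python
def bin_score(completeness: float, contamination: float,
              contamination_weight: float = 5.0) -> float:
    """Score used for bin selection: completeness minus weighted contamination."""
    return completeness - contamination_weight * contamination

# The two bins from the example above, with the default weight of 5:
print(bin_score(100, 5))  # 100 - 5*5 = 75.0
print(bin_score(85, 1))   # 85 - 5*1 = 80.0

# Lowering the weight to 2 flips the ranking in favour of the more
# complete (but more contaminated) bin:
print(bin_score(100, 5, contamination_weight=2))  # 100 - 2*5 = 90.0
print(bin_score(85, 1, contamination_weight=2))   # 85 - 2*1 = 83.0
```

This illustrates why a lower contamination weight can recover more high-completeness bins at the cost of tolerating some contamination.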

If you're interested, you can experiment with different contamination weight values without the need to recompute everything. This can be done using the --resume flag.

Assuming you've previously run binette and the results are stored in the directory results, you can follow these steps: Create a new directory, say results_contamination_weight_2.

mkdir results_contamination_weight_2

Next, create a symbolic link of the temporary_files directory from the original results directory to your new results_contamination_weight_2 directory.

ln -s ../results/temporary_files results_contamination_weight_2/

Now, launch binette with the --contamination_weight argument along with the --resume flag and --outdir set to the new result directory results_contamination_weight_2. This lets binette reuse the existing Diamond and Prodigal results it finds in the output directory, significantly speeding up the process.

binette --contigs your_contigs.fasta --bin_dirs binsA/ binsB/ binsC/ \
        --outdir results_contamination_weight_2 --contamination_weight 2 \
        --resume

If you come across any intriguing discoveries, please do share them. :-)

If you have further questions or need additional clarification, please don't hesitate to ask.

Jean

gabrieleghiotto commented 12 months ago

Thank you very much for the exhaustive answer. Our goal is to find an alternative, better way of refining the binning results, since we are having a lot of trouble with the completeness of hydrogenotrophic archaea. We got disappointing results both with straight dereplication using dRep on bins retrieved from maxbin2, vamb3, metabat1 and metabat2, and also after refinement with DAS_Tool or MetaWrap. In more detail, we were not able to reach a completeness above 75-85% for some key species, and in metatranscriptomics or metabolic modelling studies this can be a huge problem, since you are missing information about genes of interest. I will try to perform some trials on a couple of metagenomic datasets that we have in house and update you. Our goal is certainly, as you said, to maximize the number of MAGs with completeness > 90 and contamination < 5. Up to now, the only strategy that improved the original result has been mapping reads against all the bins retrieved with the aforementioned binning software, extracting only the mapped reads, and performing the assembly a second time, followed by binning and dereplication. However, this strategy is time-consuming, so we are looking into other available refinement software. I will let you know :)

Best, Gabriele

cho7-31 commented 12 months ago

Hello, perhaps another option would be to look at contigs affiliated with your species/taxa of interest without going as far as binning? There is a huge loss of information between assembly and binning, and many contigs are lost because they are not binned or end up in bins of too poor quality. Or, if you really want genomes, you can also refine the bins by hand via anvi'o. Good luck to you!