EMBL-PKU / BASALT

a question about dRep and the source of MAGs #26

Open liupfskygre opened 2 months ago

liupfskygre commented 2 months ago

Hi, basalt developer,

BASALT is great. I have two technical questions.

1. What dRep threshold is used in the pipeline? From the paper and the README, I understand that a dRep-like strategy is used.

The README says: "Comparatively, prominent binning tools such as metaWRAP 1 and DASTool 2 only support single assembly file as input, where multiple binning processes are required if there are multiple assembly files in a dataset. Moreover, redundant bins generated under SA + CA mode need to be removed using dereplication tools such as dRep 3."

The Methods section of the NC paper says: "In the Best-bins grouping program, ANI is first calculated between each bin pair in the hybrid binsets. Bins at ANI ≥ 99% and AF ≥ 50% are grouped for further bin dereplication." So are the MAGs output by BASALT dereplicated with dRep at ANI 99% and AF 50%? Is this right?

2. If I give BASALT two or three contig files together with their raw reads, can I still tell which MAGs come from which contigs (that is, from a specific sample), or will BASALT merge all contigs and output MAGs without any information about which sample they came from? For example, given sample1.contigs.fa, sample2.contigs.fa, sample3.contigs.fa, sample1_r1.fa, sample1_r2.fa, sample2_r1.fa, sample2_r2.fa, sample3_r1.fa, and sample3_r2.fa, what will the output look like? Will there be separate MAGs for sample1, sample2, and sample3, or a single output with all MAGs merged and pooled?

Thanks

noddevil4949 commented 2 months ago

Hi,

Thanks for your question.

  1. BASALT does not use dRep for dereplication. We use our own script to identify potentially redundant bins based on the CSI algorithm. The README passage compares BASALT with other tools such as metaWRAP and DASTool, and the point there is that further dereplication is necessary when those tools are used for binning in SA + CA mode. As described in our paper, we group bins at ANI ≥ 99% and AF ≥ 50% for further bin dereplication.

  2. You can trace which contig comes from which assembly file in the intermediate results, such as the bins produced after the bin selection step in the "BestBinset" folder, where the contig names in the bins match the intermediate FASTA files (such as 1_sample1.contigs.fa and 2_sample2.contigs.fa, with contigs renamed as >1-1, >1-2, >1-3, etc.). However, because BASALT includes a reassembly step, the contigs in the bins are renamed after that step, since they have been reassembled. Therefore, you cannot trace the contigs in the final output back to the original contigs.
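To illustrate the grouping criterion in point 1, here is a minimal Python sketch of grouping bins whose pairwise ANI and AF both meet the thresholds quoted from the paper. It is not BASALT's own script: the function name `group_bins`, the single-linkage (union-find) grouping, and the example values are all assumptions made for illustration.

```python
from collections import defaultdict

ANI_MIN = 99.0  # percent, threshold quoted from the paper
AF_MIN = 50.0   # percent, threshold quoted from the paper


def group_bins(bins, pairs, ani_min=ANI_MIN, af_min=AF_MIN):
    """Group bins linked by pairs that meet both thresholds.

    Hypothetical helper: uses single-linkage grouping via union-find,
    which is an assumption and not necessarily what BASALT does.
    """
    parent = {b: b for b in bins}

    def find(x):
        # Walk up to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, ani, af in pairs:
        if ani >= ani_min and af >= af_min:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for b in bins:
        groups[find(b)].append(b)
    return sorted(sorted(g) for g in groups.values())


# Made-up pairwise values: (bin_a, bin_b, ANI %, AF %)
example_pairs = [
    ("bin1", "bin2", 99.5, 80.0),  # meets both thresholds
    ("bin2", "bin3", 99.1, 40.0),  # AF too low
    ("bin3", "bin4", 98.0, 90.0),  # ANI too low
]
example_groups = group_bins(["bin1", "bin2", "bin3", "bin4"], example_pairs)
print(example_groups)
```

In this made-up example only bin1 and bin2 end up in the same group; how a representative is then chosen within each group is outside the scope of the sketch.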
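For point 2, a rough Python sketch of how contigs in an intermediate bin might be traced back to their assembly file. It assumes, based only on the example names above, that the number before the hyphen in a contig name (such as `1` in `>1-2`) matches the numeric prefix of the intermediate file name (such as `1_sample1.contigs.fa`). The function names and sample data are hypothetical, and this only applies to intermediate bins, not to the final output after reassembly.

```python
def assembly_index(filenames):
    """Map numeric prefix to file name, e.g. '1' -> '1_sample1.contigs.fa'."""
    index = {}
    for name in filenames:
        prefix, _, _ = name.partition("_")
        index[prefix] = name
    return index


def trace_contigs(fasta_lines, filenames):
    """Return {contig_id: assembly file} for every header line in a bin.

    Assumes contig IDs look like '<file prefix>-<number>' (an assumption
    drawn from the example names in this thread).
    """
    index = assembly_index(filenames)
    origin = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to trace
        contig_id = fields[0]
        prefix = contig_id.split("-", 1)[0]
        origin[contig_id] = index.get(prefix, "unknown")
    return origin


# Made-up bin content and intermediate file names
bin_fasta = [">1-1", "ACGT", ">2-7", "GGCC", ">1-3", "TTAA"]
files = ["1_sample1.contigs.fa", "2_sample2.contigs.fa"]
example_origin = trace_contigs(bin_fasta, files)
for contig, source in example_origin.items():
    print(contig, source)
```

Counting contigs per source file for each bin would then show whether a bin draws on one assembly or several.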

Hope the above explanations make sense, and please let me know if you have any further questions.

Thanks for using BASALT!