Hydro3639 / NanoPhase

Reference-quality genome reconstruction from complex metagenomes (or bacterial isolates) using only Nanopore long reads or both long and short reads (hybrid strategy)
MIT License

Abundance of reconstructed genomes #2

Closed bcpd closed 1 year ago

bcpd commented 1 year ago

In addition to MAG recovery, I am also interested in estimating the abundance/coverage of each genome (similar to what is done [e.g.] via CoverM). Your paper reads: "long reads were mapped to the above draft bins using minimap2 (v2.21-r1071; map-ont) with at least 90% identity and 90% coverage, producing draft-bin-based clusters [...]". I am assuming this should be similar to what I'm looking for (in fact, CoverM uses minimap2), but I would be grateful if you could confirm this.

On a (somewhat) related note, I am curious as to your decision not to dereplicate the MAGs.

Hydro3639 commented 1 year ago

Do you mean the coverage reported for each contig in the recovered genomes? If so, that value comes from the Flye output. I also checked it against minimap2 results generated with the parameters you quoted, and the two are very similar. If you want the coverage of each MAG (mapped bases / genome size), sum the mapped bases over all contigs in the MAG and divide by the total length of those contigs. The mapped bases can come from either 1) the minimap2 results or 2) the contig coverage information provided in the recovered MAGs (sum of length * coverage over all contigs).
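Option 1 above (summing mapped bases from minimap2 results) could be sketched roughly as follows. This is only a sketch under stated assumptions, not the code used by NanoPhase: it assumes the alignments are in PAF format (what minimap2 writes when `-a` is not given), it interprets "90% identity" as residue matches divided by alignment block length and "90% coverage" as the aligned fraction of the read, and the function name `mag_coverage_from_paf` and its parameters are hypothetical. Secondary alignments, if present, would be counted too, so filter them beforehand if that matters.

```python
def mag_coverage_from_paf(paf_lines, contig_lengths,
                          min_identity=0.9, min_read_cov=0.9):
    """Estimate MAG coverage as mapped bases / genome size.

    paf_lines      -- iterable of PAF records (tab-separated text lines)
    contig_lengths -- dict {contig_name: length} for the contigs of ONE MAG
    The two thresholds are assumptions about how the 90%/90% filter is defined.
    """
    mapped = 0
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        # PAF columns: 2 = query length, 3 = query start, 4 = query end
        qlen, qstart, qend = (int(x) for x in fields[1:4])
        tname = fields[5]                            # column 6: target name
        tstart, tend = int(fields[7]), int(fields[8])
        # column 10 = residue matches, column 11 = alignment block length
        matches, block_len = int(fields[9]), int(fields[10])
        if tname not in contig_lengths or qlen == 0 or block_len == 0:
            continue  # alignment to a contig outside this MAG, or empty record
        identity = matches / block_len
        read_cov = (qend - qstart) / qlen
        if identity >= min_identity and read_cov >= min_read_cov:
            mapped += tend - tstart  # bases of the contig covered by this read
    genome_size = sum(contig_lengths.values())
    if genome_size == 0:
        raise ValueError("contig_lengths is empty")
    return mapped / genome_size
```

The contig lengths are passed in separately so that contigs with no mapped reads still count toward the genome size.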

As for your second question, I am not sure what you mean by "not to dereplicate the MAGs". For this pipeline, dereplication is not necessary, because we did not recover the genomes in different ways.

bcpd commented 1 year ago

Thank you. Yes, I understand how to do the calculation; I simply cannot find the relevant results. Could you point out where to find the minimap2 results and the contig coverage information? Thanks very much.

Hydro3639 commented 1 year ago

If you want to use the minimap2 results, I am afraid you will need to run the mapping again yourself. If you want to use the contig coverage information, just check the headers in the bin file, e.g. grep '^>' bin.x.fa. Each header should look like this: bin.236_contig_131103 length=227223 cov=67 circular=N.
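The header-based route can be sketched in Python. This is a minimal sketch, assuming every contig header in the bin file follows the `length=... cov=...` layout of the example header above; the names `parse_header`, `mag_coverage`, and `mag_coverage_from_fasta` are hypothetical and not part of NanoPhase.

```python
import re

# Assumed header layout, taken from the example header in this thread:
#   >bin.236_contig_131103 length=227223 cov=67 circular=N
HEADER_RE = re.compile(r"^>?(\S+)\s+length=(\d+)\s+cov=([\d.]+)")


def parse_header(line):
    """Return (contig_name, length, coverage), or None if the line does not match."""
    m = HEADER_RE.match(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), float(m.group(3))


def mag_coverage(header_lines):
    """Length-weighted mean coverage: sum(length * cov) / sum(length)."""
    total_bases = 0.0
    total_length = 0
    for line in header_lines:
        parsed = parse_header(line)
        if parsed is None:
            continue  # skip anything that is not a recognizable contig header
        _, length, cov = parsed
        total_bases += length * cov
        total_length += length
    if total_length == 0:
        raise ValueError("no parsable contig headers found")
    return total_bases / total_length


def mag_coverage_from_fasta(path):
    """Read only the '>' header lines of one bin FASTA file and compute its coverage."""
    with open(path) as handle:
        return mag_coverage(line for line in handle if line.startswith(">"))
```

Calling `mag_coverage_from_fasta("bin.x.fa")` on each bin file would then give one coverage value per MAG. Comparing abundances across samples would still need normalization by sequencing depth, which this sketch does not attempt.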

bcpd commented 1 year ago

Thank you. Sorry, I forgot the second part. By "dereplication" I meant clustering MAGs at a certain similarity threshold and picking a representative, using (e.g.) FastANI or dRep.

Hydro3639 commented 1 year ago

As I mentioned before, you do not need to dereplicate these recovered MAGs. If you have sequencing datasets from many different samples, recover genomes from each sample individually, and want to collect representative ones, then dereplication is recommended. Otherwise, if you performed binning on a single set of assembled contigs (from one sample, or from many samples that were assembled together), there is no point in dereplicating, because the contigs are non-redundant.
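For the per-sample scenario, the idea of clustering MAGs at a similarity threshold and picking a representative can be illustrated with a small greedy sketch. It is not the algorithm used by dRep or FastANI: it assumes pairwise ANI values have already been computed by some external tool, and the function name `dereplicate`, the quality scores, and the 95% default threshold are all illustrative assumptions.

```python
def dereplicate(quality, ani, threshold=95.0):
    """Greedy dereplication sketch.

    quality   -- dict {mag_name: score}, higher is better (hypothetical score)
    ani       -- dict {frozenset((mag_a, mag_b)): ANI in percent}, precomputed
    threshold -- ANI at or above which two MAGs are treated as redundant
    Returns (representatives, assignment), where assignment maps each MAG
    to the representative chosen for it.
    """
    representatives = []
    assignment = {}
    # Visit MAGs from best to worst quality; ties are broken by name.
    for mag in sorted(quality, key=lambda m: (-quality[m], m)):
        for rep in representatives:
            # Missing pairs are treated as unrelated (ANI 0).
            if ani.get(frozenset((mag, rep)), 0.0) >= threshold:
                assignment[mag] = rep  # redundant with a better MAG
                break
        else:
            representatives.append(mag)  # first of its cluster, keep it
            assignment[mag] = mag
    return representatives, assignment
```

Because MAGs are visited in order of quality, each cluster is represented by its best member, which is usually what is wanted when pooling bins recovered from separate samples.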

bcpd commented 1 year ago

Great, thank you. Understood. The first scenario (sequencing datasets from a few or many samples, followed by per-sample genome recovery) describes my case (and probably that of most people who do metagenomics). While dereplication and quantification are fairly straightforward, it would be really helpful to include both as default outputs. Thanks again.