GaetanBenoitDev / metaMDBG

MetaMDBG: a lightweight assembler for long and accurate metagenomics reads.
MIT License

Feature request: Get a graph of final contigs #4

Open jakobnissen opened 1 year ago

jakobnissen commented 1 year ago

Thanks for creating such a nice assembler.

In the research we're doing, we're looking to exploit the relationship between contigs in the assembly graph. Several other tools, such as Recycler, GraphBin, metaplasmidSPAdes, SCAPP and CCVAE are, like us, extracting information from the relationship between contigs in the graph. The graph is a rich source of information, and I believe more tools will be working on assembly graphs in the future.

We realize that hifiasm-meta does produce a graph, so we could use that, but we believe metaMDBG produces better contigs than hifiasm-meta. As mentioned in #2, metaMDBG can produce a graph, but there is currently no way for the user to relate the graph to the actual output contigs.

Would it be possible to add functionality in metaMDBG, such that a graph of the final contigs can be produced?

GaetanBenoitDev commented 1 year ago

Thanks for using metaMDBG!

Unfortunately, this is a limitation of the multi-k approach: it doesn't produce a nice assembly graph. The only thing you can do is generate intermediate graphs with the "gfa" command, which provides the path of the final contigs in the generated graph.

jakobnissen commented 1 year ago

I won't presume to know more about the inner workings of assemblers than you, but I think SPAdes uses multiple k-mer sizes and still produces a graph. In SPAdes, I believe the information from smaller k-mers is used to construct the graph of larger k-mers, and the final contigs are extracted from the final graph only. Therefore, every output contig can be found in the final graph. If I understand your preprint correctly, something similar happens in metaMDBG: the mContigs are extracted from the final graph with the largest k-mer size, after which each mContig is converted to a contig. So, isn't there a 1-to-1 correspondence between paths in the final graph (consisting of multiple unitigs), mContigs, and contigs?

GaetanBenoitDev commented 1 year ago

The final contigs are extracted from thousands of variants of the assembly graph with different properties (overlap size, abundance filter, etc.). I can't provide all of them, so the gfa command generates the initial one. The only thing you can do then is use the unitig -> contig mapping to study contig relationships.
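For readers who want to try the unitig -> contig approach, here is a minimal Python sketch of one way it could work. It is not metaMDBG's own code, and it assumes two things that the thread does not specify: that the graph is a GFA1 file with standard tab-separated `L` (link) lines, and that the mapping has already been loaded into a dict from contig name to a list of unitig names. The function names `parse_gfa_links` and `contig_links` are hypothetical.

```python
from collections import defaultdict


def parse_gfa_links(gfa_text):
    """Collect undirected unitig pairs from GFA1 'L' lines (orientation ignored)."""
    links = set()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # GFA1 link line: L <from> <from_orient> <to> <to_orient> <overlap>
        if fields[0] == "L" and len(fields) >= 5:
            a, b = fields[1], fields[3]
            if a != b:
                links.add(frozenset((a, b)))
    return links


def contig_links(unitig_links, contig_to_unitigs):
    """Return sorted contig pairs that share a unitig or touch linked unitigs.

    contig_to_unitigs is an ASSUMED in-memory form of the mapping:
    {contig_name: [unitig_name, ...]}.
    """
    unitig_to_contigs = defaultdict(set)
    for contig, unitigs in contig_to_unitigs.items():
        for unitig in unitigs:
            unitig_to_contigs[unitig].add(contig)

    pairs = set()
    # Contigs whose paths pass through the same unitig.
    for contigs in unitig_to_contigs.values():
        for c1 in contigs:
            for c2 in contigs:
                if c1 < c2:
                    pairs.add((c1, c2))
    # Contigs whose unitigs are joined by a link in the graph.
    for link in unitig_links:
        a, b = tuple(link)
        for c1 in unitig_to_contigs.get(a, ()):
            for c2 in unitig_to_contigs.get(b, ()):
                if c1 != c2:
                    pairs.add(tuple(sorted((c1, c2))))
    return pairs
```

The result is a set of contig pairs that can be loaded into any graph library. Because the gfa command exports only the initial graph, links found this way are a proxy for contig relationships, not the graph the contigs were actually extracted from.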

jakobnissen commented 1 year ago

I see. That's unfortunate for us, but we'll see what we can do. Thank you for explaining.

cjfields commented 1 month ago

@GaetanBenoitDev, with rust-mdbg there is a secondary tool, magic_simplify_meta, that is used to generate the overall graph. It is mentioned here and also in the recent binning publication from Heng Li. So there is some precedent for this, at least with rust-mdbg.

cjfields commented 1 month ago

For others who come across this thread and see the comment above: looking the script over, it's not immediately apparent how magic_simplify_meta would generate a new GFA from multiple steps or GFAs. It appears instead to work from a single GFA, likely one produced by rust-mdbg and not generated in the script.

GaetanBenoitDev commented 1 month ago

Hi, I think you should just use the contig -> unitig mapping information; metaFlye also uses this system. Rust-mdbg can easily provide this graph because it is a very simple assembler. Depending on the methods an assembler uses to generate its contigs, it can be hard to provide such a graph.

There are also things that are hard to represent. Let's say I have solved a species in a single contig, but it has strain variability. It's more natural to preserve that variability in the graph (which means a fragmented graph even for the solved strain) and to represent the solved contig as a path in this graph.
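To make the "contig as a path" idea concrete, here is a small Python sketch under stated assumptions. The toy graph, the unitig names, and the helper `off_path_neighbors` are all hypothetical and do not come from metaMDBG. A strain-level bubble (`u2a` / `u2b`) stays in the graph, the solved contig is the path `u1 -> u2a -> u3`, and the alternative strain is whatever hangs off that path.

```python
def off_path_neighbors(links, path):
    """Return unitigs that are linked to the contig path but are not part of it."""
    on_path = set(path)
    neighbors = set()
    for a, b in links:
        if a in on_path and b not in on_path:
            neighbors.add(b)
        elif b in on_path and a not in on_path:
            neighbors.add(a)
    return neighbors


# Toy graph: u2a and u2b are two strain variants between u1 and u3.
toy_links = [("u1", "u2a"), ("u1", "u2b"), ("u2a", "u3"), ("u2b", "u3")]
# The solved contig, expressed as a path of unitigs through the graph.
solved_contig_path = ["u1", "u2a", "u3"]

variants = off_path_neighbors(toy_links, solved_contig_path)
```

In this toy case `variants` contains only `u2b`, the strain variant that the contig path did not follow.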