ekg / seqwish

alignment to variation graph inducer
MIT License

Differences between Seqwish and minigraph #65

Closed brettChapman closed 1 year ago

brettChapman commented 4 years ago

Hi Erik

I've just come across minigraph (https://github.com/lh3/minigraph), written by the developer of minimap2. I'm curious what the differences are between the graphs built by Seqwish (processing alignments from either Edyeet or minimap2) and those built by minigraph. Have you previously compared results on test cases using your pipeline vs minigraph? Thanks.

ekg commented 4 years ago

Both are building pangenome graphs, but they have very different properties and objectives.

Seqwish is a direct representation of the sequences and alignment set you give it. It is not order dependent (at least not in a meaningful way: node IDs are order dependent but the graph structure and topology aren't). It produces a GFA which is lossless in that you can reconstruct any genome you put into it by walking the P lines encoding its haplotypes.
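To make the "lossless" point concrete, here is a minimal sketch of walking P lines to recover the input sequences. It assumes a GFA v1 file with blunt-ended segments (no overlaps between consecutive path steps); the function names `revcomp` and `paths_from_gfa` and the toy graph are illustrative and are not part of seqwish.

```python
def revcomp(seq):
    """Reverse-complement a DNA string (handles A, C, G, T, N in either case)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def paths_from_gfa(lines):
    """Rebuild each path's sequence from GFA v1 S (segment) and P (path) lines.

    Assumption: segments are blunt-ended, so a path's sequence is simply the
    concatenation of its oriented segments.
    """
    segments = {}
    paths = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]          # S <name> <sequence>
        elif fields[0] == "P":
            paths[fields[1]] = fields[2].split(",")  # P <name> <seg+,seg-,...> <overlaps>

    sequences = {}
    for name, steps in paths.items():
        parts = []
        for step in steps:
            seg_id, orientation = step[:-1], step[-1]
            seq = segments[seg_id]
            parts.append(seq if orientation == "+" else revcomp(seq))
        sequences[name] = "".join(parts)
    return sequences


# Toy graph: a single-base bubble (T vs A) between two shared segments.
toy_gfa = [
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tA",
    "S\t4\tGG",
    "P\thapA\t1+,2+,4+\t*",
    "P\thapB\t1+,3+,4+\t*",
]

for name, seq in paths_from_gfa(toy_gfa).items():
    print(name, seq)
# hapA ACGTGG
# hapB ACGAGG
```

On a real graph you would pass an open file handle instead of the toy list, and compare each reconstructed path against the corresponding input FASTA record.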

Minigraph is a progressive POA model over minimap2 minimizer chains. It starts with one sequence and adds large variation on top of it for each subsequence of each successive sequence that isn't part of a minimizer chain against the graph. The graph is an approximate, reduced, hierarchical representation of the pangenome. It is highly order dependent. It represents only large variation.

Both have their uses. You might use minigraph to collect the unique sequences in a pangenome in a graph form. In contrast, seqwish can be used to get a model of the full base-level relationships between all the genomes you put in. You could also think of embedding minigraph results in a graph induced by seqwish, to provide a hierarchical coordinate system.

It's worth noting that both methods have limitations.

In the case of seqwish, complex local structures can arise due to local alignment ambiguity. This makes it important to locally normalize the graph using MSA (smoothxg) after construction, which retains large SVs while unrolling the small loops that often arise in low-complexity sequence. This is why I implemented the pggb pipeline.

Minigraph does not handle VNTRs that occur in eukaryotic genomes very well. It will tend to expand them out into complex structures. I am still learning about this. It might matter if we want to align reads into the graph, because these open representations contain highly repetitive sequence which will frustrate alignment, and their exact structure might be related to the order of input of the sequences and not biologically relevant patterns.

Curiously, both methods seem to have similar runtime.

A seqwish-based pipeline does depend on an initial all-to-all alignment step that can be very expensive with large numbers of input genomes. It is possible to avoid this by computing only a subset of the possible alignments, and it's of course possible to run the alignments in parallel.
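As a rough illustration of why a subset helps, the sketch below compares the number of genome pairs in an all-to-all scheme with a sparse selection. The sampling scheme (a ring of neighbours plus `k` random partners per genome) and the names `all_pairs` and `sparse_pairs` are assumptions made for this example; they are not what seqwish or pggb does, and whether a given subset yields a good graph depends on the data.

```python
import itertools
import random


def all_pairs(names):
    """Every unordered pair of genomes: n * (n - 1) / 2 alignments."""
    return list(itertools.combinations(names, 2))


def sparse_pairs(names, k, seed=0):
    """Hypothetical sparsification: ring neighbours plus k random partners each.

    The ring keeps every genome linked to at least one other genome; the
    random partners add extra pairs on top. Pairs are stored sorted so that
    (a, b) and (b, a) count once.
    """
    rng = random.Random(seed)
    pairs = set()
    n = len(names)
    if n > 1:
        for i in range(n):
            a, b = names[i], names[(i + 1) % n]
            pairs.add(tuple(sorted((a, b))))
    for a in names:
        others = [x for x in names if x != a]
        for b in rng.sample(others, min(k, len(others))):
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


genomes = [f"genome{i}" for i in range(20)]
print(len(all_pairs(genomes)))           # 190 pairs for 20 genomes
print(len(sparse_pairs(genomes, k=2)))   # at most 20 + 20 * 2 = 60 pairs
```

Each selected pair is an independent alignment job, so the list can also be split across workers or cluster nodes.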

brettChapman commented 4 years ago

Thanks for the explanation. Based on my needs, I'll stick with Seqwish, considering I'm only looking at eukaryotic genomes, and I'm interested in both small and large variations. Minigraph sounds like a useful tool for particular situations.

AndreaGuarracino commented 1 year ago

seqwish's algorithm and implementation, along with a few comparisons with minigraph, are now described in the following publication: https://doi.org/10.1093/bioinformatics/btac743