ekg / edyeet

base-accurate DNA sequence alignments using edlib and mashmap2
MIT License
33 stars 3 forks source link

Integrate libgaba #2

Open ekg opened 4 years ago

ekg commented 4 years ago

I'd like to use adaptive banded global alignment (with affine gaps) instead of edlib. Switching between the different algorithms should be possible.

The best target for this is @ocxtal's https://github.com/ocxtal/libgaba. Given the clean setup for alignment, it's just a matter of pulling in the code.

ocxtal commented 4 years ago

Sorry but, I personally don't recommend adopting adaptive banding for computing long alignments > 1kbp. It often lose track of optimal path when it enters a repetitive region, and breaks alignment around it. If the actual computation can be a bit slow, I think it's much better adopting X-drop extension (dozeu or any other implementation).

If the computation can be refinement after extension by edlib, ksw2 would be the best for that purpose. It only changes the base-to-base correspondence so is expected to preserve rough shape of the resulting graph. Since both adaptive banding and X-drop tends to break alignments at gappy regions, I believe it's better the refeinement is done with a global algorithm, not with an extension.

Thanks,

ekg commented 4 years ago

Thanks @ocxtal, I had been thinking about this after posting it and came to a similar conclusion. But, I had still thought that it might make sense to use libgaba or crop based SW to refine the edlib alignments. Do you think even in the case of refinement it would be better to apply ksw2?

ocxtal commented 3 years ago

If banding matters in a sense of the quality of alignment, my personal conclusion is that the X-drop with anti-diagonal slicing is the best among existing algorithms. The threshold X can be interpreted as an occurrence probability ratio to the (putative) optimal path. Since it is determined based on probability, not on gap length or seed interval, the result becomes more robust and more consistent with the base pairing.

However, I'm not sure if it's good for building genome graphs. Banding algorithms tend to split alignment into smaller fragments by losing track of the optimal path. It makes it harder to interpret disjoint-but-near fragments correctly, because it's always difficult to discriminate true (main) paths from false-positive fragments. Global alignment algorithms are free from such confusion if the bandwidth is enough, since it only follows the result of the previous stage. My thought is that it's good at this moment to make edlib responsible for alignment spans and ksw2 for base-to-base correspondence, because being simple is really important.

ocxtal commented 3 years ago

And below is my, less realistic thoughts on adopting X-drop for building graphs...

There is no way to prove it right now, but I believe this nature of the X-drop algorithm is essential for estimating structural rearrangements more accurately, and building better graphs. Since the Smith-Waterman model does not express duplications, copy-number variants, or translocations, they are always interpreted as indels in the SW alignment paths. Various structural events being confused and not salvaged results in sometimes-merged and sometimes-separated paths in the output graphs. I know there is no simple answer for annotating structural events only from alignment paths, but at least a set of paths should have enough information to discriminate them. All alignment paths being maximal, not interfered by incoming or outgoing gaps, is important as one of its requirements. In this interpretation, the X-drop threshold functions as a demarcation point for the SW model and structural classification algorithms. Gaps of which penalty is smaller than the threshold is always interpreted as true indels, and the others are estimated to be one of several events, always by taking into account the positional relationship with the surrounding paths. It is important again that the X-drop threshold can be interpreted on a probabilistic model so that the results produced by this method are reasonably interpretable.

If we think about such things, it's important not to miss short alignment paths around long ones. However, this idea is the opposite of current major algorithms that make computation more efficient by actively filtering out such short (locally or distantly repetitive) fragments. I'm curious how to balance them for building graphs better, but it's beyond my thought for now...😔

ekg commented 3 years ago

I am looking into using ksw2. I guess I will run it along the edlib alignment path, varying the bandwidth and xdrop as suggested by the alignment path. It seems straightforward enough for a test.

For pangenome construction, it's often simple enough to require that a contiguous match is at least k-bp long before considering it. In effect, edlib-based global alignment generates regions where you're unlikely to get k-bp of matches. This is not ideal, and can generate ragged edges around SVs, but it results in highly-variable or structurally-variable regions turning into bubbles in the resulting graph. This is done with seqwish's -k parameter:

      -k[N], --min-match-len=[N]        Filter exact matches below this length.
                                        This can smooth the graph locally and
                                        prevent the formation of complex local
                                        graph topologies from forming due to
                                        differential alignments.

Like the xdrop parameter, k sets a kind of probabilistic bound on the identity between the sequences for them to be brought together in the graph.

To clean up and refine the bubbles, it seems to be enough to run a MSA across the graph on sets of subsequences that co-localize in the graph. This can also ensure that the resulting graph is locally directed and acyclic. Using affine gap global alignment won't obviate the need to do this, because we'll still get local looping structures (we do when using minimap2), but it might reduce the amount of work needed by making the initial graph closer to our objective. (This is what smoothxg does.)