Daniel-Liu-c0deb0t / block-aligner

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.
https://crates.io/crates/block_aligner
MIT License
124 stars 7 forks source link

semi-global alignment #28

Closed jianshu93 closed 4 months ago

jianshu93 commented 4 months ago

hello Daniel,

Finally I need to use block aligner for real-world metagenomic applications. I have long reads from PacBio (1k to 20k), I want to align them in some pairs and they can be highly similar, e.g., > 90% sequence identity but only for the overlapped region, meaning, 2 sequences might only have 50% aligned, and the aligned region is very high identity, which means I need semi-global alignment, I need to know the aligned length (so that I can calculate the alignment ratio). It would be nice if I can also detect a little bit below 90% identity, e.g., 85%. I do not recall where block aligner can do semi-global alignment, that is the gaps opened at both ends will not be penalized (or a very small penalty score like in vsearch --all_pairs_global, but those low penalty gaps will not be in the final alignment). Also read the adaptive banded aligner paper, still not sure how to compute the aligned ratio (e.g., aligned positions/query length). I guess both adaptive banded DB and block aligner can be adapted to compute the semi-global alignment alignment ratio and identity for the aligned region? This is what needed in real world application. Any suggestions? Let me know if I am not clear.

Thanks,

Jianshu

Daniel-Liu-c0deb0t commented 4 months ago

Do you have a seeding process? It would be slow to do semi-global alignment without seeding. Block Aligner supports some options for allowing free gaps at the start or the end of the "query" sequence (see the docs here). However, for these to work well the block size needs to be as large as the entire DP matrix. This is probably too slow for reads up to 20kbp. This is a general limitation of exact semi-global DP aligners. However, if you have seeds, then it is possible do X-drop DP extension to the left and right of the seed location. X-drop extension tells you the location with the highest score reachable from an alignment that starts from the top-left corner of the DP matrix. Doing this twice, once to the left and once to the right of the seed location will give you an interval in the first read that aligns to an interval in the second read. Then, you can take the length of the interval and divide it by the length of your reads to get the aligned fraction.

jianshu93 commented 4 months ago

hello Daniel,

I can but I do not know which one is longer, query or reference sequences (It can be either, X-drop assumes the region of interest is shorter than the reference sequence (So blast, local), correct me if I misunderstood). Currently, I rely on vsearch all pair (semi)global, which is full dynamic programming, no seeding or adaptive banded DP, which is slow. Just curious whether the adaptive banded paper (Suzuki 2017) can be applied efficiently to replace vsearch. Another very interesting thing I see is that, in practice, semi-global alignment is more useful (sequences from real world data can be any length, any identity for prokaryotes) than global but no current paper solves this, e.g. A*, Edlib all focus on edit but not affine gap penalty functions, then the vsearch is full DP, no banded, the closest I saw is Ren 2024 BSAlign, but the semi-global mode is strange, or just wrong when benchmarking the code with vsearch (truth known), not the one I want (that is what was achieved in vsearch all Pairs global).

Jianshu

Daniel-Liu-c0deb0t commented 4 months ago

I think you might be misunderstanding the X-drop mode. Imagine a 2D DP matrix. X-drop computes a path that must start in the top-left corner (the starts of the sequences), and this path can end anywhere where the score is maximized. It essentially aligns the prefix between two sequences. It does not matter whether the query or the reference is longer.

The reason Block Aligner (and other DP aligners) implements this is because almost all aligners (minimap2, etc) use this twice to extend a seed to the left and to the right, obtaining an interval that aligns between two sequences. This is much more efficient than DP semiglobal alignment (even with SIMD) for long sequences, which is O(n^2) time and you cannot apply banding for semiglobal alignment. So, I suggest finding seed matches between reads first with minimizers, then extending those seed matches with X-drop DP. Seeding also allows you to avoid all-pairs comparison, because reads that are too dissimilar will not have seed matches, so you don't have to align them.

If you want to compute the full DP matrix for semiglobal alignment without seeding, then an older library like SSW probably supports that. However, people don't use this directly anymore because its too slow compared to seed-and-extend for long sequences.

jianshu93 commented 4 months ago

Thanks. This is clear, in that case, my problem is essentially a long reads mapping problem. I will probably need seed-chain-extension twice to obtain an interval for either query or reference, depending on which is longer.

Jianshu

Daniel-Liu-c0deb0t commented 4 months ago

I don't understand why you need to seed-chain-extend twice. It should not matter which is longer, one of the sequences will just be soft-clipped. If you are writing this code yourself, you can skip the chaining part (just pick a few seeds to extend) to make it easier to implement.

jianshu93 commented 4 months ago

Hi Daniel,

I followed your suggestion to just use seed-extension via X-Drop. The problem I have now is:

AGGGCGGATGA------- --------GGATGAGTGCA

AGGGCGGATGA ---GGCGGA---

The bold font is the alignment I want, both libgaba, bsalign do trace back from the highest score, back to the initial position (0,0), which mean the left end is closed but the right end is open. In the case for X-drop, when do I stop traceback to obtain the alignment I want, or it is just determined by experiment. Do you know a library that take care of the above 2 cases (actually several cases depending on the length of sequence A and B).

Thanks,

Jianshu