question on block aligner score and lowest identity allowed to be accurate

jianshu93 commented 1 year ago

Hello Daniel,

I have a question about the global alignment score of block aligner, is it metric (https://en.wikipedia.org/wiki/Metric_space)? Especially whether score (a, b) = score (b, a) for global alignment.

Also, what is the lowest identity block aligner can handle, I saw the protein example you have down to ~50% identity, how about even lower and for DNA since it is not full dynamic programming. Wavefront is fast only when 2 sequences are highly identical like, 95% identity or above. I am wondering whether this is also the case for block aligner.

Another question is for semi-global alignment, both vsearch and usearch use a much smaller penalty for gaps at both ends of the 2 sequences compare to gaps in the sequences to approximate "semi-global" because in practice semi-global does not penalize gaps at both ends.

Thanks,

Jianshu

Daniel-Liu-c0deb0t commented 1 year ago

Block aligner isn't guaranteed to be a metric. I think the only issue is that choosing whether to shift right or down is arbitrary when there is a tie in the scores. This can be resolved if you sort the strings right before feeding them into block aligner (so the order is always consistent).

If you want to use block aligner for very low sequence identity alignments, you should use a large initial block size. The larger the block size, the more like full DP alignment, the more likely it is able to be a metric and not produce wrong results. It will be slower the lower the sequence identity, but compared to other algorithms it should still be pretty fast due to SIMD acceleration. If you use a fixed block size (eg., 256-256), then block aligner will behave pretty similarly to banded alignment.

Block aligner supports position specific gap costs, but only for one of the two sequences you are trying to align (sequence to PSSM mode).

jianshu93 commented 1 year ago

Thank you! This is clear. Therefore, block aligner cannot be used for hnsw unless we use very large block size, which is no different to SIMD full dynamic programming.

Thanks again, Jianshu

Daniel-Liu-c0deb0t commented 1 year ago

If you have the chance to, I highly suggest you to try it! I've observed speedups even with a large block size compared to full DP (Farrar's algorithm). What's the length of your sequences and the distribution of error rates for your data?

jianshu93 commented 1 year ago

Hello Daniel,

I will definitely try it. I am having difficulties finding a mathematical prove that (semi)global alignment (semi, begining and end of sequence gap not penalized) is metric (edit distance is metric for sure) with affine gap penalty function. Did you realize anything that is related to this?

Thanks,

Jianshu

Daniel-Liu-c0deb0t commented 1 year ago

Counterexample for semi-global alignment (assume unit costs):

a = CAATT
b = AATT
c = GAATT

Then, d(a, b) = 0, d(b, c) = 0, but d(a, c) = 1, so d(a, b) + d(b, c) < d(a, c). Therefore, semi-global alignment is not a metric.

Can you use DP alignment after finding candidate matches with the HNSW graph to get base-level alignments (you would still use kmer methods for building/querying the HNSW graph)? Then you don't have to worry about the metric requirement for DP. You can always filter candidates returned by HNSW graph search by using the DP alignment score.

jianshu93 commented 1 year ago

It is a good idea！but the problem is kmer based diatance for short sequences like 200bp or so will not be accurate for identity smaller than 70%. So 2 pair of different sequences with 50% and 55% identity for example is no different in hnsw. Thus the navigation will be significantly less efficient and so is recall.

Unless it is huge sequences like we did for genomes，which will always share some kmers so distance can be used though not accurate enough because graph works in this case. Then we can use base alignment based tool for accurate estimation.

Key is that sequences in the database must share something to be able to build an efficient graph. We cannot use a very small kmer like 3 because a sequence could be no different from a random sequence:

k has to be restricted to large values k'≥log_(|Σ|) (N). when, N Is 200 or ave, k is at least 4 or 5, otherwise we will have underestimation of the distance. but if kmer is too larger, For a random mutation rate r∈(0,1), the probability that a kmer belonging to sequence X and Y is not mutated is 1-(1-r)^k≈1-exp(-r/k) , indicating that k≲r^(-1), so k must be small enough to capture the mutation and also big enough to avoid underestimation. Therefore, kmer based method cannot be used to estimate distance for divergent sequences.

Does this make sense to you？

Jianshu

Daniel-Liu-c0deb0t / block-aligner

question on block aligner score and lowest identity allowed to be accurate #18