Open rickbeeloo opened 2 months ago
So like I mentioned in the other issue, there are two approaches you can take:
1. Set `X_DROP` to true for this use case, because you know the start location of your alignment in both cases (which is the start of the suffixes). Note that you can also use `FREE_QUERY_END_GAPS` instead of `X_DROP`. The difference is that `X_DROP` allows the alignment to end before the end of the query.
2. Set `FREE_QUERY_START_GAPS` and `FREE_QUERY_END_GAPS` to true. In this case, your query should be the short sequence and the min block size should be greater than the query length (20-27 in your case, so a min block size of 32 works). This will be slow, as it involves looping over the entire reference sequence for each query. Also, it does not report all matching locations. You can set it to not compute the trace, which means that only the end position of the alignment will be given (as `query_idx` and `reference_idx` in the result). It is mainly useful if you want something that works without caring too much about speed.

In your example, I think you swapped `q` and `r`. Anyway, the `query_idx` and `reference_idx` in the result are the end positions of the alignment. For global and X-drop alignment, the start is known to be at the beginning of the sequences. For free query start gaps, you have to compute the traceback to find the start.
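To illustrate why the end position comes for free while the start needs a traceback, here is a minimal plain dynamic-programming sketch in Rust. It is not Block Aligner's API or algorithm: the function name `semi_global`, the scoring values (+1 match, -1 mismatch, -1 linear gap), and the example sequences are all assumptions made for illustration. The last row of the matrix already tells you where the best alignment ends in the reference; finding where it starts requires walking back through the matrix.

```rust
/// Hypothetical helper (not part of block-aligner): align the whole query
/// inside a longer reference, with free gaps before and after the query.
/// Returns (best score, reference start, reference end), end exclusive.
fn semi_global(query: &[u8], reference: &[u8]) -> (i32, usize, usize) {
    // Assumed toy scoring scheme with a linear gap penalty.
    const MATCH: i32 = 1;
    const MISMATCH: i32 = -1;
    const GAP: i32 = -1;

    let (n, m) = (query.len(), reference.len());
    // dp[i][j]: best score of query[..i] against a reference substring ending at j.
    // Row 0 stays at 0, so skipping a reference prefix costs nothing.
    let mut dp = vec![vec![0i32; m + 1]; n + 1];
    for i in 1..=n {
        dp[i][0] = dp[i - 1][0] + GAP;
    }
    for i in 1..=n {
        for j in 1..=m {
            let sub = if query[i - 1] == reference[j - 1] { MATCH } else { MISMATCH };
            dp[i][j] = (dp[i - 1][j - 1] + sub)
                .max(dp[i - 1][j] + GAP)
                .max(dp[i][j - 1] + GAP);
        }
    }

    // End position: best cell in the last row. No traceback needed for this.
    let (mut j, mut best) = (0usize, dp[n][0]);
    for k in 1..=m {
        if dp[n][k] > best {
            best = dp[n][k];
            j = k;
        }
    }
    let ref_end = j;

    // Start position: walk back through the matrix until the query is consumed.
    let mut i = n;
    while i > 0 {
        if j > 0 {
            let sub = if query[i - 1] == reference[j - 1] { MATCH } else { MISMATCH };
            if dp[i][j] == dp[i - 1][j - 1] + sub {
                i -= 1;
                j -= 1;
                continue;
            }
        }
        if dp[i][j] == dp[i - 1][j] + GAP {
            i -= 1; // query base aligned to a gap
        } else {
            j -= 1; // reference base aligned to a gap
        }
    }
    (best, j, ref_end)
}
```

Calling `semi_global(b"ATGGGC", b"CATGGGCAAAA")` on these made-up sequences places the query at reference positions 1 to 7 (end exclusive). A real implementation that only needs the end position could keep just two rows of the matrix, which is roughly what skipping the trace buys you.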
This is some related code I wrote, it might be helpful: https://github.com/Daniel-Liu-c0deb0t/ANTISEQUENCE/blob/main/src/iter/match_any_reads.rs#L544
Ok, I made a diagram (see https://docs.rs/block-aligner/0.5.1/block_aligner/scan_block/struct.Block.html) that should make the different alignment types clearer.
Hello both, the explanation and diagram are really useful, and I now understand exactly what I need. As an aside, usearch was open-sourced several weeks ago, and I think you can find there exactly what you need for semi-global alignment. However, the alignment positions (start and end) are not easy to obtain, since the usearch semi-global mode does not report them (which sequence is the query is hard to say in semi-global mode, but usearch always assumes the query sequences are the query). The local mode, usearch_local, does report them. Both modes implement the X-drop algorithm for fast database search. For the example you provided, I think you need local alignment with a large X-drop score (usearch -usearch_local), so that you can approximate semi-global mode, which is not easy to do, as Daniel explains (reversing sequences, etc.). Usearch also has some banded DP, but it is not clear how it was done; Block Aligner is very clear about this.
Thanks,
Jianshu
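Since X-drop comes up several times in this thread, here is a minimal Rust sketch of the general idea, assuming a much simpler setting than either Block Aligner or usearch: an ungapped extension to the right of a seed, with +1 for a match and -1 for a mismatch. The function name `x_drop_extend` and its parameters are hypothetical. The point is only to show the stopping rule: extension ends once the running score falls more than `x` below the best score seen so far, so the alignment may end before the end of the query.

```rust
/// Hypothetical helper: ungapped X-drop extension to the right of a seed.
/// Starts at query[q_start] and reference[r_start], and returns
/// (best score, number of bases in the best-scoring extension).
fn x_drop_extend(
    query: &[u8],
    reference: &[u8],
    q_start: usize,
    r_start: usize,
    x: i32,
) -> (i32, usize) {
    let mut score = 0i32;
    let mut best = 0i32;
    let mut best_len = 0usize;

    let q = &query[q_start..];
    let r = &reference[r_start..];
    for (k, (a, b)) in q.iter().zip(r.iter()).enumerate() {
        // Assumed toy scoring: +1 match, -1 mismatch.
        score += if a == b { 1 } else { -1 };
        if score > best {
            best = score;
            best_len = k + 1;
        }
        // Stop once the score has dropped more than x below the best seen.
        if best - score > x {
            break;
        }
    }
    (best, best_len)
}
```

With a small `x` the extension gives up at the first few mismatches; with a larger `x` it can push through a mismatch and keep extending, which is the trade-off behind using a large X-drop score to approximate a longer alignment.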
Thanks! Actually Robert Edgar recently got in touch with me about Block Aligner stuff, probably for his tools.
Hey @Daniel-Liu-c0deb0t!
I have a bunch of sequences (mostly 20 nt, a few 27 nt) which I want to find in a lot (millions) of longer sequences.
Initially I use a k-mer index to find matching pairs, which I could probably extend using block-aligner (perhaps using block-aligner directly could work? Not sure). I'm a bit stuck after reading https://github.com/Daniel-Liu-c0deb0t/block-aligner/issues/28. Using the k-mers as seeds I could find shared prefixes between the query and references, although a k-mer might of course be slightly off, so setting something like `FREE_QUERY_START_GAPS` might help, and `FREE_QUERY_END_GAPS` to terminate earlier. However, from reading the docs I'm not sure what exactly to set to achieve this, as there seem to be quite a few exceptions to keep in mind, e.g. "Note that this has a limitation: the min block size must be greater than the length of the query." Could you provide an example that does something like this?

For example, let's take:
A good alignment would skip the first `C` in the reference, then align `ATGGGC`, and use all the `A`'s as a gap.

Perhaps something like this:
Giving:
But can I now know the positions in the reference? Ideally without computing the cigar as you said it's expensive.
Thanks in advance!