Advice on setting for short query in long reference search

rickbeeloo commented 2 months ago

Hey @Daniel-Liu-c0deb0t!

I have a bunch of sequences (most 20nts, few 27nts) which I want to find in a lot (millions) of longer sequences.

Initially I use a k-mer index, and find matching pairs, which I could probably extend using the block-aligner (perhaps using block aligner directly could work? not sure). I'm a bit stuck after reading https://github.com/Daniel-Liu-c0deb0t/block-aligner/issues/28. Using the k-mers as seeds I could find shared prefixes between the query and references, although a kmer might be slightly of of-course so setting something like FREE_QUERY_START_GAPS might help, and to terminate earlier the FREE_QUERY_END_GAPS. However, I'm not sure when reading the docs what to set exactly to achieve this as there seem quite some exceptions to keep in mind "Note that this has a limitation: the min block size must be greater than the length of the query.". Could you provide an example that does something like this?

For example, lets take:

query: ATGGGC
reference: CATGGGCAAAAAAAAAAAA

A good alignment would be skipping the first C in the reference, and then aligning ATGGGC and using all the A's as a gap.

Perhaps something like this:

let min_block_size = 32;
let max_block_size = 256;

// A gap of length n will cost: open + extend * (n - 1)
let gaps = Gaps { open: -2, extend: -1 };

// Note that PaddedBytes, Block, and Cigar can be initialized with sequence length
// and block size upper bounds and be reused later for shorter sequences, to avoid
// repeated allocations.
let r = PaddedBytes::from_bytes::<NucMatrix>(b"ATGGGC", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"CATGGGCAAAAAAAAAAAA", max_block_size);

// Align with traceback, with x-drop
let mut a = Block::<true, true, true, false, false>::new(q.len(), r.len(), max_block_size);
a.align(&q, &r, &NW1, gaps, min_block_size..=max_block_size, 1000000000);
let res: AlignResult = a.res();

println!("Result: {:?}", res);

let mut cigar = Cigar::new(res.query_idx, res.reference_idx);
// Compute traceback and resolve =/X (matches/mismatches).
a.trace().cigar_eq(&q, &r, res.query_idx, res.reference_idx, &mut cigar);

println!("{}", cigar.to_string())

Giving:

Result: AlignResult { score: 6, query_idx: 7, reference_idx: 6 }
6=

But can I now know the positions in the reference? Ideally without computing the cigar as you said it's expensive.

Thanks in advance!

Daniel-Liu-c0deb0t commented 2 months ago

So like I mentioned in the other issue, there are two approaches you can take:

Use your kmers to find candidate positions in the query and reference that match. Then you can use Block Aligner with X-drop to align (to the right) starting at each position (take the suffix of the two sequences that start at that position). To align to the left, you can reverse your sequences first, then take the suffixes, and then call Block Aligner again. You only need to set X_DROP to true for this use case, because you know the start location of your alignment in both cases (which is the start of the suffixes). Note that you can also use FREE_QUERY_END_GAPS instead of X_DROP. The difference is that X_DROP allows the alignment to end before the end of the query.
Just use Block Aligner, but set both FREE_QUERY_START_GAPS and FREE_QUERY_END_GAPS to true. In this case, your query should be the short sequence and the min block size should be greater than the query length (20-27 in your case, so a min block size of 32 works). This will be slow as it involves looping over the entire reference sequence for each query. Also, it does not report all matching locations. You can set it to not compute the trace, which means that only the end position of the alignment will be given (as query_idx and reference_idx in the result). It is mainly useful if you want something that works without caring too much about speed.

Daniel-Liu-c0deb0t commented 2 months ago

In your example, I think you swapped q and r. Anyways, the query_idx and reference_idx in the result are the end positions of the alignment. For global and X-drop alignment, the start is known to be at the beginning of the sequences. For free query start gaps, you have to compute the traceback to find the start.

This is some related code I wrote, it might be helpful: https://github.com/Daniel-Liu-c0deb0t/ANTISEQUENCE/blob/main/src/iter/match_any_reads.rs#L544

Daniel-Liu-c0deb0t commented 2 months ago

Ok I made a diagram (see https://docs.rs/block-aligner/0.5.1/block_aligner/scan_block/struct.Block.html) that should make the different alignment types more clear.

jianshu93 commented 2 months ago

Hello Both, the explanation (diagram) are really very useful and I understand what exactly I need. Just for fun, usearch is open sourced several weeks ago, I think there you can find exactly what you need to do semi-global alignment, but in this case the alignment position (start and end) is not easy to obtains since usearch semi-global alignment mode do not report that (which is the query is hard to say in semi-global mode but usearch always assume the query sequences are the query) but the local mode, usearch_local does report them, both mode implemented X-Drop algorithm for fast database search. In your example provided, I think you need local alignment with large x-drop score (usearch -usearch_local), so that you can approximate semi-global mode which is very not easy as Daniel explains (reverse sequences et.al.). Usearch also has some banded DP but it seems it is not clear how it was done, Block Aligner is very clear about this.

Thanks,

Jianshu

Daniel-Liu-c0deb0t commented 2 months ago

Thanks! Actually Robert Edgar recently got in touch with me about Block Aligner stuff, probably for his tools.

Daniel-Liu-c0deb0t / block-aligner

Advice on setting for short query in long reference search #30