Daniel-Liu-c0deb0t / block-aligner

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.
https://crates.io/crates/block_aligner
MIT License
124 stars 7 forks source link

Advice on setting for short query in long reference search #30

Open rickbeeloo opened 2 months ago

rickbeeloo commented 2 months ago

Hey @Daniel-Liu-c0deb0t!

I have a bunch of sequences (most 20nts, few 27nts) which I want to find in a lot (millions) of longer sequences.

Initially I use a k-mer index, and find matching pairs, which I could probably extend using the block-aligner (perhaps using block aligner directly could work? not sure). I'm a bit stuck after reading https://github.com/Daniel-Liu-c0deb0t/block-aligner/issues/28. Using the k-mers as seeds I could find shared prefixes between the query and references, although a kmer might be slightly of of-course so setting something like FREE_QUERY_START_GAPS might help, and to terminate earlier the FREE_QUERY_END_GAPS. However, I'm not sure when reading the docs what to set exactly to achieve this as there seem quite some exceptions to keep in mind "Note that this has a limitation: the min block size must be greater than the length of the query.". Could you provide an example that does something like this?

For example, lets take:

query: ATGGGC
reference: CATGGGCAAAAAAAAAAAA

A good alignment would be skipping the first C in the reference, and then aligning ATGGGC and using all the A's as a gap.

Perhaps something like this:

let min_block_size = 32;
let max_block_size = 256;

// A gap of length n will cost: open + extend * (n - 1)
let gaps = Gaps { open: -2, extend: -1 };

// Note that PaddedBytes, Block, and Cigar can be initialized with sequence length
// and block size upper bounds and be reused later for shorter sequences, to avoid
// repeated allocations.
let r = PaddedBytes::from_bytes::<NucMatrix>(b"ATGGGC", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"CATGGGCAAAAAAAAAAAA", max_block_size);

// Align with traceback, with x-drop
let mut a = Block::<true, true, true, false, false>::new(q.len(), r.len(), max_block_size);
a.align(&q, &r, &NW1, gaps, min_block_size..=max_block_size, 1000000000);
let res: AlignResult = a.res();

println!("Result: {:?}", res);

let mut cigar = Cigar::new(res.query_idx, res.reference_idx);
// Compute traceback and resolve =/X (matches/mismatches).
a.trace().cigar_eq(&q, &r, res.query_idx, res.reference_idx, &mut cigar);

println!("{}", cigar.to_string())

Giving:

Result: AlignResult { score: 6, query_idx: 7, reference_idx: 6 }
6=

But can I now know the positions in the reference? Ideally without computing the cigar as you said it's expensive.

Thanks in advance!

Daniel-Liu-c0deb0t commented 2 months ago

So like I mentioned in the other issue, there are two approaches you can take:

Daniel-Liu-c0deb0t commented 2 months ago

In your example, I think you swapped q and r. Anyways, the query_idx and reference_idx in the result are the end positions of the alignment. For global and X-drop alignment, the start is known to be at the beginning of the sequences. For free query start gaps, you have to compute the traceback to find the start.

This is some related code I wrote, it might be helpful: https://github.com/Daniel-Liu-c0deb0t/ANTISEQUENCE/blob/main/src/iter/match_any_reads.rs#L544

Daniel-Liu-c0deb0t commented 2 months ago

Ok I made a diagram (see https://docs.rs/block-aligner/0.5.1/block_aligner/scan_block/struct.Block.html) that should make the different alignment types more clear.

jianshu93 commented 2 months ago

Hello Both, the explanation (diagram) are really very useful and I understand what exactly I need. Just for fun, usearch is open sourced several weeks ago, I think there you can find exactly what you need to do semi-global alignment, but in this case the alignment position (start and end) is not easy to obtains since usearch semi-global alignment mode do not report that (which is the query is hard to say in semi-global mode but usearch always assume the query sequences are the query) but the local mode, usearch_local does report them, both mode implemented X-Drop algorithm for fast database search. In your example provided, I think you need local alignment with large x-drop score (usearch -usearch_local), so that you can approximate semi-global mode which is very not easy as Daniel explains (reverse sequences et.al.). Usearch also has some banded DP but it seems it is not clear how it was done, Block Aligner is very clear about this.

Thanks,

Jianshu

Daniel-Liu-c0deb0t commented 2 months ago

Thanks! Actually Robert Edgar recently got in touch with me about Block Aligner stuff, probably for his tools.