Open nrminor opened 7 months ago
Thanks for trying this crate out!
You do not need to modify the PaddedBytes
construction lines because they only need to know the type of the matrix (NucMatrix
, AAMatrix
, etc.). This affects how the alphabet is encoded internally in the PaddedBytes
.
To use your newly created scoring_matrix
, you need to change the following line:
a.align(&q, &r, &NW1, gaps, min_block_size..=max_block_size, 0);
to
a.align(&q, &r, &scoring_matrix, gaps, min_block_size..=max_block_size, 0);
This is where you actually run the aligner, so you need to provide a concrete instance of a scoring matrix. The NW1
scoring matrix is a predefined matrix with just +1
for matches and -1
for mismatches.
Oh perfect. Thanks for this, Daniel! That makes a lot of sense.
I've written block-aligner
into my tool, but I've run into some trouble with my original idea of scoring mismatches with N as zero, and I wondered if I could ask for some of your insight as someone who has thought about this stuff much more.
I'll give a small example to get at what I'm seeing. Let's say we're comparing sequence A and sequence B, as follows:
>Sequence A
AAATGCCATCGCGTTGAA
>Sequence B
AAATGNNATCGNNNNNNNTAAA
With a simple NW1-style scoring matrix, where gap open is -2 and extend is -1, I imagine the sequences would be aligned something like this:
AAATGCCATCGCGT----TGAA
AAATGNNATCGNNNNNNNTAAA
Does that look right to you? If so, I'd expect the score to reflect 6 mismatches (-6), a gap open (-2), 3 gaps (-3), and 11 matches (+11), for a final score of 0.
However, if we want to update the scoring matrix such that mismatches with N's score to zero, would that mean that the aligner will always preference mismatching with Ns? With the above example, would it then do something like:
AAATGCCATCGCGTTGAA----
AAATGNNATCGNNNNNNNTAAA
According to the updating scoring matrix, this would give 8 matches, a gap open, and 3 gap extends, for an improved final score of 3. For much longer sequences, I suspect this would inflate the apparent score for sequences with N's substantially. In essence, I've achieved exactly what I wanted to avoid: a distance matrix that reflects more about the presence of N's than it does about the similarity between two sequences.
That said, I still think you've built an API that could help solve this! Is there a way you might suggest handling this situation in block-aligner
? For example, is it possible to make the gap penalty lower on N's, but not on normal ATGC bases? Or would you just increase the upside for matches?
Thanks for taking the time to help! --Nick
Interesting scenario! I would probably do something like: increasing the match score (eg., +3) and decreasing the mismatch score (eg., -3) for ATCG and then still penalize mismatches with N slightly (eg., -1)?
Position-specific gap penalties is supported and it offers maximum flexibility, but its a little more involved to use.
Hi Daniel,
I confess I haven't quite figured out how to customize scoring for each nucleotide character, but I think I'm getting close. As you'll recall from my issue on
triple_accel
, I'm looking to compute nucleotide distances for pairs of sequences that ignore characters other than A, T, G, and C. Do I have it right that you can control that inblock-aligner
with the Matrix trait'sset
method? For example:If so, could you provide some guidance on how to use an updated
NucMatrix
in this portion of your readme example?Thanks for the advice and for the excellent crate! --Nick