Closed rickbeeloo closed 2 months ago
Aah nvm, we have shouldn't use the nucmatrix in case of IUPAC then but a byte matrix
Ah I added this to the documentation but it hasn't been pushed to crates.io yet. Anyways, NucMatrix
only supports ACGTN
. You should use AAMatrix
if you need to set custom scores for each character of the alphabet. ByteMatrix
supports all bytes but it cannot be used together with X-drop.
For IUPAC, I would recommend using AAMatrix
.
Hey @Daniel-Liu-c0deb0t , that would be awesome :)
I just tried this:
use block_aligner::{cigar::*, scan_block::*, scores::*};
fn get_iupac_mat() -> AAMatrix {
// Default NW1 matrix to update
let mut scoring_matrix = AAMatrix::new_simple(1, -1);
let nucleotides = [b'A', b'C', b'G', b'T'];
// IUPAC rules
let iupac_pairs = [
(b'R', &[b'A', b'G'] as &[u8]),
(b'Y', &[b'C', b'T']),
(b'S', &[b'G', b'C']),
(b'W', &[b'A', b'T']),
(b'K', &[b'G', b'T']),
(b'M', &[b'A', b'C']),
(b'D', &[b'A', b'G', b'T']),
(b'H', &[b'A', b'C', b'T']),
(b'V', &[b'A', b'C', b'G']),
];
// IUPAC matches = match score 1
for (base, matches) in iupac_pairs {
for &m in matches {
scoring_matrix.set(base, m, 1);
}
}
// IUPAC mismatches = match score -1
for (base, matches) in iupac_pairs {
for nt in nucleotides {
if !matches.contains(&nt) {
scoring_matrix.set(base, nt, -1);
}
}
}
scoring_matrix
}
fn main() {
let min_block_size = 32;
let max_block_size = 256;
// A gap of length n will cost: open + extend * (n - 1)
let gaps = Gaps { open: -2, extend: -1 };
let iupac_matrix = get_iupac_mat();
let r = PaddedBytes::from_bytes::<NucMatrix>(b"ARG", max_block_size);
let q = PaddedBytes::from_bytes::<NucMatrix>(b"AAG", max_block_size);
// Align
let mut a = Block::<false, false, false, true, true>::new(q.len(), r.len(), max_block_size);
a.align(&q, &r, &iupac_matrix, gaps, min_block_size..=max_block_size, 0);
let res: AlignResult = a.res();
println!("Result: {:?}", res);
}
This triggers the debug_assert
here. Can that be ignored? Edit: guess not, it also happens with the regular AAmatrix guess I just messed something up once again? :)
Aah the Paddedbytes need to be AAmatrix now as well of course 😅. Guess then it works will let you know
Ah perhaps PaddedBytes can store the scoring matrix type to verify that the right matrix is used.
Hey @Daniel-Liu-c0deb0t, still working a bit on this. Perhaps you have a good idea for this scenario:
use block_aligner::{cigar::*, scan_block::*, scores::*};
fn get_iupac_mat() -> AAMatrix {
// Default NW1 matrix to update
let mut scoring_matrix = AAMatrix::new_simple(1, -1);
let nucleotides = [b'A', b'C', b'G', b'T'];
// IUPAC rules
let iupac_pairs = [
(b'R', &[b'A', b'G'] as &[u8]),
(b'Y', &[b'C', b'T']),
(b'S', &[b'G', b'C']),
(b'W', &[b'A', b'T']),
(b'K', &[b'G', b'T']),
(b'M', &[b'A', b'C']),
(b'D', &[b'A', b'G', b'T']),
(b'H', &[b'A', b'C', b'T']),
(b'V', &[b'A', b'C', b'G']),
];
// IUPAC matches = match score 1
for (base, matches) in iupac_pairs {
for &m in matches {
scoring_matrix.set(base, m, 1);
}
}
// IUPAC mismatches = match score -1
for (base, matches) in iupac_pairs {
for nt in nucleotides {
if !matches.contains(&nt) {
scoring_matrix.set(base, nt, -1);
}
}
}
scoring_matrix
}
fn main() {
let min_block_size = 32;
let max_block_size = 256;
// A gap of length n will cost: open + extend * (n - 1)
let gaps = Gaps { open: -2, extend: -1 };
let iupac_matrix = get_iupac_mat();
let r = PaddedBytes::from_bytes::<AAMatrix>(b"ARGTGGG", max_block_size);
let q = PaddedBytes::from_bytes::<AAMatrix>(b"AAGAGGG", max_block_size);
// Align
let mut a = Block::<true, true, false, false, false>::new(q.len(), r.len(), max_block_size);
a.align(&q, &r, &iupac_matrix, gaps, min_block_size..=max_block_size, q.len() as i32);
let res: AlignResult = a.res();
println!("res: {:?}", res); // AlignResult { score: 3, query_idx: 3, reference_idx: 3 }
// trace
let mut cigar = Cigar::new(res.query_idx, res.reference_idx);
// Compute traceback and resolve =/X (matches/mismatches).
a.trace().cigar_eq(&q, &r, res.query_idx, res.reference_idx, &mut cigar);
assert_eq!(iupac_matrix.get(b'A', b'R'), 1);
assert_eq!(iupac_matrix.get(b'R', b'A'), 1);
unsafe {for i in (0..cigar.len()).rev() {
let OpLen { op, len } = cigar.get(i);
let q_char = q.get(cigar.len() - i) ;
let r_char = r.get(cigar.len() - i);
println!("{} {:?} {}", q_char, op, r_char);
}}
println!("{}", cigar.to_string());
}
With the prints:
res: AlignResult { score: 5, query_idx: 7, reference_idx: 7 }
0 Eq 0
0 X 17
6 Eq 6
0 X 19
6 Eq 6
1=1X1=1X3=
For the sequences ARGTGGG
and AAGAGGG
(notice IUPAC R at i=1) the cigar becomes a bit hard to parse.
Both A-R (0 X 17)
and T-A (0 X 19)
are considered mismatches, however the former is scored 1
and the latter -1
. When I later calculate the identity I would want to see A-R
as a match but not T-A
as defined by the scoring matrix.
Is there any easy way to do this with block-aligner?
I could use the cigar to walk over q and r for the match region again > query the score matrix > get the score > count matches. But this feels a bit akward.
Edit For example this based on your Format method:
fn count_matches(cigar: &Cigar, q: &[u8], r: &[u8], mat: &AAMatrix) -> usize {
println!("Q: {:?}", q);
println!("R: {:?}", r);
let mut i = 0;
let mut j = 0;
let mut matches: usize = 0;
for x in (0..cigar.len()).rev() {
let OpLen { op, len } = cigar.get(x);
match op {
Operation::M | Operation::Eq | Operation::X => {
for _k in 0..len {
if unsafe { mat.get(q[i], r[j]) == 1 } {
matches += 1;
}
i += 1;
j += 1;
}
},
Operation::I => {
for _k in 0..len {
i += 1;
}
},
Operation::D => {
for _k in 0..len {
j += 1;
}
},
_ => continue
}
}
matches
}
With
let r_org = b"ARGTGGG";
let q_org = b"AAGAGGG";
let r = PaddedBytes::from_bytes::<AAMatrix>(r_org, max_block_size);
let q = PaddedBytes::from_bytes::<AAMatrix>(q_org, max_block_size);
...
let m = count_matches(&cigar, q_org, r_org, &iupac_matrix);
println!("Actual m: {}", m); // 6 (previously this would be 5 based on Eq/M)
Ignore the 'unsafe', I thought it would be nicer to actually use the PaddedBytes
as q
and r
for count_matches
, but didn't figure that out 😃
Ah yeah there isn't an easy way to specify custom "equals" function for the traceback. I think the API could probably be expanded here by accepted either a custom equals matrix/closure or just providing a less error-prone iterator API over the sequences.
PaddedBytes
's get functions are deliberately not for public use because it returns characters in the transformed (padded) positions. Perhaps the API can be extended to have two sets of functions for padded and unpadded versions. What you have now is how I'm using Block Aligner in other applications.
Thanks for experimenting with Block Aligner and writing down your progress, I have many ideas now for improving the API.
I just wanted to add IUPAC support but came across this tiny bug it seems.
The calculation of the character indexes seems to be incorrect as different character pairs can modify the same indexes. At least from the check
assert!(b'A' <= a && a <= b'Z');
I assume this should work for all alphabet character pairs? For example, the characters(b'W', b'C')
and(b'S', b'G')
both modify indexes55
and115
.Reproduce in block-aligner
Reproduce isolated