Daniel-Liu-c0deb0t / block-aligner

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.
https://crates.io/crates/block_aligner
MIT License
124 stars 7 forks source link

[BUG] Incorrect index calculation for score matrix setter #31

Closed rickbeeloo closed 2 months ago

rickbeeloo commented 2 months ago

I just wanted to add IUPAC support but came across this tiny bug it seems.

The calculation of the character indexes seems to be incorrect as different character pairs can modify the same indexes. At least from the check assert!(b'A' <= a && a <= b'Z'); I assume this should work for all alphabet character pairs? For example, the characters (b'W', b'C') and (b'S', b'G') both modify indexes 55 and 115.

Reproduce in block-aligner

fn main() {

    let mut scoring_matrix = NucMatrix::new_simple(1, -1);

    scoring_matrix.set(b'W', b'C', 1);
    assert_eq!(scoring_matrix.get(b'W', b'C'), 1); // PASS, correct score of 1

    scoring_matrix.set(b'G', b'S', -100);
    assert_eq!(scoring_matrix.get(b'W', b'C'), 1); // FAIL, It's not 1 anymore, it's -100

}

Reproduce isolated

fn as_idx(a:u8, b:u8) -> usize {
    ((a & 0b111) as usize) * 16 + ((b & 0b1111) as usize)
}

fn get_indexes(pair: (u8, u8)) -> (usize,usize) {
    let idx1 = as_idx(pair.0, pair.1);
    let idx2 = as_idx(pair.1, pair.0);
    (idx1, idx2)
}

fn main() {
    let pair1 = (b'W', b'C');
    let pair2 = (b'S', b'G');

    let pair1_indexes = get_indexes(pair1); 
    let pair2_indexes = get_indexes(pair2);

    assert_ne!(pair1_indexes.0, pair2_indexes.1);
}
rickbeeloo commented 2 months ago

Aah nvm, we have shouldn't use the nucmatrix in case of IUPAC then but a byte matrix

Daniel-Liu-c0deb0t commented 2 months ago

Ah I added this to the documentation but it hasn't been pushed to crates.io yet. Anyways, NucMatrix only supports ACGTN. You should use AAMatrix if you need to set custom scores for each character of the alphabet. ByteMatrix supports all bytes but it cannot be used together with X-drop.

For IUPAC, I would recommend using AAMatrix.

rickbeeloo commented 2 months ago

Hey @Daniel-Liu-c0deb0t , that would be awesome :)

I just tried this:

use block_aligner::{cigar::*, scan_block::*, scores::*};

fn get_iupac_mat() -> AAMatrix {
    // Default NW1 matrix to update
    let mut scoring_matrix = AAMatrix::new_simple(1, -1);
    let nucleotides = [b'A', b'C', b'G', b'T'];

    // IUPAC rules
    let iupac_pairs = [
        (b'R', &[b'A', b'G'] as &[u8]),
        (b'Y', &[b'C', b'T']),
        (b'S', &[b'G', b'C']),
        (b'W', &[b'A', b'T']),
        (b'K', &[b'G', b'T']),
        (b'M', &[b'A', b'C']),
        (b'D', &[b'A', b'G', b'T']),
        (b'H', &[b'A', b'C', b'T']),
        (b'V', &[b'A', b'C', b'G']),
    ];

    // IUPAC matches = match score 1
    for (base, matches) in iupac_pairs {
        for &m in matches {
            scoring_matrix.set(base, m, 1);
        }
    }   

    // IUPAC mismatches = match score -1
    for (base, matches) in iupac_pairs {
        for nt in nucleotides {
            if !matches.contains(&nt) {
                scoring_matrix.set(base, nt, -1);
            }
        }
    }   
    scoring_matrix
}

fn main() {

    let min_block_size = 32;
    let max_block_size = 256;

    // A gap of length n will cost: open + extend * (n - 1)
    let gaps = Gaps { open: -2, extend: -1 };

    let iupac_matrix = get_iupac_mat();

    let r = PaddedBytes::from_bytes::<NucMatrix>(b"ARG", max_block_size);
    let q = PaddedBytes::from_bytes::<NucMatrix>(b"AAG", max_block_size);

    // Align
    let mut a = Block::<false, false, false, true, true>::new(q.len(), r.len(), max_block_size);
    a.align(&q, &r, &iupac_matrix, gaps, min_block_size..=max_block_size, 0);
    let res: AlignResult = a.res();

    println!("Result: {:?}", res);
}

This triggers the debug_assert here. Can that be ignored? Edit: guess not, it also happens with the regular AAmatrix guess I just messed something up once again? :)

rickbeeloo commented 2 months ago

Aah the Paddedbytes need to be AAmatrix now as well of course 😅. Guess then it works will let you know

Daniel-Liu-c0deb0t commented 2 months ago

Ah perhaps PaddedBytes can store the scoring matrix type to verify that the right matrix is used.

rickbeeloo commented 2 months ago

Hey @Daniel-Liu-c0deb0t, still working a bit on this. Perhaps you have a good idea for this scenario:

use block_aligner::{cigar::*, scan_block::*, scores::*};

fn get_iupac_mat() -> AAMatrix {
    // Default NW1 matrix to update
    let mut scoring_matrix = AAMatrix::new_simple(1, -1);
    let nucleotides = [b'A', b'C', b'G', b'T'];

    // IUPAC rules
    let iupac_pairs = [
        (b'R', &[b'A', b'G'] as &[u8]),
        (b'Y', &[b'C', b'T']),
        (b'S', &[b'G', b'C']),
        (b'W', &[b'A', b'T']),
        (b'K', &[b'G', b'T']),
        (b'M', &[b'A', b'C']),
        (b'D', &[b'A', b'G', b'T']),
        (b'H', &[b'A', b'C', b'T']),
        (b'V', &[b'A', b'C', b'G']),
    ];

    // IUPAC matches = match score 1
    for (base, matches) in iupac_pairs {
        for &m in matches {
            scoring_matrix.set(base, m, 1);
        }
    }   

    // IUPAC mismatches = match score -1
    for (base, matches) in iupac_pairs {
        for nt in nucleotides {
            if !matches.contains(&nt) {
                scoring_matrix.set(base, nt, -1);
            }
        }
    }   
    scoring_matrix
}

fn main() {

    let min_block_size = 32;
    let max_block_size = 256;

    // A gap of length n will cost: open + extend * (n - 1)
    let gaps = Gaps { open: -2, extend: -1 };

    let iupac_matrix = get_iupac_mat();

    let r = PaddedBytes::from_bytes::<AAMatrix>(b"ARGTGGG", max_block_size);
    let q = PaddedBytes::from_bytes::<AAMatrix>(b"AAGAGGG", max_block_size);

    // Align
    let mut a = Block::<true, true, false, false, false>::new(q.len(), r.len(), max_block_size);
    a.align(&q, &r, &iupac_matrix, gaps, min_block_size..=max_block_size, q.len() as i32);
    let res: AlignResult = a.res();
    println!("res: {:?}", res); // AlignResult { score: 3, query_idx: 3, reference_idx: 3 }

    // trace
    let mut cigar = Cigar::new(res.query_idx, res.reference_idx);
    // Compute traceback and resolve =/X (matches/mismatches).
    a.trace().cigar_eq(&q, &r, res.query_idx, res.reference_idx, &mut cigar);

    assert_eq!(iupac_matrix.get(b'A', b'R'), 1);
    assert_eq!(iupac_matrix.get(b'R', b'A'), 1);

    unsafe {for i in (0..cigar.len()).rev() {
        let OpLen { op, len } = cigar.get(i);
        let q_char = q.get(cigar.len() - i) ; 
        let r_char = r.get(cigar.len() - i);
        println!("{} {:?} {}", q_char, op, r_char);
    }}

    println!("{}", cigar.to_string());

}

With the prints:

res: AlignResult { score: 5, query_idx: 7, reference_idx: 7 }
0 Eq 0
0 X 17
6 Eq 6
0 X 19
6 Eq 6
1=1X1=1X3=

For the sequences ARGTGGG and AAGAGGG (notice IUPAC R at i=1) the cigar becomes a bit hard to parse. Both A-R (0 X 17) and T-A (0 X 19) are considered mismatches, however the former is scored 1 and the latter -1. When I later calculate the identity I would want to see A-R as a match but not T-A as defined by the scoring matrix.

Is there any easy way to do this with block-aligner?

I could use the cigar to walk over q and r for the match region again > query the score matrix > get the score > count matches. But this feels a bit akward.


Edit For example this based on your Format method:

fn count_matches(cigar: &Cigar, q: &[u8], r: &[u8], mat: &AAMatrix) -> usize {
    println!("Q: {:?}", q);
    println!("R: {:?}", r);
    let mut i = 0;
    let mut j = 0;
    let mut matches: usize = 0;
    for x in (0..cigar.len()).rev() {
        let OpLen { op, len } = cigar.get(x);
        match op {
            Operation::M | Operation::Eq | Operation::X => {
                for _k in 0..len {
                    if  unsafe { mat.get(q[i], r[j]) == 1 } {
                        matches += 1;
                    }
                    i += 1;
                    j += 1;
                }
            },
            Operation::I => {
                for _k in 0..len {
                    i += 1;
                }
            },
            Operation::D => {
                for _k in 0..len {
                    j += 1;
                }
            },
            _ => continue
        }
    }
    matches
}

With

let r_org = b"ARGTGGG";
let q_org = b"AAGAGGG";

let r = PaddedBytes::from_bytes::<AAMatrix>(r_org, max_block_size);
let q = PaddedBytes::from_bytes::<AAMatrix>(q_org, max_block_size);
...
let m = count_matches(&cigar, q_org, r_org, &iupac_matrix);
println!("Actual m: {}", m); // 6  (previously this would be 5 based on Eq/M)

Ignore the 'unsafe', I thought it would be nicer to actually use the PaddedBytes as q and r for count_matches, but didn't figure that out 😃

Daniel-Liu-c0deb0t commented 2 months ago

Ah yeah there isn't an easy way to specify custom "equals" function for the traceback. I think the API could probably be expanded here by accepted either a custom equals matrix/closure or just providing a less error-prone iterator API over the sequences.

PaddedBytes's get functions are deliberately not for public use because it returns characters in the transformed (padded) positions. Perhaps the API can be extended to have two sets of functions for padded and unpadded versions. What you have now is how I'm using Block Aligner in other applications.

Thanks for experimenting with Block Aligner and writing down your progress, I have many ideas now for improving the API.