Daniel-Liu-c0deb0t / block-aligner

SIMD-accelerated library for computing global and X-drop affine gap penalty sequence-to-sequence or sequence-to-profile alignments using an adaptive block-based algorithm.
https://crates.io/crates/block_aligner
MIT License

Support alignment of protein sequences containing "*" #1

Open matchy233 opened 3 years ago

matchy233 commented 3 years ago

I'm using the C API of block-aligner to align protein sequences from the UniProt database. Some of these sequences contain *. Currently, aligning a sequence that contains * with block-aligner causes a segmentation fault. Although users can work around this by mapping * to another supported character, it would be nice if * were supported internally! :)

Daniel-Liu-c0deb0t commented 3 years ago

I'm not sure if * will ever be directly supported internally. It will always have to be mapped to some character that fits within the scoring matrix, so SIMD lookups can be done. Right now, the amino acid matrix supports alphabetical characters A-Z.

There are a couple of ways this could be solved:

  1. An unused letter like J could be used to represent *, like what you said. On the Rust side, the scores in the amino acid matrix can be cloned and changed, but this is not yet exposed in the C API. Without changing the scores, matches and mismatches with J incur a score of -128.
  2. A letter that is not one of the original 20 amino acids but still has predefined scores can be used. For example, * can be translated to X.
  3. Require letters to be mapped to numerical values 0-20, then allow block-aligner to align numerical strings.
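
Options 1 and 2 both come down to rewriting the sequence on the caller's side before it reaches the aligner. A minimal sketch of that preprocessing step follows. `sanitize_protein` is a hypothetical helper, not part of block-aligner, and it makes two assumptions beyond what is stated above: lowercase letters are uppercased, and any other byte outside A-Z is reported as an error rather than passed through.

```rust
/// Hypothetical helper (not part of block-aligner): prepares a protein
/// sequence for a scoring matrix that only covers the letters A-Z.
///
/// `stop_replacement` is the letter that stands in for `*`, for example
/// b'X' (option 2) or b'J' (option 1, which also needs adjusted scores).
///
/// Returns the rewritten bytes, or the position and value of the first
/// byte that cannot be mapped into A-Z.
fn sanitize_protein(seq: &[u8], stop_replacement: u8) -> Result<Vec<u8>, (usize, u8)> {
    // The replacement itself must be a letter the matrix can look up.
    assert!(stop_replacement.is_ascii_uppercase());

    seq.iter()
        .enumerate()
        .map(|(i, &b)| match b {
            // Stop symbol: substitute the chosen letter.
            b'*' => Ok(stop_replacement),
            // Already inside the supported alphabet.
            b'A'..=b'Z' => Ok(b),
            // Assumption: treat lowercase residues as their uppercase form.
            b'a'..=b'z' => Ok(b.to_ascii_uppercase()),
            // Anything else (gaps, digits, whitespace) is rejected.
            _ => Err((i, b)),
        })
        .collect()
}
```

Calling `sanitize_protein(b"MKT*LV*", b'X')` yields the bytes of `MKTXLVX`, which stay within A-Z. Rejecting unknown bytes up front, instead of silently passing them along, keeps unsupported input from reaching the aligner at all.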