applied-bioinformatics / An-Introduction-To-Applied-Bioinformatics

Interactive lessons in bioinformatics.
http://readIAB.org
Other
792 stars 315 forks source link

The implementation of affine gap penalty is not correct. #328

Closed TianyiShi2001 closed 4 years ago

TianyiShi2001 commented 4 years ago

Describe the environment

I am reading IAB version 0.1.4.dev0 and using skbio version 0.5.6. The implementation of affine gap penalty (i.e. specifying gap_open and gap_extend) is conceptually not correct.

Describe the problem

Steps to reproduce

An example it fails to produce the best alignment:

from skbio.alignment._pairwise import global_pairwise_align_nucleotide
from skbio.sequence import DNA
global_pairwise_align_nucleotide(
DNA("GCAAAAGCTGGTATTAAAGT"), DNA("GCATATTACGTGGTGATTCAAGAGGCCTTCG"),
gap_open_penalty=5, gap_extend_penalty=1, match_score=5, mismatch_score=-2, penalize_terminal_gaps=True))

which produces the following result which scores 39 (5*18-2*1-5-6-5-5-5-5-7-6-5):

GCA-A--AAGCTGGT-ATT-AA-A-G---T--
GCATATTACG-TGGTGATTCAAGAGGCCTTCG, 39.0, [(0, 19), (0, 30)])

The actual best alignment, however, gives rise to a score of 45 (5*16-2*4-6-5-10-6):

GCAAA--AGCTGGT-ATTAAAG------T--
GCATATTACGTGGTGATTCAAGAGGCCTTCG

Suggestions

According to orthodox textbooks (see references below), the simplest correct way of dealing with affine gap penalty is to use three (pairs) of matrices corresponding to 1) substitution 2) gap in sequence S, and 3) gap in sequence T. Recently, I implemented an aligner in this way in Javascript, which successfully produces the best alignment shown above.

References

  1. "Section 2.5.2: Affine Gap Penalty Model", in Algorithms in bioinformatics : a practical introduction (2010), Sung, Wing-Kin, CRC Press
  2. "Section 6.9: Alignment with Gap Penalties", in An Introduction to Bioinformatics Algorithms (2004), Neil C. Jones and Pavel A. Pevzner, The MIT Press
TianyiShi2001 commented 4 years ago

Because this is the problem of scikit-bio, let's discuss in scikit-bio's issue.