Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

option to not penalize gaps (or set higher penalty for gaps) #92

Closed daveuu closed 6 years ago

daveuu commented 6 years ago

For the use case of finding oligonucleotide binding sites (e.g., PCR primers) it may be preferable to find the best alignment without gaps. Sometimes two similar alignments differ by a gap instead of a substitution, but edlib prefers the gappy one:

import edlib
edlib.align('TGATTCTCAGCCCTTCGC','GCATGATTCTCAGCCCTCCGTGCA',mode='HW',task='path')
{'alphabetLength': 4,
 'cigar': '14=1I1=1I1=',
 'editDistance': 2,
 'locations': [(3, 18), (3, 19), (3, 20)]}

The cigar for that result generates this alignment if I am not mistaken:

   TGATTCTCAGCCCTTCGC
   ||||||||||||||*|*|
GCATGATTCTCAGCCCT-C-CGTGCA
   TGATTCTCAGCCCTTCGC
   ||||||||||||||*||*
GCATGATTCTCAGCCCTCCGTGCA

I'd like to find gapless alignments as in the latter case.

A related question: if edlib returns several locations, does that mean there are several alignment paths with equal scores? Does each location specify the start and end in the target sequence of a different alignment? In those instances, is it possible to return a cigar string for each alignment? edlib seems to only ever return one cigar string.

Thanks.

Martinsos commented 6 years ago

Hi @daveuu!

Regarding the first part, with gapless alignments. By definition, the latter case still has gaps, it is just that one of them moved to the end. I get an idea of what are you looking for, but I am not sure how to formally define it -> why is the second case better than the first one? You said you want gapless match, but there is still that gap inside, what about that? Would you rather just get the match without the last 4 letters, which eliminates all the gaps? I am not well acquainted with the use case of finding oligonucleotide binding sites, maybe I could get a better idea if you could explain to me in more details? By the way, have you considered using Smith-Waterman algorithm (not supported by edlib) -> maybe that would be a better match for your use case?

Ah yes, the returned locations are somewhat confusing -> I was not sure if and how are people going to use them, so they probably are not super user friendly. Answers to your questions:

  1. Yes, that means there are several alignment paths with equal scores.
  2. Yes, each location specifies start and end in the target sequence of a different alignment.
  3. Yes and no. It is not currently possible to return a cigar string for each alignment, edlib right now returns only the first alignment. However, I could implement it fairly easily as support is there in the code.

In general, there are usually many different alignments with same score. Some of them have different start and end locations, but it can even be that they have the same start and end locations and scores and are still different alignments. What edlib does, is is first find end locations of a few of best alignments. Maybe all of them, maybe not, it does not guarantee it will return all of them. I could implement that, but did not for the sake of speed. I could provide an option that tells edlib to return all of the end locations. Then, for each end location, edlib finds only one start location, even if there is more of them. Again, I could find all of them, but I don't do it because of the speed. Finally, I return cigar only for one alignment. I could return for more of them, but I don't, for the sake of speed. I could add an option for this, to return cigar for more of them, however I would have to figure out how to make this interface -> does user choose for which alignment, or do I return for all of them?

What could help is if you could again explain me what do you need in this case -> how are locations important to you, how do you use them?

I opened an issue #93 to better explain this in the future, as it is not well explained in the docs.

Thanks!

daveuu commented 6 years ago

Hello @Martinsos

For the latter alignment in mode='HW', my understanding is that it has two mismatches (instead of two gaps in the ref.) and no gaps, because gaps at the end are not penalised (because in this infix mode it is free to delete ends in the reference so they are not gaps in the query sequence). In mode='NW' it would make sense to say it has end gaps. I think maybe I didn't plot these correctly . . .

Two internal gaps (no *s), edit distance 2, as edlib returns the cigar and corresponding to the locations (3,20):

   TGATTCTCAGCCCTTCGC
   |||||||||||||| | |
GCATGATTCTCAGCCCT-C-CGTGCA

Two mismatches (2 *s), not internal gaps, edit distance 2 and effectively no gaps at all because mode='HW' (I'm including unaligned ends of the reference for context but in edlib's infix brain it would have deleted them?). Edlib presumably returns this alignment represented as locations (3,18), not the cigar:

   TGATTCTCAGCCCTTCGC
   ||||||||||||||*||*
GCATGATTCTCAGCCCTCCGTGCA

PS if this is correct, location 3 seems to correspond to a position in the reference (2nd sequence) while location 18 is in the query (1st sequence).

On the topic of oligo binding sites: pairwise sequence alignment in Biology is usually a comparison of homologous sequences that have diverged over time, accumulating substitutions, insertions and deletions. In contrast, alignment of an oligonucleotide to a potential binding site is a simplified model of a molecular interaction in which single "gaps" are unlikely because it would require a physical bend in the DNA molecule. Longer loops are possible as frequently happens in RNA stem-loop conformations, but a few consecutive nucleotides not binding (forming a "gap" in the other sequence) would not be expected. Mismatches can be allowed because binding will still happen if most nucleotides complement each other - I should also say when aligning an oligonucleotide to a DNA sequence, we a really testing it against the complementary strand. There are more sophisticated ways of modeling this of course, including predicting the minimum annealing temperature in a PCR given a certain number of mismatches %GC etc. But edlib would be much quicker to initially screen a lot of sequences.

So using edlib to find the best aligned position for an oligonucleotide like a PCR primer is a special case where no gaps should be allowed, but mismatches tolerated.

In the above case it turns out a gapless alignment has an equal edit distance to a gappy alignment, so in that case it might be possible to infer from the list of locations if a gapless alignment was found (but I'm not sure my interpretation of the locations is right - see above). However, it would be more convenient (and reliable?) to be able to tell edlib that gaps are worse than mismatches, or that gaps should not even be considered.

Martinsos commented 6 years ago

Ah sorry, I made a mistake when looking at the second alignment and thought that those are gaps, and not mismatches. Ok I see now and get it.

At the moment I think there is no easy way to make edlib prefer mismatches over gaps. As you said, the best way would be to be able to tell it that gaps are worse than mismatches, however we can't tell that to edlib by its definition -> it is an edit distance library, so all the "mutations" have to be equally penalized.

There are other alignments algorithms that are also fast but you can specify scoring matrix, which can penalize gaps with very high score, so I suggest exploring those for this use case!

Martinsos commented 6 years ago

Closing this one, if you have more thoughts or suggestions do not hesitate to post them!

daveuu commented 6 years ago

Sure, that makes sense, thanks for considering and thanks for edlib :+1:

jianshu93 commented 1 year ago

I have the same issue with this and I am wondering is there a way out to use large penalty scores for gaps within sequences and small penalty scores for gaps at both ends due to some real world facts, e.g., insertion and deletion are more expensive than mutation evolutionarily and easily happen at both ends than in the middle of the sequences. What other tools you recommended for this task? I am asking because edlib is really fast, it this because of simple/sam penalty scores for insertion/deletion and mutation?

Thanks,

Jianshu