Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

Degenerate nucleotides? #82

Closed nextgenusfs closed 6 years ago

nextgenusfs commented 7 years ago

Nice work here, I am looking for a lightweight aligner that I can use in Python for quickly finding primer binding sites (~20 bp). I just installed the python version and running some preliminary tests. Its seems to be fairly quick. I'm also then curious if edlib supports degenerate nucleotide matching, e.g. R is A or G (http://www.bioinformatics.org/sms/iupac.html)? It seems that at least with default settings that it does not?

Martinsos commented 7 years ago

Thanks! I am glad you find edlib useful - yes, it should certainly be among the fastest out there.

Interesting question - it does not support that functionality right now, however I believe it should be able to support it. I would have to take some time and dive in deeper to make sure what would be the performance impact of adding such feature, but it should mostly come to modifying this line so that instead of just checking that characters are equal, we would check if they are defined as equal by a function or matrix that you would pass as part of the edlib config. Then it would come down to propagating that change through whole the codebase - other parts of the algorithm, tests and apps, and checking that performance is not affected.

nextgenusfs commented 7 years ago

If there is a speed reduction to implementing it, would it be possible to add then just an option to use degenerate nucleotides? Probably the main use case for degenerate nucleotides comes from using them in primer sequences, so maybe this is most relevant to shorter sequences (https://github.com/Martinsos/edlib/issues/77). Although longer consensus sequences may also have degenerate nucleotides as output from a variety of tools.

nextgenusfs commented 7 years ago

Edlib is really really great, speeding up my function by more than 10X in comparison to string matching I was doing in Python!! Thanks a lot for this. Although I do need degenerate nucleotide match functionality or will have to do multiple searches if degenerate nucleotide found, so let me know if/where/when I can help!

Martinsos commented 7 years ago

I am glad it helps :)!! I will have to dive deeper into this to be sure about the speed impact and if it can even be done, but I am hopeful that it can and that it will not affect the speed. And yes I would probably make it an additional option. Thanks for offering to help: could you provide me with a short, illustrative example (it can be made up), just so that I am sure I got the concept right? The table you shared with me, could you also explain that a little bit? What I am not sure about is: so we have ACTG sequences and we have this R - where can it appear, both in query and target or just one of them? R matches A or G, but does it match another R (or that can not happen)? I don't have much time right now so I will probably not be able to get to this very soon - I hope I will get to it in the next few weeks, but it may take longer.

nextgenusfs commented 7 years ago

Sure. So basically all I'm doing right now is to do something like the following:

#align a primer to sequence, primer has 0 or more degenerate nucleotides
import edlib
align = edlib.align('GTGARTCATCGAATCTTTG', 'GTGAGTCATCGAATCTTTGAACGCACCTTGCGCTCCTTGGT', mode="HW", k=2)
print(align)
{'editDistance': 1, 'cigar': None, 'locations': [(None, 18)], 'alphabetLength': 5}

#the region that aligns, R should be either A or G, so this should be editDistance=0
GTGARTCATCGAATCTTTG
GTGAGTCATCGAATCTTTG

While it is possible that the target may also have degenerate nucleotides, I think in practice that would be rare for most datasets (and not something I would need). But yes an R in the query should match a R in the target

A python dictionary for "mapping" the degenerate nucleotides would look like this:

DegenNuc = {'M': 'AC',  'R': 'AG', 'W':'AT', 'S':'CG', 'Y':'CT', 'K':'GT', 'V':'ACG', 
    'H':'ACT', 'D':'AGT', 'B':'CGT', 'N':'GATC'}

So basically if a 'M' is in the query sequence, it would then match an 'A', 'C', or 'M' in the target sequence. 'N' is ambiguous meaning it would match any other nucleotide.

Martinsos commented 7 years ago

Awesome, thanks! Ok, everything is clear - I have to say I am excited about this feature, it sounds like it could make Edlib more useful while not breaking he structure too much :). I will keep you posted!

nextgenusfs commented 6 years ago

Hi @Martinsos, are there any updates on this? I've incorporated edlib into my NGS amplicon toolkit, however, before I can release this new version I feel that I need to address the degenerate nucleotides as there are some common primers that people use that have several degenerate nucleotides. Currently, I've dropped the edlib files into my project and am building them during install, so if I knew how to try to get this to work, I could implement in the copy I've got in my project. You mention above that it should be somewhat straight forward to modify that line to look in a matrix of degenerate nucleotides, I'm just not sure how to do this. Thanks for any guidance/help you might have.

Martinsos commented 6 years ago

Hi @nextgenusfs , sorry for that, I was crowded and just didn't get to it. I will take some time this weekend and either come up with a solution or at least provide instructions for you how to do it.

nextgenusfs commented 6 years ago

That would be fantastic! Edlib has already saved me lots of processing time, will be nearly perfect for me needs if will support degenerate nucleotides.

Martinsos commented 6 years ago

@nextgenusfs I implemented the feature we were talking about (92f6a3f6d5cb1d7ed2e688b5092d968a843f9ded)! I did some basic tests and it seems to be working fine -> in the future I will add more exhaustive random tests, but I did not have time for that right now. I created a new release, and I also pushed new python package version. Let me know if anything is unclear! By the way, how do you use edlib? Are you using it as an C++ library, C library, or python package?

nextgenusfs commented 6 years ago

@Martinsos thank you! I was just using the python package as I'm doing string manipulation in Python and using edlib to align. If there is a better way to use it, I'm all ears!

Martinsos commented 6 years ago

Python is completely fine :), I am asking because I am trying to figure out what are most of the people using -> I would love to drop the support for C if nobody is using it, as it complicates development. I am closing this one -> let me know if there are any problems with the new feature.