Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
506 stars 165 forks source link

Runtime error #43

Closed hiraksarkar closed 8 years ago

hiraksarkar commented 8 years ago

edlib.cpp:1062: int obtainAlignment(const unsigned char, const unsigned char, int, const unsigned char, const unsigned char, int, int, int, unsigned char*, int): Assertion `score_ == bestScore' failed.

Martinsos commented 8 years ago

Hi @hiraksarkar, thank you for reporting this! Could you please provide me with input that that you used, so I can replicate this bug?

hiraksarkar commented 8 years ago

auto query =(const unsigned char *)seq[u].c_str(); auto target =(const unsigned char *)seq[v].c_str(); int alphabetLength = 4; int queryLength = seq[u].size(); int targetLength = seq[v].size(); int score, numLocations, alignmentLength; int* startLocations; int* endLocations; unsigned char* alignment; edlibCalcEditDistance(query, queryLength, target, targetLength, alphabetLength, -1, EDLIB_MODE_NW, true, true, &score, &endLocations, &startLocations, &numLocations, &alignment, &alignmentLength); char* cigar; edlibAlignmentToCigar(alignment, alignmentLength, EDLIB_CIGAR_EXTENDED, &cigar); where seq is an array taken from fastq files..

Martinsos commented 8 years ago

@hiraksarkar ok awesome, we are almost there! Could you also provide me with the exact specific pair of sequences that this fails for? Without them, I will probably not be able to replicate this error since I already run a lot of random tests and non of them failed - so if it was something simple I would have caught it.

hiraksarkar commented 8 years ago

Well I am sorry to be a pain in the neck but it failed while I was running it in a batch, So its hard to specify the exact sequence. I will try to run again and report the sequences.. Thanks for quick response !

Martinsos commented 8 years ago

Hey no problem, I am actually happy that you reported it because then I can fix it :)! Yes I figured it was a batch, from your u and v vars - take your time.

hiraksarkar commented 8 years ago

The sequences are CCTCCCACCGCTTTATTTCTTTCGGTTTCGGATGCAAAACaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa and CCTCCCACCGCTTTATTTCTTTCGGTTTCGGATGCAAAACaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

Martinsos commented 8 years ago

Great! I will check them out in the next few days and try to find the problem. What is with all the small a's - is that an error, or should it be like that? Also, these two sequences are identical, correct?

hiraksarkar commented 8 years ago

does edlib support lower cases ?

Martinsos commented 8 years ago

Well edlib accepts numbers (unsigned char), not letters, so it all depends on how you translated those letters into numbers.

hiraksarkar commented 8 years ago

When I do a type conversion it should do it automatically right ? I mean this line auto query =(const unsigned char *)seq[u].c_str()

Martinsos commented 8 years ago

Yes, that is correct. You can check aligner.cpp to see how I read fasta sequences from file and convert them into such sequences of numbers - I should probably make that function a part of library in the future.

I would say it should do it automatically yes (conversion)! But lower case will be different than upper case then. And you defined alphabet length as 4, while this makes it have a length of 5. That is most likely the source of this error!

hiraksarkar commented 8 years ago

I just used boost::to_upper to resolve that but it still giving the same error !

Martinsos commented 8 years ago

Ok, then just let me take a look at it and I will let you know if I find anything!

hiraksarkar commented 8 years ago

Sure

Martinsos commented 8 years ago

@hiraksarkar I got it! Problem is that sequences have to be translated into numbers from 0 to N if alphabet length is N. That is said in edlib.h in comments:

Query and target are represented as arrays of numbers, where each number is                                  
index of corresponding letter in alphabet. So for example if alphabet is ['A','C','T','G']                   
and query string is "AACG" and target string is "GATTCGG" then our input query should be                     
[0,0,1,3] and input target should be [3,0,2,2,1,3,3] (and alphabetLength would be 4).

I completely forgot about this -> I will put an issue to make this more obvious and to also create some helper function to deal with this, since it is sounds boring to have to translate those sequences this way.

Martinsos commented 8 years ago

I added issue for this: #44 .

hiraksarkar commented 8 years ago

So this conversion is not implicit ?

Martinsos commented 8 years ago

You are letting c++ do the conversion, which means A is converted to its ascii code which is 65, C is converted to its ascii code which is 67, and so on. Try printing the values. Instead, you want A to be 0, C to be 1, T to be 2, and G to be 3 (not in that order necessarily, but you do want them to be transformed into space from 0 to 3).

Martinsos commented 8 years ago

@hiraksarkar let me know if this solved the issue so we can close this one!

hiraksarkar commented 8 years ago

Right now I am using enum to by pass the issue so you can close it ... I would definitely shout if I bump into similar problem. Thanks !

Martinsos commented 8 years ago

Great, I am glad it works now! Also, if you find edlib useful, feel free to star it - I love collecting those :D.