Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

feature to have python output actual alignment #132

Closed evanbiederstedt closed 4 years ago

evanbiederstedt commented 5 years ago

This is a (flawed) start, related to the following issue: https://github.com/Martinsos/edlib/issues/127 (and this duplicate: https://github.com/Martinsos/edlib/issues/129)

This uses @yech1990 's example to parse the extended cigar (which is just a compressed version of "alignment" from cpp edlib) from the Cython script edlib.pyx

: A method cigarToNiceAlignment that would take cigar returned by align + whatever else it needs (query and target, maybe mode I am not sure at the moment, check cpp implementation of NICE) and returns string which is NICE alignment. That should be it! And it should mostly be code you already wrote here, just slightly modified

I didn't quite do this, but just revised the align() method.

The problem:

As far as I can tell, ccigar is always empty.

I've tried in the Cython code printing these variables:

string_alignmentLength = <bytes> cresult.alignmentLength

It appears that cresult.alignmentLength is also empty. cedlib.EDLIB_CIGAR_EXTENDED is some sort of integer?

These are the issues I'm running into trying to finish this PR

evanbiederstedt commented 5 years ago

I'm actually not sure what cresult.alignment is---I've tried to access it a few ways, and I get segmentation faults.

I get the same issue at @yech1990 in https://github.com/Martinsos/edlib/issues/127 on April 10th I believe the conditional at if cresult.alignment: is never hit, which is a problem (I think).

Note that I also just created more fields in the result of align(). That appears to be a flexible solution.

Martinsos commented 5 years ago

@evanbiederstedt, it is nice to see that you are doing progress with this :)!

Since python library is working correctly for me, I am guessing the problem might be that you are not running it with correct input parameters: you need to set task to "path" if you want edlib to return alignment/cigar. If it is not set to "path", ccigar and cresult.alignment will be empty. Check the function comments for param task and it should be clear. Maybe we should also mention under comments for returned value that cigar will be returned only if task is set to "path".

Regarding the approach itself, while it is not bad, I am not a fan of putting more unnecessary logic into this main function. The way you did it, it is not optional to choose if NICE alignment should be constructed or not. I would highly prefer having separate function as I mentioned before, so that user can easily use if they want + it does not further complicate / slow down the main function.

evanbiederstedt commented 5 years ago

Regarding the approach itself, while it is not bad, I am not a fan of putting more unnecessary logic into this main function. The way you did it, it is not optional to choose if NICE alignment should be constructed or not. I would highly prefer having separate function as I mentioned before, so that user can easily use if they want + it does not further complicate / slow down the main function.

This is a very good point. It's quite possible that my additions would slow down the method. Yes, I agree with this---I'll implement this as a separate function.

I am not sure what is it that user needs, if you think it is better to separate it upfront, we can go with that. In that case, I would explain that well in the comments and also return structure with nice property names for each of those 3 segments, not just an array/triple.

I think four fields make sense, i.e. four key-value pairs in a dictionary. I think NICE, ref_aln, match_aln, query_aln should work----naturally, I'm happy to revise the names.

evanbiederstedt commented 5 years ago

Hi @Martinsos

Since python library is working correctly for me, I am guessing the problem might be that you are not running it with correct input parameters: you need to set task to "path" if you want edlib to return alignment/cigar. If it is not set to "path", ccigar and cresult.alignment will be empty. Check the function comments for param task and it should be clear.

This is 100% correct. The source code of the current PR works if task="path".

e.g.

>>> import edlib
>>> edlib.align(query="AAGGGGTCTCATATC", target="TAAGGATGGTCCCATTC", task="path")
## outputs
{'match_aln': ' ||||  ||||.||| ||', 'ref_aln': 'TAAGGATGGTCCCAT TC', 
    'query_aln': ' AAGG  GGTCTCATATC', 'editDistance': 5, 
    'cigar': u'1D4=2D4=1X3=1I2=', 'locations': [(0, 16)], 'alphabetLength': 4,
    'NICE': 'TAAGGATGGTCCCAT TC\n ||||  ||||.||| ||\n AAGG  GGTCTCATATC'}

So, I'll revise this into a separate function cigarToNiceAlignment, and pass it over for review. There may be issues with code duplication, etc.

Thanks, Evan

evanbiederstedt commented 5 years ago

Hi @Martinsos

So, I revised the previous proposal, and implemented cigarToNiceAlignment().

(I'm not sure cigarToNiceAlignment() is the best name, but I cannot think of another. Maybe niceAlign()?)

Currently, thinking for potential use cases, I believe I would want the cigar, the NICE alignment, and each component of the NICE alignment.

I'm tempted to suggest we use - instead of spaces to show how the two inputs align.

That is, instead of

TAAGGATGGTCCCAT TC
 ||||  ||||.||| ||
 AAGG  GGTCTCATATC

I wonder if users would prefer

TAAGGATGGTCCCAT-TC
 ||||  ||||.||| ||
-AAGG--GGTCTCATATC

as it's a bit more clear where the InDels are located.

I also changed target to reference, but I don't mind either term.

Please revise documentation/variable names as you see fit---we can then discuss updating the README. Thanks!

evanbiederstedt commented 5 years ago

After thinking about it a bit, I do think that at least ref_aln and query_aln should use a different symbol in place of spaces.

I believe something like:

alignments = {
    'cigar': cigar, 
    'NICE' : "{}\n{}\n{}".format(ref_aln, match_aln, query_aln),  ## f-strings only available from Python 3.6+ sadly
    'ref_aln' : ref_aln.replace(' ', '-'), 
    'match_aln': match_aln,
    'query_aln': query_aln.replace(' ', '-')
}

would result in these strings having -.

Let me know what you think :)

Martinsos commented 5 years ago

@evanbiederstedt I hope all is fine with you - are you still interested in finalizing this PR? I think we are close to done, would be a shame to let it drop!

evanbiederstedt commented 5 years ago

Hey @Martinsos

This is on the to-do list! Apologies so much time has passed---I've been busier than expected.

Let's see if I can finish this in the next 48 hours. Thanks for checking in though! I am definitely not dropping it.

Thanks, Evan

Martinsos commented 5 years ago

Hey no problem, I know how it is! Take your time, just wanted to check if all is ok :).

On Thu, Jun 27, 2019 at 8:06 PM evanbiederstedt notifications@github.com wrote:

Hey @Martinsos https://github.com/Martinsos

This is on the to-do list! Apologies so much time has passed---I've been busier than expected.

Let's see if I can finish this in the next 48 hours. Thanks for checking in though! I am definitely not dropping it.

Thanks, Evan

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Martinsos/edlib/pull/132?email_source=notifications&email_token=AALXFBZTCIBQGHZEY2QV2RLP4T6TLA5CNFSM4HN5HJT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODYX5PLA#issuecomment-506451884, or mute the thread https://github.com/notifications/unsubscribe-auth/AALXFB7XZEY33IUM73HRXW3P4T6TLANCNFSM4HN5HJTQ .

evanbiederstedt commented 4 years ago

Hi @Martinsos

Sorry for the delay---it took far longer than 48 hours.

Please have a look at the recent commits.

There are a few things to discuss, and then we can finish this up:

(1) I think the function requires both the cigar and locations from the output of align(), right?

(2) I believe align() only outputs a cigar if task="path" in align(). If the above outline makes sense, I can write a simple check for this.

(3) The description of parameters could be improved.

Let me know how you think we could proceed. Thanks!

evanbiederstedt commented 4 years ago

Hi @Martinsos I think this is close to done. Do you have more recommendations for the finishing touches, so we can push this over the finish line?

Thanks!

evanbiederstedt commented 4 years ago

Hi @Martinsos Please take another look. Feel free to make changes to this branch :)

evanbiederstedt commented 4 years ago

However, here in python, you actually use the starting position (right?) which means we don't need to figure out the starting position, and therefore we don't need to know the mode, which is great.

I believe that is right, based on target_pos = alignResult["locations"][0][0], correct?

I also added a check here such that locations exists.

The only thing I would ask of you is to confirm that everything works fine for HW by running an example and checking its output.

It looks like it works:

>>> import edlib
>>> resultNW = edlib.align(query="TAAGGATGGTCCCATTC", target="AAGGGGTCTCATATC", mode="NW", task="path")
>>> edlib.getNiceAlignmentFromAlignResult(resultNW, query="TAAGGATGGTCCCATTC", target="AAGGGGTCTCATATC")
{'query_aligned': 'TAAGGATGGTCCCAT-TC', 'matched_aligned': '-||||--||||.|||-||', 'target_aligned': '-AAGG--GGTCTCATATC'}
>>> 
>>> resultHW = edlib.align(query="TAAGGATGGTCCCATTC", target="AAGGGGTCTCATATC", mode="HW", task="path")
>>> edlib.getNiceAlignmentFromAlignResult(resultHW, query="TAAGGATGGTCCCATTC", target="AAGGGGTCTCATATC")
{'query_aligned': 'TAAGGATGGTCCCAT-TC', 'matched_aligned': '-||||--||||.|||-||', 'target_aligned': '-AAGG--GGTCTCATATC'}
>>> resultSHW = edlib.align(query="TAAGGATGGTCCCATTC", target="AAGGGGTCTCATATC", mode="SHW", task="path")
>>> 
>>> edlib.getNiceAlignmentFromAlignResult(resultSHW, query="TAAGGATGGTCCCATTC", target="AAGGGGTCTCATATC")
{'query_aligned': 'TAAGGATGGTCCCAT-TC', 'matched_aligned': '-||||--||||.|||-||', 'target_aligned': '-AAGG--GGTCTCATATC'}

I am wondering, does it make sense to add support for the case when query and target are not strings but bytes or iterable hashable objects? We could try to print those also I guess, but that probably does not make much sense. Ok, I am alright with going just with strings for now.

I'm tempted to just use strings for now; if others have requests, we/they can incorporate within this code via PR.

Ah yes, and one thing is left: while I don't have real tests set up, I do have "poor man's" tests in shape of a script that runs few methods and checks if they failed. So please, add an example or two there, that will test this new function:

Please add additions to the README; that is very important for the user :) I think you have a stronger notion of what should be here than me.

So please, add an example or two there, that will test this new function: https://github.com/Martinsos/edlib/blob/master/bindings/python/test.py .

Tests are now added :)

evanbiederstedt commented 4 years ago

Hi @Martinsos I also added a few lines to the README

After typing this out, I do think getNiceAlignmentFromAlignResult() is too long of method name. Perhaps getNiceAlignment() would suffice. I don't feel strongly about this though.

Please look the changes over, and feel free to make any changes you see fit.

Thanks, Evan

evanbiederstedt commented 4 years ago

Hi @Martinsos

However, please check the suggestions I made and consider approving them.

I've accepted them all!

Also, in README, line 39, it says "Edlib has only one function" -> that is not true anymore. It would be great if you could modify that so it says that there is one main function, and then there is utility function -> the one which you implemented.

Done!

Before merging, please squash all the commits you did into one commit.

Ok, I could do this.

evanbiederstedt commented 4 years ago

Hi @Martinsos

Thanks for the help on this. Everything looks good from my end. I think you need to merge now, as I don't have permissions.

Martinsos commented 4 years ago

@evanbiederstedt awesome!! You are right, I forgot I have to merge -> done. Thanks for all the effort, and I am looking forward to future PRs :D!

evanbiederstedt commented 4 years ago

@Martinsos I'm very happy to collaborate again! (Fork deleted.)