Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License
493 stars 162 forks source link

How to output pairwise alignment in python library? #127

Closed y9c closed 4 years ago

y9c commented 5 years ago

The edlib apps contains cpp code for render the alignment result into "NICE" alignment with printAlignment function.

But I can't find any document for this function in the python library. Is it possible to do that?

y9c commented 5 years ago
#!/usr/bin/env python3

import re

import edlib

def edlib2pair(query: str, ref: str, mode: str = "NW") -> str:
    """
    input:
    query and ref sequence

    output:
    TAAGGATGGTCCCAT TC
     ||||  ||||.||| ||
     AAGG  GGTCTCATATC
    """

    a = edlib.align(query, ref, mode=mode, task="path")
    ref_pos = a["locations"][0][0]
    query_pos = 0
    ref_aln = match_aln = query_aln = ""

    for step, code in re.findall("(\d+)(\D)", a["cigar"]):
        step = int(step)
        if code == "=":
            ref_aln += ref[ref_pos : ref_pos + step]
            ref_pos += step
            query_aln += query[query_pos : query_pos + step]
            query_pos += step
            match_aln += "|" * step
        elif code == "X":
            ref_aln += ref[ref_pos : ref_pos + step]
            ref_pos += step
            query_aln += query[query_pos : query_pos + step]
            query_pos += step
            match_aln += "." * step
        elif code == "D":
            ref_aln += ref[ref_pos : ref_pos + step]
            ref_pos += step
            query_aln += " " * step
            query_pos += 0
            match_aln += " " * step
        elif code == "I":
            ref_aln += " " * step
            ref_pos += 0
            query_aln += query[query_pos : query_pos + step]
            query_pos += step
            match_aln += " " * step
        else:
            pass

    return f"{ref_aln}\n{match_aln}\n{query_aln}"

if __name__ == "__main__":
    REF = "TAAGGATGGTCCCATTC"
    QUERY = "AAGGGGTCTCATATC"
    PAIR = edlib2pair(QUERY, REF, mode="NW")
    print(PAIR)
Martinsos commented 5 years ago

Hey @yech1990 , thanks for this! So there is no support currently for that as you already figure out. NICE is a feature of aligner app/cli, not of edlib library itself. But I am glad to see you managed to implement your own!

If you wish, you could add make a PR with this code correctly incorporated into edlib's python code and I will happy to review, I think it would be nice to have an extra method for this conversion from cigar to nice output.

y9c commented 5 years ago

Hi @Martinsos

I tried to write such a function into edlib.pyx, and noticed that apps/aligner use alignment object as the input, rather than cigar. But the cresult.alignment is always empty, even if it is True.

https://github.com/Martinsos/edlib/blob/f6971c2703b3bf70c1b428dd40904a0314bf35cc/bindings/python/edlib.pyx#L86

Could you please help me to figure out how to manipulate cresult.alignment.

evanbiederstedt commented 5 years ago

Hi @yech1990

Thank you for implementing this! I was just trying to figure out how to return the actual alignment from the python library.

I agree, it would be extremely helpful to have the Python API revised to easily access the actual alignment.

@Martinsos Do you have insight into the above? It would be great to have such a function in the python bindings. I second @yech1990's question :)

Thanks, Evan

Martinsos commented 5 years ago

@yech1990 and @evanbiederstedt thanks for the effort :)!

@yech1990 , I would encourage you to continue with the effort you started and to create a function in edlib.pyx that converts CIGAR into NICE alignment.

You noticed correctly that cpp version of edlib returns "alignment", while python version of edlib (align) returns "cigar". To be more correct, python version returns "extended cigar format" (let's just call it cigar). The thing is, cigar is just compressed version of "alignment" from cpp edlib, so there is no information lost, I would even say it is more powerful (and standard) format.

So don't worry about "alignment", just go with cigar you get from align python method, as you did in your implementation above, that is enough, you are not loosing anything.

To help you with shaping this PR, this is what I would like to see: A method cigarToNiceAlignment that would take cigar returned by align + whatever else it needs (query and target, maybe mode I am not sure at the moment, check cpp implementation of NICE) and returns string which is NICE alignment. That should be it! And it should mostly be code you already wrote here, just slightly modified.

evanbiederstedt commented 5 years ago

returns string which is NICE alignment

I believe this should be accessed via a key from the output of edlib.align(), e.g.

import edlib

result = edlib.align("elephant", "telephone")
print(result["editDistance"])  # 3
print(result["alphabetLength"])  # 8
print(result["locations"])  # [(None, 8)]
print(result["cigar"])  # None
print(result["NICE"]) 
## outputs
### TAAGGATGGTCCCAT TC
###  ||||  ||||.||| ||
###  AAGG  GGTCTCATATC

I wonder if the best solution is to output NICE as a list of strings; that would be one solution easy for Python users to utilize for downstream work. Something like the following:

print(result["NICE"][0])
## TAAGGATGGTCCCAT TC

print(result["NICE"][1])
###  ||||  ||||.||| ||

print(result["NICE"][2])
###  AAGG  GGTCTCATATC
Martinsos commented 5 years ago

@evanbiederstedt I answered the first question in the PR (I am for separate function approach, I explained there why), and regarding the second: I am not sure what is it that user needs, if you think it is better to separate it upfront, we can go with that. In that case, I would explain that well in the comments and also return structure with nice property names for each of those 3 segments, not just an array/triple.

Let's continue discussion in the PR!

evanbiederstedt commented 5 years ago

Hi @Martinsos

Let's continue discussion in the PR!

Agreed! Let's keep everything there :)

Martinsos commented 4 years ago

This was implemented with #132 , good work everyone!