Closed y9c closed 4 years ago
#!/usr/bin/env python3
import re
import edlib
def edlib2pair(query: str, ref: str, mode: str = "NW") -> str:
"""
input:
query and ref sequence
output:
TAAGGATGGTCCCAT TC
|||| ||||.||| ||
AAGG GGTCTCATATC
"""
a = edlib.align(query, ref, mode=mode, task="path")
ref_pos = a["locations"][0][0]
query_pos = 0
ref_aln = match_aln = query_aln = ""
for step, code in re.findall("(\d+)(\D)", a["cigar"]):
step = int(step)
if code == "=":
ref_aln += ref[ref_pos : ref_pos + step]
ref_pos += step
query_aln += query[query_pos : query_pos + step]
query_pos += step
match_aln += "|" * step
elif code == "X":
ref_aln += ref[ref_pos : ref_pos + step]
ref_pos += step
query_aln += query[query_pos : query_pos + step]
query_pos += step
match_aln += "." * step
elif code == "D":
ref_aln += ref[ref_pos : ref_pos + step]
ref_pos += step
query_aln += " " * step
query_pos += 0
match_aln += " " * step
elif code == "I":
ref_aln += " " * step
ref_pos += 0
query_aln += query[query_pos : query_pos + step]
query_pos += step
match_aln += " " * step
else:
pass
return f"{ref_aln}\n{match_aln}\n{query_aln}"
if __name__ == "__main__":
REF = "TAAGGATGGTCCCATTC"
QUERY = "AAGGGGTCTCATATC"
PAIR = edlib2pair(QUERY, REF, mode="NW")
print(PAIR)
Hey @yech1990 , thanks for this! So there is no support currently for that as you already figure out. NICE is a feature of aligner app/cli, not of edlib library itself. But I am glad to see you managed to implement your own!
If you wish, you could add make a PR with this code correctly incorporated into edlib's python code and I will happy to review, I think it would be nice to have an extra method for this conversion from cigar to nice output.
Hi @Martinsos
I tried to write such a function into edlib.pyx
, and noticed that apps/aligner
use alignment
object as the input, rather than cigar
. But the cresult.alignment
is always empty, even if it is True.
Could you please help me to figure out how to manipulate cresult.alignment
.
Hi @yech1990
Thank you for implementing this! I was just trying to figure out how to return the actual alignment from the python library.
I agree, it would be extremely helpful to have the Python API revised to easily access the actual alignment.
@Martinsos Do you have insight into the above? It would be great to have such a function in the python bindings. I second @yech1990's question :)
Thanks, Evan
@yech1990 and @evanbiederstedt thanks for the effort :)!
@yech1990 , I would encourage you to continue with the effort you started and to create a function in edlib.pyx that converts CIGAR into NICE alignment.
You noticed correctly that cpp version of edlib returns "alignment", while python version of edlib (align) returns "cigar". To be more correct, python version returns "extended cigar format" (let's just call it cigar). The thing is, cigar is just compressed version of "alignment" from cpp edlib, so there is no information lost, I would even say it is more powerful (and standard) format.
So don't worry about "alignment", just go with cigar you get from align
python method, as you did in your implementation above, that is enough, you are not loosing anything.
To help you with shaping this PR, this is what I would like to see: A method cigarToNiceAlignment
that would take cigar returned by align
+ whatever else it needs (query
and target
, maybe mode
I am not sure at the moment, check cpp implementation of NICE) and returns string which is NICE alignment. That should be it! And it should mostly be code you already wrote here, just slightly modified.
returns string which is NICE alignment
I believe this should be accessed via a key from the output of edlib.align()
, e.g.
import edlib
result = edlib.align("elephant", "telephone")
print(result["editDistance"]) # 3
print(result["alphabetLength"]) # 8
print(result["locations"]) # [(None, 8)]
print(result["cigar"]) # None
print(result["NICE"])
## outputs
### TAAGGATGGTCCCAT TC
### |||| ||||.||| ||
### AAGG GGTCTCATATC
I wonder if the best solution is to output NICE as a list of strings; that would be one solution easy for Python users to utilize for downstream work. Something like the following:
print(result["NICE"][0])
## TAAGGATGGTCCCAT TC
print(result["NICE"][1])
### |||| ||||.||| ||
print(result["NICE"][2])
### AAGG GGTCTCATATC
@evanbiederstedt I answered the first question in the PR (I am for separate function approach, I explained there why), and regarding the second: I am not sure what is it that user needs, if you think it is better to separate it upfront, we can go with that. In that case, I would explain that well in the comments and also return structure with nice property names for each of those 3 segments, not just an array/triple.
Let's continue discussion in the PR!
Hi @Martinsos
Let's continue discussion in the PR!
Agreed! Let's keep everything there :)
This was implemented with #132 , good work everyone!
The edlib apps contains cpp code for render the alignment result into "NICE" alignment with
printAlignment
function.But I can't find any document for this function in the python library. Is it possible to do that?