Martinsos / edlib

Lightweight, super fast C/C++ (& Python) library for sequence alignment using edit (Levenshtein) distance.
http://martinsos.github.io/edlib
MIT License

start location using HW is None? #83

Closed nextgenusfs closed 7 years ago

nextgenusfs commented 7 years ago

I'm just trying out the Python API and noticed that the start location is always None. Is this the intended behavior?

>>> revalign = edlib.align('GCATATCAATAAGCGGAGGA', 'ATACCCCCCTATCTTAATCATATCAATACGCGGAGGAGTATCGGAAGCGCACCAGG', mode="HW")
>>> revalign
{'editDistance': 2, 'cigar': None, 'locations': [(None, 36)], 'alphabetLength': 4}
nextgenusfs commented 7 years ago

I tried adding the task="path" and then it works the way I would have expected.

>>> revalign = edlib.align('GCATATCAATAAGCGGAGGA', 'ATACCCCCCTATCTTAATCATATCAATACGCGGAGGAGTATCGGAAGCGCACCAGG', mode="HW", task="path") 
>>> revalign
{'editDistance': 2, 'cigar': u'1X10=1X8=', 'locations': [(17, 36)], 'alphabetLength': 4}
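
For anyone reading the locations later: judging from the output above, the end index appears to be inclusive (36 - 17 + 1 = 20, which is the query length), so slicing the target needs `end + 1`. A minimal sketch under that assumption, using the result dictionary copied from the session above rather than calling edlib; `matched_regions` is a hypothetical helper, not part of the library:

```python
query = 'GCATATCAATAAGCGGAGGA'
target = 'ATACCCCCCTATCTTAATCATATCAATACGCGGAGGAGTATCGGAAGCGCACCAGG'

# Result copied from the task="path" call above (not recomputed here).
result = {'editDistance': 2, 'cigar': '1X10=1X8=',
          'locations': [(17, 36)], 'alphabetLength': 4}


def matched_regions(target, locations):
    """Return the target substring for each (start, end) pair.

    Assumes `end` is inclusive, and skips pairs whose start is None
    (which is what task="distance" returns).
    """
    return [target[start:end + 1]
            for start, end in locations
            if start is not None]


regions = matched_regions(target, result['locations'])
print(regions)

# Count mismatching positions between the query and the matched region.
mismatches = sum(a != b for a, b in zip(query, regions[0]))
print(mismatches)
```

The mismatch count here lines up with the reported editDistance only because this particular alignment has no insertions or deletions.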
nextgenusfs commented 7 years ago

Is there a speed cost associated with using task="path"?

nextgenusfs commented 7 years ago

Answered my own question in the help menu, thanks!

align(...)
    Align query with target using edit distance.
    @param {string} query
    @param {string} target
    @param {string} mode  Optional. Alignment method do be used. Possible values are:
            - 'NW' for global (default)
            - 'HW' for infix
            - 'SHW' for prefix.
    @param {string} task  Optional. Tells edlib what to calculate. Less there is to calculate,
            faster it is. Possible value are (from fastest to slowest):
            - 'distance' - find edit distance and end locations in target. Default.
            - 'locations' - find edit distance, end locations and start locations.
            - 'path' - find edit distance, start and end locations and alignment path.
    @param {int} k  Optional. Max edit distance to search for - the lower this value,
            the faster is calculation. Set to -1 (default) to have no limit on edit distance.
    @return Dictionary with following fields:
            {int} editDistance  -1 if it is larger than k.
            {int} alphabetLength
            {[(int, int)]} locations  List of locations, in format [(start, end)].
            {string} cigar  Cigar is a standard format for alignment path.
                Here we are using extended cigar format, which uses following symbols:
                Match: '=', Insertion to target: 'I', Deletion from target: 'D', Mismatch: 'X'.
                e.g. cigar of "5=1X1=1I" means "5 matches, 1 mismatch, 1 match, 1 insertion (to target)".
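
The extended cigar format described in the help text can be unpacked with a few lines of plain Python. This is only a sketch: `parse_cigar` and `summarize` are hypothetical helpers, and it assumes that 'I' consumes a query character while 'D' consumes a target character, which is my reading of "insertion to target" and "deletion from target".

```python
import re


def parse_cigar(cigar):
    """Split an extended cigar string into (count, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r'(\d+)([=XID])', cigar)]


def summarize(cigar):
    """Derive edit distance and consumed lengths from a cigar string.

    Assumption: '=' and 'X' consume one character of both sequences,
    'I' consumes only the query, 'D' consumes only the target.
    """
    counts = {'=': 0, 'X': 0, 'I': 0, 'D': 0}
    for n, op in parse_cigar(cigar):
        counts[op] += n
    return {
        'edit_distance': counts['X'] + counts['I'] + counts['D'],
        'query_length': counts['='] + counts['X'] + counts['I'],
        'target_span': counts['='] + counts['X'] + counts['D'],
    }


print(parse_cigar('5=1X1=1I'))
print(summarize('1X10=1X8='))
```

For the cigar from my example, the target span of 20 agrees with the reported location (17, 36) if the end index is inclusive.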
Martinsos commented 7 years ago

I am glad you managed to solve it on your own! Yes, the cost of using task="path" is not trivial, as you have probably already noticed, but it also depends on the size of the input data. If the query is small compared to the target, using task="path" should have no impact on speed.
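
If you want to check this on your own data, a rough timing harness along these lines should do. It is a sketch, not a proper benchmark: `time_tasks` is a hypothetical helper that takes the align function as an argument and only relies on the `mode` and `task` keyword arguments shown in the help text above. The edlib import is guarded, so the script still runs where the package is not installed.

```python
import timeit


def time_tasks(align, query, target, mode="HW",
               tasks=("distance", "locations", "path"), repeats=5):
    """Return the best wall-clock time in seconds for each task.

    `align` is any callable with the signature
    align(query, target, mode=..., task=...), e.g. edlib.align.
    """
    timings = {}
    for task in tasks:
        runs = timeit.repeat(
            lambda: align(query, target, mode=mode, task=task),
            number=1, repeat=repeats)
        timings[task] = min(runs)  # best of `repeats` single calls
    return timings


try:
    import edlib  # third-party; only needed for the real measurement
except ImportError:
    edlib = None

if edlib is not None:
    query = 'GCATATCAATAAGCGGAGGA'
    target = 'ATACCCCCCTATCTTAATCATATCAATACGCGGAGGAGTATCGGAAGCGCACCAGG'
    for task, seconds in time_tasks(edlib.align, query, target).items():
        print(f"{task}: {seconds:.6f} s")
```

With sequences this short the numbers will mostly be noise; longer queries are where a difference between the tasks would be expected to show up.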