ezorita / seeq

DNA/RNA pattern matching algorithm
GNU General Public License v3.0
21 stars 5 forks source link

Give a formal definition of match count #7

Open gui11aume opened 9 years ago

gui11aume commented 9 years ago

There are several possible definitions of the number of hits when using the Levenshtein distance. The number of positions in the text with a distance less than the threshold is not very useful because every perfect match is flanked with 2 matches with distance at most 1, 2 matches with distance at most 2 etc.

Criteria for counting the matches should be clearly defined. For instance, the pattern GATC has six 2-matches int the text GAATTC and no 1-match. Also, the pattern can have some repeat, which creates additional issues. For instance, the pattern CATC has two perfect but overlapping matches in the text CATCATC.

ezorita commented 9 years ago

My definition was non-overlapping match. So any position of the text can belong only to one match. The algorithm should resume the search one position after the previous match end.

Eduard On May 16, 2015 17:06, "Guillaume Filion" notifications@github.com wrote:

There are several possible definitions of the number of hits when using the Levenshtein distance. The number of positions in the text with a distance less than the threshold is not very useful because every perfect match is flanked with 2 matches with distance at most 1, 2 matches with distance at most 2 etc.

Criteria for counting the matches should be clearly defined. For instance, the pattern GATC has six 2-matches int the text GAATTC and no 1-match. Also, the pattern can have some repeat, which creates additional issues. For instance, the pattern CATC has two perfect but overlapping matches in the text CATCATC.

— Reply to this email directly or view it on GitHub https://github.com/ezorita/seeq/issues/7.

gui11aume commented 9 years ago

The definition is not specific enough. For instance, in the first example there could be two 2-matches (GAA and TTC) or just one (any of GAAT, AATT, ATTC).