GlobalNamesArchitecture / damerau-levenshtein

Calculates edit distance using Damerau-Levenshtein algorithm
MIT License
144 stars 19 forks source link

Waged version for qwerty keyboards #12

Closed mensfeld closed 7 years ago

mensfeld commented 7 years ago

Hey guys,

I have a feature request: would it be possible to wage based on the qwerty keys distance from each other? This would be super useful to detect typosquatting.

dimus commented 7 years ago

do you know any paper or doc that describes such an algorithm? Also I do not understand use cases very well, can you explain?

mensfeld commented 7 years ago

Hello @dimus

Thank you for answering sooo fast.

First of all, let me apologize for not providing better descriptions earlier. Also I would like to thank you for this library as it made my life much easier :)

I don't have any papers, but I believe, that it is called Needlemana-Wunsch algorithms (not 100% sure thought).

My use case is pretty simple: I want to do exactly what Damerau-Levenshtein does but with wages, so the changes that are closer on the keyboard get lower score than changes that require a bigger effort to make.

For example:

dl.distance("Something", "Simething") #=> 1
dl.distance("Something", "Sqmething") #=> 1

They both differ by 1 character (o replaced with either i or q) but i is right next to the o on the keyboard, which means that it is much easier to make a mistake like that.

I've developed (not yet open source as it is WIP) a mechanism to detect typosquatting occurrences for rubygems libraries used in Gemfiles and a variation of this algorithm that would provide me with scoring like that would significantly lower the number of fake positives.

For now I use following code:

https://github.com/audy/phylograph/blob/master/lib/needleman.rb

But it's not exactly what I aim for.

dimus commented 7 years ago

How I understand, Levenshtein algorithm is fine for what you want. So may be something like this should work:

  1. Find out each edit distance event for each character with https://github.com/GlobalNamesArchitecture/damerau-levenshtein#dameraulevenshteindiffer

  2. Figure out the matrix of querty keyboard of each letter for each letter depending on how easy it is to make a typo (it probably exists somewhere??)

  3. Map every edit event with the "querty typo" matrix.

mensfeld commented 7 years ago

@dimus oh! Didn't think about using the differ that way. Thank you!