Closed mensfeld closed 7 years ago
do you know any paper or doc that describes such an algorithm? Also I do not understand use cases very well, can you explain?
Hello @dimus
Thank you for answering sooo fast.
First of all, let me apologize for not providing better descriptions earlier. Also I would like to thank you for this library as it made my life much easier :)
I don't have any papers, but I believe, that it is called Needlemana-Wunsch algorithms (not 100% sure thought).
My use case is pretty simple: I want to do exactly what Damerau-Levenshtein does but with wages, so the changes that are closer on the keyboard get lower score than changes that require a bigger effort to make.
For example:
dl.distance("Something", "Simething") #=> 1
dl.distance("Something", "Sqmething") #=> 1
They both differ by 1 character (o
replaced with either i
or q
) but i
is right next to the o
on the keyboard, which means that it is much easier to make a mistake like that.
I've developed (not yet open source as it is WIP) a mechanism to detect typosquatting occurrences for rubygems libraries used in Gemfiles and a variation of this algorithm that would provide me with scoring like that would significantly lower the number of fake positives.
For now I use following code:
https://github.com/audy/phylograph/blob/master/lib/needleman.rb
But it's not exactly what I aim for.
How I understand, Levenshtein algorithm is fine for what you want. So may be something like this should work:
Find out each edit distance event for each character with https://github.com/GlobalNamesArchitecture/damerau-levenshtein#dameraulevenshteindiffer
Figure out the matrix of querty keyboard of each letter for each letter depending on how easy it is to make a typo (it probably exists somewhere??)
Map every edit event with the "querty typo" matrix.
@dimus oh! Didn't think about using the differ that way. Thank you!
Hey guys,
I have a feature request: would it be possible to wage based on the qwerty keys distance from each other? This would be super useful to detect typosquatting.