Kalkwst / MicroLib

MIT License

:sparkles: Text: Add Jaro-Winkler Distance and Similarity Calculation #6

Closed Kalkwst closed 1 year ago

Kalkwst commented 1 year ago

The Jaro-Winkler similarity is a measure of the similarity between two strings, expressed as a score between 0 and 1. It is based on the number of matching characters between the strings, the number of matched characters that appear out of order, and the length of the common prefix, with a higher score indicating a higher degree of similarity. The Jaro-Winkler distance is simply 1.0 minus the Jaro-Winkler similarity.

The Jaro-Winkler similarity can be thought of as a measure of how closely two strings match. For example, if you were comparing the strings "apple" and "apples", the Jaro-Winkler similarity would be high because there are a lot of matching characters between the two strings. On the other hand, if you were comparing the strings "apple" and "banana", the Jaro-Winkler similarity would be low because there are very few matching characters between the two strings.

The Jaro-Winkler distance works in a similar way, with a low distance indicating a high degree of similarity and a high distance indicating a low degree of similarity. For example, the Jaro-Winkler distance between "apple" and "apples" would be low, while the Jaro-Winkler distance between "apple" and "banana" would be high.

The solution consists of two functions.

The Matches function compares two strings and returns an array with three elements: the number of matching characters between the two strings, the number of half transpositions (matched characters that appear in a different order in the two strings), and the number of matching characters at the start of the strings (the prefix). It first assigns the longer of the two strings to max and the shorter to min, then initializes an array of matching indices for the min string and a boolean array of matched flags for the max string. It then iterates through each character in the min string and, for each one, searches for an unmatched equal character in the max string within a limited window around the same position. If a match is found, the matching index is recorded, the corresponding flag is set, and the match count is incremented. Once all matches have been collected, the matched characters of both strings are read off in order and compared position by position; every position where they differ counts as one half transposition. Finally, the number of matching prefix characters is counted.
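The steps above can be sketched in Python as follows. This is a minimal illustration, not the code from the repository: the name `matches`, the window size `len(max) // 2 - 1` (the usual Jaro choice, which the description only calls "a certain range"), and the cap of 4 prefix characters applied inside this function are all assumptions.

```python
from typing import Tuple


def matches(first: str, second: str) -> Tuple[int, int, int]:
    """Return (match count, half transpositions, common prefix length)."""
    # max_s is the longer string, min_s the shorter one.
    if len(first) > len(second):
        max_s, min_s = first, second
    else:
        max_s, min_s = second, first

    # Assumption: the standard Jaro search window.
    window = max(len(max_s) // 2 - 1, 0)

    match_indexes = [-1] * len(min_s)   # index in max_s matched by min_s[i]
    match_flags = [False] * len(max_s)  # which characters of max_s are taken
    match_count = 0

    for i, ch in enumerate(min_s):
        lo = max(i - window, 0)
        hi = min(i + window + 1, len(max_s))
        for j in range(lo, hi):
            if not match_flags[j] and ch == max_s[j]:
                match_indexes[i] = j
                match_flags[j] = True
                match_count += 1
                break

    # Matched characters of each string, in their own order.
    matched_min = [min_s[i] for i, j in enumerate(match_indexes) if j != -1]
    matched_max = [max_s[j] for j, taken in enumerate(match_flags) if taken]

    # Each position where the two sequences differ is one half transposition.
    half_transpositions = sum(a != b for a, b in zip(matched_min, matched_max))

    # Assumption: the prefix is capped at 4 characters here.
    prefix = 0
    for a, b in zip(first[:4], second[:4]):
        if a != b:
            break
        prefix += 1

    return match_count, half_transpositions, prefix
```

For the examples used earlier, `matches("apple", "apples")` gives `(5, 0, 4)` and `matches("apple", "banana")` gives `(1, 0, 0)`. A swapped pair such as `matches("MARTHA", "MARHTA")` gives `(6, 2, 3)`: the swap of `T` and `H` shows up as two half transpositions.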

The Calculate method computes the Jaro-Winkler similarity between two strings, left and right. If the strings are equal, it returns 1. Otherwise, it calls the Matches method with the two strings and calculates the Jaro similarity from the match count, the half transpositions, and the two string lengths. If the Jaro similarity is less than 0.7, it is returned unchanged. Otherwise, the result is boosted for a shared prefix: the method returns jaro + scalingFactor × prefixLength × (1 − jaro), where scalingFactor is a default scaling factor and prefixLength is the number of common prefix characters (at most the first 4).
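The arithmetic of this step can be sketched as a small Python function that works directly on the three counts. The function name, the scaling factor of 0.1 (the conventional Winkler value, which the description only calls a "default scaling factor"), and the early return of 0 when nothing matches are assumptions made for illustration. The shortcut for equal strings is left to the caller.

```python
def jaro_winkler_from_counts(
    match_count: int,
    half_transpositions: int,
    prefix: int,
    len_left: int,
    len_right: int,
    scaling_factor: float = 0.1,  # assumption: conventional Winkler default
    threshold: float = 0.7,       # boost only applies at or above this score
) -> float:
    """Combine match statistics into a Jaro-Winkler similarity."""
    # Assumption: with no matching characters the similarity is 0.
    if match_count == 0:
        return 0.0

    m = float(match_count)
    # Two half transpositions make up one full transposition.
    transpositions = half_transpositions / 2

    jaro = (m / len_left + m / len_right + (m - transpositions) / m) / 3

    if jaro < threshold:
        return jaro

    # Reward a shared prefix, scaled by how far jaro is from a perfect score.
    return jaro + scaling_factor * prefix * (1 - jaro)


# "apple" vs "apples": 5 matches, 0 half transpositions, prefix of 4.
similarity = jaro_winkler_from_counts(5, 0, 4, len("apple"), len("apples"))
distance = 1.0 - similarity
print(round(similarity, 4), round(distance, 4))  # 0.9667 0.0333
```

With the counts for "apple" and "banana" (1 match, no transpositions, no common prefix) the Jaro similarity is about 0.4556, which is below the 0.7 threshold, so it is returned without the prefix boost.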

The Jaro-Winkler distance is then just the complement of the similarity: 1 minus the similarity.