SYSTRAN / fuzzy-match

Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
MIT License

Fix and clarify the marking of matched words in the pattern #33

Closed. guillaumekln closed this 3 years ago

guillaumekln commented 3 years ago

The first stage of the algorithm iterates over all possible n-grams of the pattern and finds the matching suffixes. We then retrieve the corresponding sentences and, for each sentence, track which words of the pattern also appear in it.
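To make this first stage concrete, here is a minimal Python sketch. It is not the project's code: the names `iter_ngrams`, `contains`, and `track_matches` and the `min_len` parameter are hypothetical, and a naive linear scan stands in for the suffix lookup.

```python
def iter_ngrams(pattern, min_len=2):
    """Yield (offset, length) for every n-gram of the pattern with at least min_len words."""
    n = len(pattern)
    for offset in range(n):
        for length in range(min_len, n - offset + 1):
            yield offset, length


def contains(sentence, ngram):
    """Naive check that ngram occurs as a contiguous run of words in sentence.

    Assumption: this linear scan replaces the suffix lookup used by the real index.
    """
    k = len(ngram)
    return any(sentence[i:i + k] == ngram for i in range(len(sentence) - k + 1))


def track_matches(pattern, sentences, min_len=2):
    """Map each sentence id to the set of pattern word indices covered by a shared n-gram."""
    matched = {}
    for offset, length in iter_ngrams(pattern, min_len):
        ngram = pattern[offset:offset + length]
        for sid, sentence in enumerate(sentences):
            if contains(sentence, ngram):
                # The n-gram occupies pattern positions offset .. offset + length - 1.
                matched.setdefault(sid, set()).update(range(offset, offset + length))
    return matched


pattern = "the quick brown fox jumps".split()
sentences = ["a quick brown dog".split(), "brown fox jumps high".split()]
print(track_matches(pattern, sentences))
```

Each sentence ends up with the set of pattern positions that some shared n-gram covers, which is the per-sentence tracking described above.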

The bug: when marking the matched words in the pattern, the code iterated from word index 0 up to the n-gram length, even though the n-gram starts at an offset within the pattern. As a result, the wrong words were marked whenever the offset was nonzero. This bug has apparently existed since the first version of the project.
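The difference between the two loops can be illustrated with a short Python sketch. It is only an illustration of the indexing issue described here, not the project's actual code; the function and parameter names are hypothetical.

```python
def mark_matched_words_buggy(pattern_len, ngram_len):
    """Old behaviour as described: always marks positions 0 .. ngram_len - 1."""
    covered = [False] * pattern_len
    for i in range(ngram_len):
        covered[i] = True
    return covered


def mark_matched_words_fixed(pattern_len, ngram_offset, ngram_len):
    """Marks the positions the n-gram actually occupies in the pattern."""
    covered = [False] * pattern_len
    for i in range(ngram_offset, ngram_offset + ngram_len):
        covered[i] = True
    return covered


# Pattern: "the quick brown fox jumps"; matched n-gram "brown fox" starts at offset 2.
print(mark_matched_words_buggy(5, 2))     # marks "the quick"
print(mark_matched_words_fixed(5, 2, 2))  # marks "brown fox"
```

The two versions agree only when the n-gram starts at the beginning of the pattern.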

The PR fixes that and tries to improve the naming to make the intent clearer.