SYSTRAN / fuzzy-match

Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target.
MIT License

Move min_seq_len to NGramMatches constructor. #26

ClementChouteau closed 3 years ago

guillaumekln commented 3 years ago

When the pattern size is 1, the previous code bypassed these length checks:

https://github.com/SYSTRAN/fuzzy-match/blob/f5c48febe6969806ba620d15fa14affb6702bb2e/src/fuzzy_match.cc#L465-L477

Do we have a test for this case to make sure the behavior is unchanged?

ClementChouteau commented 3 years ago

Thanks for spotting this, @guillaumekln. There is a test, small_sentence_matches, where we check a single-token match.

The caller sets match_length to 1. min_exact_match is always less than or equal to 1 when p_length == 1; see:

  unsigned compute_min_exact_match(float fuzzy, unsigned p_length)
  {
    const auto differences = (unsigned)std::ceil(p_length * (1.f - fuzzy));
    // we split (p_length - differences) in  (differences + 1) parts
    // the minimum value of the largest part size is obtained by dividing and taking ceil
    return std::ceil((p_length - differences) / (differences + 1.));
  }

_min_seq_len is at least 1.

Therefore the early return for the "lazy injection feature" is never taken when the pattern size is 1, and the behavior is unchanged.
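
As a quick sanity check of the bound claimed for p_length == 1, here is a minimal standalone sketch. compute_min_exact_match is copied from the snippet quoted in this comment; check_single_token_bound and the sampled fuzzy values are hypothetical, added only for illustration, and are not part of the library or its test suite. It only checks the arithmetic, not the length checks in fuzzy_match.cc.

```cpp
#include <cassert>
#include <cmath>
#include <initializer_list>

// Copied from the snippet quoted in this thread.
unsigned compute_min_exact_match(float fuzzy, unsigned p_length)
{
  const auto differences = (unsigned)std::ceil(p_length * (1.f - fuzzy));
  // we split (p_length - differences) in (differences + 1) parts
  // the minimum value of the largest part size is obtained by dividing and taking ceil
  return std::ceil((p_length - differences) / (differences + 1.));
}

// Hypothetical helper (not in the repository): asserts that a single-token
// pattern never requires an exact match longer than 1, for a few sample
// fuzzy thresholds in [0, 1].
void check_single_token_bound()
{
  for (float fuzzy : {0.f, 0.25f, 0.5f, 0.75f, 0.99f, 1.f})
    assert(compute_min_exact_match(fuzzy, 1) <= 1);
}
```

With fuzzy == 1 no differences are allowed, so the single token must match exactly and the result is 1. For any lower threshold in the sampled range one difference is allowed, so the result is 0.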