apache / tsfile

Apache TsFile
https://tsfile.apache.org/
Apache License 2.0

Added an optimized matching framework for standard SQL's LIKE #237

Closed: linxt20 closed this 2 months ago

linxt20 commented 2 months ago

This work modifies the implementation of LIKE in TsFile so that it is handled separately from REGEXP and can be optimized on its own.

The implementation adds two core matching classes, LikePattern and LikeMatcher, and three matching methods: FjsMatcher, NFA, and DFA.

Matcher Overview

The main process is as follows:

  1. Pattern parsing:
    • Escape character processing
    • Computation of pattern statistics: the min and max match length, and the fixed prefix and suffix
    • Fuzzy suffix judgment
    • Matcher selection based on the pattern's features: if the pattern contains only % and literal characters, use the string-search FjsMatcher; if it contains _, use a state machine, with NFA as the default and DFA as the optimized variant
  2. Matching process:
    • Filter out strings whose length falls outside the min and max range from the statistics
    • Filter out strings that do not start with the computed prefix or do not end with the computed suffix
    • Run the selected matcher on the remaining strings
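
The two phases above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the class LikeSketch and all of its members are hypothetical names, a trailing escape character is treated as a literal, and a simple backtracking wildcard matcher stands in for FjsMatcher, NFA, and DFA.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of "parse once, pre-filter, then match"; not the PR's LikePattern/LikeMatcher. */
final class LikeSketch {
    enum Kind { LITERAL, ANY_ONE, ANY_MANY } // literal char, _, %

    static final class Token {
        final Kind kind;
        final char ch;
        Token(Kind kind, char ch) { this.kind = kind; this.ch = ch; }
    }

    final List<Token> tokens = new ArrayList<>();
    final int minLen;            // shortest input that can match
    final int maxLen;            // Integer.MAX_VALUE when the pattern contains %
    final String prefix;         // literal run before the first wildcard
    final String suffix;         // literal run after the last wildcard
    final boolean hasUnderscore; // would select the state-machine matcher

    LikeSketch(String pattern, char escape) {
        int min = 0;
        boolean unbounded = false, underscore = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == escape && i + 1 < pattern.length()) {
                // Escaped character: always a literal, even if it is % or _
                tokens.add(new Token(Kind.LITERAL, pattern.charAt(++i)));
                min++;
            } else if (c == '%') {
                tokens.add(new Token(Kind.ANY_MANY, c));
                unbounded = true;
            } else if (c == '_') {
                tokens.add(new Token(Kind.ANY_ONE, c));
                underscore = true;
                min++;
            } else {
                tokens.add(new Token(Kind.LITERAL, c));
                min++;
            }
        }
        minLen = min;
        maxLen = unbounded ? Integer.MAX_VALUE : min;
        hasUnderscore = underscore;

        // Collect leading literals, then trailing literals that do not overlap them.
        StringBuilder p = new StringBuilder();
        int i = 0;
        while (i < tokens.size() && tokens.get(i).kind == Kind.LITERAL) {
            p.append(tokens.get(i++).ch);
        }
        StringBuilder s = new StringBuilder();
        int j = tokens.size() - 1;
        while (j >= i && tokens.get(j).kind == Kind.LITERAL) {
            s.insert(0, tokens.get(j--).ch);
        }
        prefix = p.toString();
        suffix = s.toString();
    }

    /** Cheap pre-filter: length range first, then prefix and suffix. */
    boolean mayMatch(String s) {
        if (s.length() < minLen || s.length() > maxLen) return false;
        return s.startsWith(prefix) && s.endsWith(suffix);
    }

    /** Full match; a greedy backtracking scan stands in for the real matchers. */
    boolean matches(String s) {
        if (!mayMatch(s)) return false;
        int si = 0, ti = 0, starT = -1, starS = 0;
        while (si < s.length()) {
            Token t = ti < tokens.size() ? tokens.get(ti) : null;
            if (t != null && (t.kind == Kind.ANY_ONE
                    || (t.kind == Kind.LITERAL && t.ch == s.charAt(si)))) {
                si++;
                ti++;
            } else if (t != null && t.kind == Kind.ANY_MANY) {
                starT = ti++;   // remember the % so we can retry from it
                starS = si;
            } else if (starT >= 0) {
                ti = starT + 1; // let the last % absorb one more character
                si = ++starS;
            } else {
                return false;
            }
        }
        while (ti < tokens.size() && tokens.get(ti).kind == Kind.ANY_MANY) ti++;
        return ti == tokens.size();
    }
}
```

With this sketch, a pattern such as ab%cd yields a minimum length of 4, prefix ab, and suffix cd, so an input like abc is rejected by the length check alone and the matcher never runs. The statistics are computed once per pattern and reused for every input string.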