facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Support Oniguruma-based regex functions. #9897

Open spershin opened 5 months ago

spershin commented 5 months ago

Description

Proposal

This is a proposal, open for discussion, to introduce a set of regex functions based on the Oniguruma library.

The main motivation is to use these functions in Presto. Presto mainly uses JONI, a Java port of Oniguruma, to implement its regex functions. Note that Presto supports RE2J as well, but JONI is very well established and is used at large companies like Meta.

We have investigated the main differences in production workloads that currently stop us from migrating from JONI to RE2J, and then to RE2 in Prestissimo. Note that we are far from having covered the whole workload.

So far we have found 8 discrepancies:

Different Results Returned

  1. In RE2J, the word boundary assertions \b (and \B) treat non-ASCII characters as non-word characters, so a boundary is found next to them. In JONI, non-ASCII characters (it is unclear whether all of them or only the alphabetic ones) are considered part of a word.
    JONI: SELECT regexp_like('Zürich', '\brich\b') returns false.
    RE2J: SELECT regexp_like('Zürich', '\brich\b') returns true.
  2. RE2J treats ß and ss as different character sequences. JONI treats them as equal. Note that '(?i)' seems necessary to observe the difference.
    JONI: select regexp_like('weiss', '(?i)Weiß'); returns true.
    RE2J: select regexp_like('weiss', '(?i)Weiß'); returns false.
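The word-boundary discrepancy in item 1 can be reproduced without either regex library. The following is a minimal sketch, not code from JONI or RE2J: `WordMode`, `isWordByte`, `isBoundary` and `containsWholeWord` are hypothetical names, and treating every byte >= 0x80 as a word character is a deliberate simplification of Unicode-aware behavior (real engines decode code points and consult character properties).

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical switch between the two behaviors described in the issue.
enum class WordMode {
  kAsciiOnly,      // RE2J-like: only [A-Za-z0-9_] are word characters.
  kNonAsciiIsWord  // JONI-like simplification: any non-ASCII byte counts too.
};

bool isWordByte(unsigned char b, WordMode mode) {
  const bool asciiWord = (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') ||
      (b >= '0' && b <= '9') || b == '_';
  if (asciiWord) {
    return true;
  }
  // Assumption: every byte of a multi-byte UTF-8 sequence is a word character.
  return mode == WordMode::kNonAsciiIsWord && b >= 0x80;
}

// True if position `pos` sits between a word and a non-word character.
bool isBoundary(std::string_view s, std::size_t pos, WordMode mode) {
  const bool before =
      pos > 0 && isWordByte(static_cast<unsigned char>(s[pos - 1]), mode);
  const bool after =
      pos < s.size() && isWordByte(static_cast<unsigned char>(s[pos]), mode);
  return before != after;
}

// Emulates regexp_like(subject, '\b<word>\b') for a literal word.
bool containsWholeWord(
    std::string_view subject, std::string_view word, WordMode mode) {
  if (word.empty()) {
    return false;
  }
  for (std::size_t pos = subject.find(word); pos != std::string_view::npos;
       pos = subject.find(word, pos + 1)) {
    if (isBoundary(subject, pos, mode) &&
        isBoundary(subject, pos + word.size(), mode)) {
      return true;
    }
  }
  return false;
}

void wordBoundaryExample() {
  // "Zürich" as UTF-8 bytes; ü is 0xC3 0xBC.
  const std::string_view zurich = "Z\xC3\xBC" "rich";
  assert(containsWholeWord(zurich, "rich", WordMode::kAsciiOnly));
  assert(!containsWholeWord(zurich, "rich", WordMode::kNonAsciiIsWord));
}
```

With ASCII-only word characters the byte before `r` is not a word character, so `\brich\b` matches, as reported for RE2J. Once non-ASCII bytes count as word characters the boundary disappears, as reported for JONI.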

Unsupported Features

  1. Multiple ? in a row are not supported.
    JONI: (\b)???????????(\b) works as (\b)(\b). It seems so, at least.
    RE2J: (\b)???????????(\b) fails: error parsing regexp: invalid nested repetition operator: `???`.
  2. Lookbehind patterns are not supported, neither negative, (?<!xyz), nor positive, (?<=xyz).
    JONI: Works as intended.
    RE2J: fails: error parsing regexp: invalid named capture: `(?<!RMA)$`.
  3. Lookahead patterns are not supported, neither negative, (?!xyz), nor positive, (?=xyz).
    JONI: Works as intended.
    RE2J: fails: error parsing regexp: invalid or unsupported Perl syntax: `(?!`.
  4. Some character classes are not supported in \p syntax.
    JONI: Works as intended.
    RE2J: fails: error parsing regexp: invalid character class range: `\p{Punct}`.
    Unsupported classes include: Punct, Cn. Maybe more (needs a bit of testing).
  5. The hex Unicode escape sequence \u is not supported.
    JONI: \u0900 works as intended.
    RE2J: fails: error parsing regexp: invalid escape sequence: `\u`.
  6. Multiple nested {} are not supported.
    JONI: {{{{7}}}} does something; however, it is unclear what exactly.
    RE2J: fails: error parsing regexp: invalid nested repetition operator: `{{7}`.

More Information

  1. RE2 is generally faster than Oni, and we are considering bringing in Oni only for the sake of its features. We could automatically choose one engine or the other for a query, let the user choose, or have different names for the regex functions.
  2. JONI is known to have some regex patterns that cause a 'runaway' thread, i.e. one executing 'indefinitely'. Oni might have a similar problem, and that will need to be dealt with.
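One way the automatic choice mentioned in item 1 could look is a cheap syntactic pre-scan of the pattern. This is only a sketch under stated assumptions: `RegexEngine`, `needsOniguruma` and `chooseEngine` are hypothetical names, not existing Velox APIs, and the scan covers only lookahead, lookbehind and `\u` escapes from the list of unsupported features above. It does not detect `\p{Punct}` or nested repetition, and it does not track character classes, so a pattern such as `[(?=]` is routed to Oniguruma unnecessarily. That error is on the safe side: it only costs speed.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical engine selector; not an existing Velox type.
enum class RegexEngine { kRe2, kOniguruma };

bool hasPrefix(std::string_view text, std::string_view prefix) {
  return text.substr(0, prefix.size()) == prefix;
}

// Returns true if `pattern` uses lookahead, lookbehind or a \u escape,
// which the issue reports as rejected by RE2J.
bool needsOniguruma(std::string_view pattern) {
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    if (pattern[i] == '\\') {
      if (i + 1 < pattern.size() && pattern[i + 1] == 'u') {
        return true; // \uXXXX escape.
      }
      ++i; // Skip the escaped character, e.g. the '(' in "\(".
      continue;
    }
    const std::string_view rest = pattern.substr(i);
    if (hasPrefix(rest, "(?=") || hasPrefix(rest, "(?!") ||
        hasPrefix(rest, "(?<=") || hasPrefix(rest, "(?<!")) {
      return true;
    }
  }
  return false;
}

// Prefer the faster engine unless the pattern needs the extra features.
RegexEngine chooseEngine(std::string_view pattern) {
  return needsOniguruma(pattern) ? RegexEngine::kOniguruma : RegexEngine::kRe2;
}

void chooseEngineExample() {
  assert(chooseEngine("(?<!RMA)$") == RegexEngine::kOniguruma);
  assert(chooseEngine("^[a-z-.]+$") == RegexEngine::kRe2);
  // An escaped parenthesis followed by "?=" is not a lookahead.
  assert(chooseEngine("\\(?=") == RegexEngine::kRe2);
}
```

A scan like this runs once per pattern, so it adds little cost when the pattern is a constant in the query. An alternative with the same effect would be to try compiling with RE2 first and fall back to Oniguruma on a parse error, but that would not catch the cases where both engines accept a pattern and return different results.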

spershin commented 5 months ago

I've done some prototyping, so far implementing only the simplest function, regexp_like() (also the most used one). I ran a comparison of Oni with JONI and found a few discrepancies there as well.

  1. [correctness][JONI] JONI is incorrect when searching for a case-insensitive 'ß': SELECT regexp_like('澳門WS團(Weiẞ Schwarz Of Macau)', '(?i)Weiß'); Oni: true, JONI: false, RE2: false.
  2. [correctness][JONI] JONI ignores a hyphen in a character group if it is not escaped. SELECT regexp_like('csa-arch.co.uk', '^[a-z-.]+$'); incorrectly returns FALSE in JONI, but works properly in Oni and RE2.
  3. [fail][Oniguruma] Oniguruma does not accept '^?': SELECT regexp_like('0 0 * * 2,6', '^? ? \* \* [0-6]|@weekly'); This one is probably JONI's issue, because the pattern '^?' is not valid. It is also useless, as it just skips an arbitrary number of characters from the very beginning and is equivalent to not specifying anything.
  4. [fail][Oniguruma] error parsing regexp: invalid escape sequence: \u: select regexp_like('claim_text', '^([\u0020-\u02AF\u2000-\u20CF])*$');

I am starting to work on implementing regexp_replace() (the second most used function) to see how it fares. It is harder than with RE2, because RE2 has a function that we call and Oni does not, so we need to implement the replace code ourselves.
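For reference, a hand-written replace loop on top of a search-only primitive could be sketched as below. This is not the prototype's code: `Match`, `FindNext`, `replaceAll` and `literalFinder` are hypothetical names, and the matcher is abstracted as a callback so the sketch compiles without Oniguruma. In a real implementation the callback would presumably wrap a search call such as `onig_search`. The sketch ignores capture-group references in the replacement and advances by one byte after an empty match, where UTF-8 input would need a whole code point.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Half-open byte range [begin, end) of a match in the subject.
struct Match {
  std::size_t begin;
  std::size_t end;
};

// Hypothetical search primitive: first match starting at or after `from`.
using FindNext = std::function<std::optional<Match>(
    std::string_view subject, std::size_t from)>;

std::string replaceAll(
    std::string_view subject,
    const FindNext& findNext,
    std::string_view replacement) {
  std::string out;
  std::size_t pos = 0;
  while (pos <= subject.size()) {
    const std::optional<Match> m = findNext(subject, pos);
    if (!m) {
      break;
    }
    assert(m->begin >= pos && m->begin <= m->end && m->end <= subject.size());
    out.append(subject.substr(pos, m->begin - pos)); // Unmatched prefix.
    out.append(replacement);
    if (m->begin == m->end) {
      // Empty match: copy one byte and step past it to avoid looping forever.
      if (m->begin < subject.size()) {
        out.push_back(subject[m->begin]);
      }
      pos = m->begin + 1;
    } else {
      pos = m->end;
    }
  }
  if (pos < subject.size()) {
    out.append(subject.substr(pos)); // Tail after the last match.
  }
  return out;
}

// Stand-in matcher for the example: finds a literal string.
FindNext literalFinder(std::string needle) {
  return [needle = std::move(needle)](
             std::string_view s, std::size_t from) -> std::optional<Match> {
    const std::size_t at = s.find(std::string_view(needle), from);
    if (at == std::string_view::npos) {
      return std::nullopt;
    }
    return Match{at, at + needle.size()};
  };
}

void replaceAllExample() {
  assert(replaceAll("a-b-c", literalFinder("-"), "+") == "a+b+c");
  // Empty matches occur between every pair of bytes and at both ends.
  assert(replaceAll("abc", literalFinder(""), "-") == "-a-b-c-");
}
```

The empty-match branch is the part that is easy to get wrong: without it, a pattern that can match the empty string never advances `pos`.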