facebookincubator / velox

A C++ vectorized database acceleration library aimed to optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0
3.38k stars 1.11k forks source link

why velox supports regex function using re2 instead of hyperscan. #9823

Open liuyongvs opened 3 months ago

liuyongvs commented 3 months ago

Description

why velox supports regex function using re2 instead of hyperscan. The hyperscan supports simd https://github.com/intel/hyperscan

References:

mbasmanova commented 3 months ago

@liuyongvs We haven't looked into using hyperscan. Are you see significant performance improvement when replacing RE2 with hyperscan?

CC: @kgpai @rui-mo @PHILO-HE

mbasmanova commented 3 months ago

@liuyongvs I read Hyperscan: Regex Set Scanning with Hyperscan and RE2::Set By Justin Viiret blog post. Based on that post, it doesn't seem like hyperscan is a good fit for implementing regex SQL functions.

One of Hyperscan’s core features is its ability to scan efficiently for sets of regular expressions, compiling them all into an optimized database designed to scan for them all simultaneously.

Existing SQL functions apply a single regex to the input, not a set of regexes. If there were functions that do apply multiple regexes than hyperscan could be an interesting alternative.

Unrelated, but interesting point about the engines:

Both RE2 and Hyperscan are automata-based regex engines, meaning that they do not use backtracking matcher implementations, which is what causes regex engines like those used by [Perl](https://www.perl.org/) and [PCRE](http://pcre.org/) to have exponential time worst-case complexity. The libraries support similar pattern syntax; Hyperscan follows PCRE pattern syntax exactly (although does not implement all the constructs and will fail with an error for unsupported constructs) while RE2 makes some variations to PCRE pattern syntax.
FelixYBW commented 3 months ago

Another concern is the functional coverage. @codyschierbeck has observed re2 patterns and results aren't 100% compatible with Spark. If hyperscan is better on this, we may try.

PHILO-HE commented 3 months ago

The hyperscan community appears to be inactive now. I'm afraid it is no longer under development.

liuyongvs commented 3 months ago

The hyperscan community appears to be inactive now. I'm afraid it is no longer under development.

@PHILO-HE i found clickhouse supports hyperscan by user defined function like multiMatchAny/multiMatchAnyIndex/.. https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions#regexpextract and fork the hyperscan code,which is inactive now https://github.com/VectorCamp/vectorscan

i think @PHILO-HE velox can support it, what do you think?

PHILO-HE commented 3 months ago

@liuyongvs, I'm not sure. Do you have any replying on the above comments from mbasmanova?

BTW, posted a blog that shows performance of different engines. https://rust-leipzig.github.io/regex/2017/03/28/comparison-of-regex-engines/

mbasmanova commented 3 months ago

Another concern is the functional coverage. @codyschierbeck has observed re2 patterns and results aren't 100% compatible with Spark. If hyperscan is better on this, we may try.

@FelixYBW @codyschierbeck You may want to consider https://github.com/kkos/oniguruma for better coverage at the expense of reliability.

FelixYBW commented 3 months ago

Thank you. We may take a look later. Now in Gluten we use a config to enable re2 offload. Gluten user needs to take the risk if they enable.

liuyongvs commented 3 months ago

@mbasmanova @FelixYBW @PHILO-HE i found clickhouse and starrocks also supports hyerpscan to speed

  1. Intel's Hyperscan open-source library has not been updated much lately, and another team has forked Hyperscan to continue maintaining it. The code is available at https://github.com/VectorCamp/vectorscan. ClickHouse is using the forked version called vectorscan.

  2. I researched yesterday and found that both ClickHouse and StarRocks support Hyperscan. ClickHouse supports multiple patterns through custom non-standard functions, adding several functions. For more details, see https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions. image

  3. StarRocks also supports regexp_replace and regex using Hyperscan, but for single patterns. See https://github.com/StarRocks/starrocks/pull/3680. It is said that the performance has improved significantly. image

FelixYBW commented 3 months ago

Thank you, @liuyongvs Feel free to submit a PR to add it. We can test the compatibility to Spark, as well as the perf gain compared to re2 from customer workloads.

FelixYBW commented 3 months ago

@PHILO-HE does vcpkg support hyperscan/vectorscan?

codyschierbeck commented 3 months ago

I've recently finished collecting micro-benchmarking performance metrics for RE2 vs Boost in parse_url, and have code in place to measure this against hyperscan as well. This week I will work on implementing hyperscan with the libraries you guys have provided and compare against RE2 and Boost.

PHILO-HE commented 3 months ago

@PHILO-HE does vcpkg support hyperscan/vectorscan?

@FelixYBW, yes, I note hyperscan is supported in vcpkg, but vectorscan is not. https://github.com/microsoft/vcpkg/tree/261dd6831673d3b07e5118261f7e161a21d1a759/ports/hyperscan

mbasmanova commented 3 months ago

Intel's Hyperscan open-source library has not been updated much lately, and another team has forked Hyperscan to continue maintaining it.

@FelixYBW Binwei, would you know if / why Intel abandoned this library?

mbasmanova commented 3 months ago

Hyperscan is generally vulnerable to regular expression denial of service (ReDoS) attacks (e.g. see (here)[https://www.usenix.org/conference/usenixsecurity22/presentation/turonova], (here)[https://doi.org/10.1007/s10664-021-10033-1] and (here)[https://doi.org/10.1145/3236024.3236027]. Users are adviced to check the provided patterns carefully.

This quote from ClickHouse docs sounds pretty scary. It is not realistic to expect users to "check the provided patterns carefully".

FelixYBW commented 3 months ago

@FelixYBW Binwei, would you know if / why Intel abandoned this library?

Not sure. Looks the main contributor already left.