apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.2k stars 1.21k forks source link

Allow user configurable regex library for queries #13005

Closed itschrispeck closed 2 weeks ago

itschrispeck commented 1 month ago

This PR addresses https://github.com/apache/pinot/issues/12628

This provides functionality to configure a regex library that will be used during query execution.

A new config is added: pinot.server.query.regex.class Valid values for this config are: JAVA_UTIL or RE2J (it might be better to accept class names instead?).

Some testing showed regex libraries have very different strengths/weaknesses, so it seemed best to allow users to choose an implementation that works well for them.

Default behavior is unchanged.

tag: feature

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 93.75000% with 3 lines in your changes are missing coverage. Please review.

Project coverage is 62.20%. Comparing base (59551e4) to head (72cff23). Report is 431 commits behind head on master.

Files Patch % Lines
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% 2 Missing :warning:
...rg/apache/pinot/common/utils/regex/RegexClass.java 83.33% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #13005 +/- ## ============================================ + Coverage 61.75% 62.20% +0.44% + Complexity 207 198 -9 ============================================ Files 2436 2521 +85 Lines 133233 137901 +4668 Branches 20636 21327 +691 ============================================ + Hits 82274 85776 +3502 - Misses 44911 45734 +823 - Partials 6048 6391 +343 ``` | [Flag](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | | |---|---|---| | [custom-integration1](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-0.01%)` | :arrow_down: | | [integration](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-0.01%)` | :arrow_down: | | [integration1](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `<0.01% <0.00%> (-0.01%)` | :arrow_down: | | [integration2](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `0.00% <0.00%> (ø)` | | | [java-11](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.15% <93.75%> (+0.44%)` | :arrow_up: | | [java-21](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.07% <93.75%> (+0.45%)` | :arrow_up: | | [skip-bytebuffers-false](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.18% <93.75%> (+0.43%)` | :arrow_up: | | [skip-bytebuffers-true](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.04% <93.75%> (+34.31%)` | :arrow_up: | | [temurin](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.20% <93.75%> (+0.44%)` | :arrow_up: | | [unittests](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `62.19% <93.75%> (+0.44%)` | :arrow_up: | | [unittests1](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `46.84% <97.82%> (-0.05%)` | :arrow_down: | | [unittests2](https://app.codecov.io/gh/apache/pinot/pull/13005/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `27.77% <0.00%> (+0.04%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

richardstartin commented 1 month ago

Why not just replace the implementation to keep the interface simple and easy to use? If Java's built in regular expressions have the edge somewhere, it's probably a pretty small edge, and it would be nice to avoid backtracking by default.

itschrispeck commented 1 month ago

Why not just replace the implementation to keep the interface simple and easy to use? If Java's built in regular expressions have the edge somewhere, it's probably a pretty small edge, and it would be nice to avoid backtracking by default.

I don't have a good understanding of how Pinot's users use regex, but from searching it seems that the median latency/memory usage of re2j is higher than Java's built in library. It'd also become a backwards incompatible change due to feature gaps. In general I don't think users would be primarily using Pinot for regex filtering (i.e. should we have an optimized path for the average case or focus on worst case performance?), but that's my conjecture.

Some references: https://github.com/DaniilRoman/re2j_test https://github.com/google/re2j/issues/162 https://github.com/google/re2j/issues/12

If there is a consensus that an outright replacement is acceptable I'm happy to go that route.

deemoliu commented 1 month ago

+1 on making this configurable in the first version. And this change, it will be easier for us to benchmark with different library.

There are some known cases re2j doesn't outperform the current implementation, so it's a good idea to on hold the replacement of the existing library.

We can do another thorough analysis on when re2j is more efficient and when it's not, and make a proposal on the default behavior of regex.