Closed itschrispeck closed 2 weeks ago
Attention: Patch coverage is 93.75000%, with 3 lines in your changes missing coverage. Please review.
Project coverage is 62.20%. Comparing base (59551e4) to head (72cff23). Report is 431 commits behind head on master.
Files | Patch % | Lines |
---|---|---|
.../pinot/server/starter/helix/BaseServerStarter.java | 0.00% | 2 Missing :warning: |
...rg/apache/pinot/common/utils/regex/RegexClass.java | 83.33% | 1 Missing :warning: |
Why not just replace the implementation to keep the interface simple and easy to use? If Java's built-in regular expressions have the edge somewhere, it's probably a pretty small edge, and it would be nice to avoid backtracking by default.
I don't have a good understanding of how Pinot's users use regex, but from searching it seems that the median latency and memory usage of re2j are higher than those of Java's built-in library. Replacing the implementation would also be a backward-incompatible change because of feature gaps in re2j. In general I don't think users would primarily be using Pinot for regex filtering (i.e. should we have an optimized path for the average case, or focus on worst-case performance?), but that's my conjecture.
Some references: https://github.com/DaniilRoman/re2j_test https://github.com/google/re2j/issues/162 https://github.com/google/re2j/issues/12
If there is a consensus that an outright replacement is acceptable I'm happy to go that route.
+1 on making this configurable in the first version. With this change, it will also be easier for us to benchmark different libraries.
There are some known cases where re2j doesn't outperform the current implementation, so it's a good idea to hold off on replacing the existing library.
We can do another thorough analysis on when re2j is more efficient and when it's not, and make a proposal on the default behavior of regex.
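As a rough starting point for that kind of analysis, the sketch below times `java.util.regex` on a nested-quantifier pattern against inputs of growing length that cannot match. The class name, the helper `timeMatchNanos`, and the pattern `(a+)+b` are illustrative choices, not part of this PR. How quickly the timings grow depends on the JDK version, so no expected output is given. It needs Java 11+ for `String.repeat`.

```java
import java.util.regex.Pattern;

public class BacktrackingSketch {
  // Hypothetical helper: wall-clock time of a single full-match attempt, in nanoseconds.
  static long timeMatchNanos(Pattern pattern, String input) {
    long start = System.nanoTime();
    pattern.matcher(input).matches();
    return System.nanoTime() - start;
  }

  public static void main(String[] args) {
    // Nested quantifiers are a classic trigger for heavy backtracking
    // in backtracking regex engines when the input cannot match.
    Pattern nested = Pattern.compile("(a+)+b");
    for (int n = 10; n <= 22; n += 4) {
      String input = "a".repeat(n); // no trailing 'b', so the match must fail
      long micros = timeMatchNanos(nested, input) / 1_000;
      System.out.printf("n=%d took %d us%n", n, micros);
    }
  }
}
```

A single timed call like this is only indicative. A real comparison would want warm-up iterations and a harness such as JMH, run against both libraries with patterns taken from actual queries.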
This PR addresses https://github.com/apache/pinot/issues/12628
This provides functionality to configure a regex library that will be used during query execution.
- `PatternFactory.compile(...)` is used in place of `Pattern.compile(...)`
- The added `Pattern` interface maintains the same semantics as `java.util.regex.Pattern`
- The added `Matcher` interface maintains the same semantics as `java.util.regex.Matcher`
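To make the shape of this abstraction concrete, here is a minimal sketch of a library-neutral `Pattern`/`Matcher` pair with a `java.util.regex` adapter behind a factory. Only the names `PatternFactory`, `Pattern`, and `Matcher` come from this PR. The method subset, the `JavaUtilPattern` adapter, and the nesting inside one class are assumptions made to keep the example self-contained, and the actual signatures in the PR may differ.

```java
public class RegexFactorySketch {
  /** Library-neutral matcher; mirrors a small subset of java.util.regex.Matcher (assumed subset). */
  interface Matcher {
    boolean matches();

    boolean find();
  }

  /** Library-neutral compiled pattern (assumed shape). */
  interface Pattern {
    Matcher matcher(CharSequence input);
  }

  /** Hypothetical adapter that delegates to java.util.regex. */
  static final class JavaUtilPattern implements Pattern {
    private final java.util.regex.Pattern _pattern;

    JavaUtilPattern(String regex) {
      _pattern = java.util.regex.Pattern.compile(regex);
    }

    @Override
    public Matcher matcher(CharSequence input) {
      java.util.regex.Matcher delegate = _pattern.matcher(input);
      return new Matcher() {
        @Override
        public boolean matches() {
          return delegate.matches();
        }

        @Override
        public boolean find() {
          return delegate.find();
        }
      };
    }
  }

  /** Call sites use this instead of java.util.regex.Pattern.compile(...). */
  static final class PatternFactory {
    private PatternFactory() {
    }

    static Pattern compile(String regex) {
      // An RE2J-backed adapter would be selected here based on configuration.
      return new JavaUtilPattern(regex);
    }
  }

  public static void main(String[] args) {
    Pattern pattern = PatternFactory.compile("pinot-[0-9]+");
    System.out.println(pattern.matcher("pinot-42").matches());      // prints "true"
    System.out.println(pattern.matcher("server pinot-7 up").find()); // prints "true"
  }
}
```

Because call sites depend only on the two interfaces, swapping the backing library becomes a change inside the factory rather than at every usage.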
A new config is added: `pinot.server.query.regex.class`. Valid values for this config are `JAVA_UTIL` or `RE2J` (it might be better to accept class names instead?).

Some testing showed that regex libraries have very different strengths and weaknesses, so it seemed best to allow users to choose an implementation that works well for them.
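A minimal sketch of how such a config value could be resolved to a library choice, falling back to `JAVA_UTIL` when the key is unset so that default behavior stays unchanged. The config key and the two values come from this PR. The `resolve` method, the use of `java.util.Properties`, and the case-insensitive parsing are assumptions for illustration and not how Pinot necessarily reads its server config.

```java
import java.util.Locale;
import java.util.Properties;

public class RegexConfigSketch {
  static final String REGEX_CLASS_KEY = "pinot.server.query.regex.class";

  /** The two values named in the PR description. */
  enum RegexClass {
    JAVA_UTIL, RE2J
  }

  /**
   * Hypothetical resolver: returns JAVA_UTIL when the key is missing or blank,
   * otherwise parses the value. Unknown values throw IllegalArgumentException.
   */
  static RegexClass resolve(Properties props) {
    String value = props.getProperty(REGEX_CLASS_KEY);
    if (value == null || value.trim().isEmpty()) {
      return RegexClass.JAVA_UTIL;
    }
    // Case normalization is an assumption; the real config may be case-sensitive.
    return RegexClass.valueOf(value.trim().toUpperCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    System.out.println(resolve(props)); // prints "JAVA_UTIL"

    props.setProperty(REGEX_CLASS_KEY, "RE2J");
    System.out.println(resolve(props)); // prints "RE2J"
  }
}
```

Accepting fully qualified class names instead, as the description suggests, would trade this closed enum for reflection-based loading and let users plug in libraries the project does not bundle.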
Default behavior is unchanged.
tag: `feature`