This PR resolves several problems described in issue #24788. It is focused on fixing bugs, but fixing these bugs can change program behavior. Here are the user-facing impacts:
Chapel's use of RE2 was inadvertently activating longest match mode for most patterns. This PR fixes this; longest match mode is now only activated by posix mode, which matches the C++ RE2 Posix "canned option".
nonGreedy was not implemented correctly. Previously, it disabled longest match; it should only affect whether patterns like x* prefer more or fewer repetitions.
multiLine was not implemented correctly. Previously, it had no effect, unless using posix mode.
RE2's posix mode defaults to multiLine mode, but the initializer arguments did not convey that. This PR adjusts it to have multiLine=posix as a default.
Improved the documentation for regex.init to clarify several behaviors.
Improved the documentation for several regex replacement methods to describe the supported replacement sequences, such as \1.
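As a rough illustration of the three behaviors discussed above, here is a small sketch using Python's built-in `re` module. This is only an analogy: Python's `re` is not RE2 and is not the Chapel Regex module, it has no `(?U)` flag (so the lazy quantifier `x*?` stands in for nonGreedy), and it has no longest-match mode. The patterns and inputs are made up for the example.

```python
import re

# Leftmost-first matching (what non-posix mode should give): the first
# alternative that matches wins, even though a longer match exists.
# A longest-match (POSIX-style) engine would prefer "ab" here instead.
first_match = re.search(r"a|ab", "ab").group()    # "a"

# Greedy vs. non-greedy only changes how many repetitions x* prefers.
greedy = re.search(r"x*", "xxx").group()          # "xxx"
lazy = re.search(r"x*?", "xxx").group()           # "" (zero repetitions)

# Without multi-line mode, ^ and $ anchor to the whole text only.
single_line = re.findall(r"^b$", "a\nb\nc")       # []
# With (?m), ^ and $ also match at line boundaries.
multi_line = re.findall(r"(?m)^b$", "a\nb\nc")    # ["b"]
```

The point of the sketch is that these are three independent choices; the bug fixed here was that they were entangled.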
Implementation details:
Adds (?U) and/or (?m) to the front of a pattern to implement nonGreedy or multiLine.
The struct re_t now includes flags recording which options were set and an integer recording how many pattern bytes implement those options. The flags are easy to compare in the code that checks whether we already have a compiled regex.
Additionally, tidied up local_cache_get to remove repeated access to the same arrays.
qio_regex_borrow_pattern now uses the number of bytes implementing flags, so that when casting a regex to string we can leave out the (?U) and/or (?m) that the implementation added.
While there, I noticed some inconsistent ways in which the void* is cast. I changed it to consistently cast to a re_t*. (Previously, sometimes the void* was cast to RE2*, which worked because the RE2 field is the first in re_t, but this seems needlessly confusing.)
Adjusted a few tests that were relying upon longest-match behavior.
Added new tests for each of the regex initializer flags.
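The prefix-and-strip idea described above can be sketched in a few lines of Python. This is not the runtime code (which is C++ around RE2). The names `CompiledPattern`, `build_pattern`, `borrow_pattern`, and `same_regex` are hypothetical, and the order in which the two prefixes are emitted is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledPattern:
    """Hypothetical stand-in for re_t: pattern text plus flag bookkeeping."""
    full_pattern: str      # pattern as handed to the engine, prefixes included
    non_greedy: bool       # which options were requested
    multi_line: bool
    num_flag_bytes: int    # how many leading bytes implement the options


def build_pattern(pattern, non_greedy=False, multi_line=False):
    # Prepend inline flags; the (?U)-then-(?m) order is assumed here.
    prefix = ""
    if non_greedy:
        prefix += "(?U)"
    if multi_line:
        prefix += "(?m)"
    return CompiledPattern(prefix + pattern, non_greedy, multi_line,
                           len(prefix.encode("utf-8")))


def borrow_pattern(compiled):
    # Skip the bytes that implement flags so the user sees their own pattern.
    raw = compiled.full_pattern.encode("utf-8")
    return raw[compiled.num_flag_bytes:].decode("utf-8")


def same_regex(compiled, pattern, non_greedy, multi_line):
    # Cache check: compare the stored flags, then the user-visible pattern.
    return (compiled.non_greedy == non_greedy
            and compiled.multi_line == multi_line
            and borrow_pattern(compiled) == pattern)


both = build_pattern("x*", non_greedy=True, multi_line=True)
plain = build_pattern("x*")
```

Storing the byte count rather than re-parsing the prefix means a user pattern that itself begins with `(?m)` is never mistaken for an implementation-added flag.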
For issue #24788.
Reviewed by @ShreyasKhandekar - thanks!