Add PCRE2_EXTRA_VANILLA_SYNTAX to disable PCRE2 extensions

NWilson commented 2 months ago

!!!

BIKESHEDDING / ARGUMENT WARNING

!!!

This PR is harmless in principle, but I expect people will have different opinions on the details.

Personally, I don't mind much.

Goal

Add an PCRE2_EXTRA_VANILLA_SYNTAX to remove syntax which is specific to PCRE2. It will be treated as if PCRE2 didn't even parse the syntax, giving the same syntax errors you'd get from (say) Perl.

The goal here is to remove things which are invented by PCRE2 - not because it's bad (I'm not passing judgement), but simply so that users who want a more "vanilla" syntax can do so.

Things that are in .NET, Ruby (oniguruma), Perl, ... are all OK, because they're not invented by PCRE2. So this option isn't bringing PCRE2 into alignment with Perl (which would be stricter than this PR). I'm merely restricting it from offering syntax that's not supported by any other engine.

First draft includes:

The pre-pattern syntax such as (*NOTEMPTY)
PCRE2-specific inline pattern options such as (?aX)
PCRE2 version detection such as (?(VERSION>=10.4)yes|no)
Callouts such as (?C0)
scan_substring
non_atomic_positive_look(ahead|behind)

NWilson commented 2 months ago

I'm considering throwing in a change to pcre2_substitute() under the same flag - unless people think it should have its own flag? (Probably it's too minor to care about getting its own flag.)

Proposed:

Change pcre2_substitute() with PCRE2_EXTRA_VANILLA_SYNTAX so that named capture groups referenced as $name are errors (NB: $1 and ${name} will be unaffected). This is the only syntax I could find in pcre2_substitute() which is completely unique to PCRE2.

The $name syntax also unfortunate from a backwards-compat point of view, since it appears to be (perhaps?) the only place where a group name is used without delimiters. This means that upgrading the Unicode version (and hence extending the Letters class) introduces an absolutely minuscule possibility of changing the behaviour, if someone has a replacement of the form $nameFOO where FOO is some character that Unicode want to add to the Letters class.

zherczeg commented 2 months ago

I am, in general, don't support this direction for various reasons:

This is not useful for users, they usually prefer more features than less
It will not achieve any compatibility with other engines. People can still create patters, which behave differently in other engines
Maintenance burden: we will keep forgetting to update this flag. Both when something become common, or we add something new

I would only support this, if all other engines would have the same feature, and there would be a standard body (even if it is informal), which decides the features in this list.

Another idea: there could be a tool, which parses patterns, and warn about the differences between engines. This would offer more help for people, rather than randomly failing parsing.

PhilipHazel commented 2 months ago

Like Zoltan, I am not keen The engines are constantly evolving and trying to decide what is "common" and what is not, and keeping it up to date sounds like a lot of effort for not much gain. What do we do if, in future, Perl adds non-atomic positive lookaheads? In the past, PCRE implemented recursion before Perl did (don't know if it preceded any other engines), and I think it had capture group names using the Python syntax (?P before Perl added them. If somebody has the time, then I think users would much more appreciate up-to-date documentation in exactly what each engine supports, but maintaining such a list could be very time consuming.

NWilson commented 2 months ago

OK. That's quite fair.

I'll let this PR hang out here for a while before I close it.

The motivation was: what if you want to expose regexes in a product (such as Excel). BUT you want to lock down the API surface to something that's really solid, and could be implemented by third parties even. So we'd offer regex features that are widely supported across multiple engines, and work consistently, such that the regex dialect could be supported on ~any backend via syntax translation. If we were to expose the entirety of PCRE2, then we would force other implementors (within Microsoft!) to use exactly the same regex library if they want to achieve compatibility.

I guess you'd suggest instead what I've been arguing for internally: build a pre-parser that sits in front of PCRE2, so that the application shipping the feature "owns" the syntax that it considers valid, rather than rely on the library to cut itself down to a common/universal featureset.

Somewhat to my surprise however, there was a decision made by Product Managers, that they didn't want to ship a cut-down syntax unless unless it was a supported configuration for PCRE2. (This seemed strange to me.) The likelihood is that our application will ship, with the entirety of PCRE2's custom dialect exposed to customers. This will lock us in to shipping PCRE2 as the only backend that can support this regex dialact, for the entire many-decades-long expected lifetime of this feature. (Bear in mind that Excel is 40 years old, and new features have to plan for decades of maintenance with zero backwards compatibility regressions.)

So, I thought it was at least worth seeing if you'd take this patch!

zherczeg commented 2 months ago

I understand the motivation, and why this is useful for you. Actually I have a story for you: c++ interface for pcre1. Google wanted to land a large piece of code, and it was accepted. Then they disappeared, and nobody wanted to maintain their code. Then a few people complained that the code does not work, the interface is not complete, etc. It was just a lot of burden, with no gain. As for pcre2, we simply dropped the whole thing.

It is good to think 10 years ahead, and we also do it from our perspective. This feature is mostly needed for you (a single entity), and not a generic interest. Then, who will maintain this change 10 years later? It is not necessary bad, if company interest appear in the code, but we also need to see that the support is long term, and the project also benefits from it.

NWilson commented 2 months ago

Thank you, I understand! I'm very sympathetic, and I wish that we were more willing at Microsoft to maintain our "universal regex dialect" internally.

carenas commented 2 months ago

One thing I would like to mention is that you are likely going to be one of the main users of the pcre2_substitute() API, and at least for changes in there that should give you a lot of leverage to make sure your needs are supported (if they are a little bit more narrow and had visible use cases outside of yours).

For example, I DO see how we could add a flag that would reject $name in the replace string, just like we have done with \0 through an extended option and that would seem to make sense outside of your specific needs, while also meeting your internal requirements to have that upstream.

Similarly, callouts can be disabled already at build time, because they don't make sense for all users.

NWilson commented 1 month ago

Thank you all for your comments. I will close this PR, since you're right it's not suitable for PCRE2.

PCRE2Project / pcre2

Add PCRE2_EXTRA_VANILLA_SYNTAX to disable PCRE2 extensions #479

Goal