PCRE2Project / pcre2

PCRE2 development is now based here.
Other
922 stars 194 forks source link

Feature Request: two new options for implicit (*SKIP) #268

Open iwino opened 1 year ago

iwino commented 1 year ago

Consider a pattern that can be partitioned into a series of atomic sub-patterns such that in between any two adjacent sub-patterns there is no sub-pattern in between them at all:

<some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern>

(Any of the aforementioned atomic sub-pattern may be composed of either atomic or non-atomic sub-sub-patterns, but the sub-pattern itself is atomic, e.g. look-aheads, look-behinds, subroutines, atomic groups, possessively quantified sub-patterns etc.)

If an atomic sub-pattern fails, the entire pattern fails (because, again, the pattern is a series of adjacent atomic sub-patterns).

When it is appropriate (i.e. it is correct and does not fail to match any valid sub-string and does not match any invalid sub-string), the backtracking control (*SKIP) can be inserted in between any two adjacent atomic sub-patterns to optimise the pattern.

However, doing so can be cumbersome and can bloat up the pattern easily if the pattern has quite a few atomic sub-patterns.

For example,

<some atomic sub-pattern>(*SKIP)<some atomic sub-pattern>(*SKIP)<some atomic sub-pattern>(*SKIP)<some atomic sub-pattern>(*SKIP)<some atomic sub-pattern>(*SKIP)<some atomic sub-pattern>(*SKIP)<some atomic sub-pattern>

Feature request 1

It would be nice to have an option, say (?AS), that implicitly inserts (*SKIP) before every atomic sub-pattern (including atomic sub-sub-patterns within the sub-pattern itself if any):

(?AS)<some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern><some atomic sub-pattern>

Feature request 2

It would be nice too to have an "extreme" version of similar option, say (?ES), that implicitly causes (*SKIP) to trigger at WHEREVER position in the subject string an atomic sub-pattern fails.


N.B. I'm requesting these features for the "NFA algorithm / depth-first search of the pattern tree".

PhilipHazel commented 1 year ago

This seems to be a very special-case.

zherczeg commented 1 year ago

PCRE2 is a generic engine. There are other projects which can be used to rewrite patterns such as https://github.com/zherczeg/repan You can implement a feature in those projects to do your request.

iwino commented 1 year ago

@PhilipHazel I don't know how "very special case" it is, but then there are other features, such as (?R) and (*COMMIT), that are suitable for "very special cases" too.

I think enabling more opportunities to minimise how far the position in the subject string is moved back whenever a sub-pattern has failed to match is always good to have for optimisation (as long as the correctness is maintained).

iwino commented 1 year ago

@zherczeg While "feature request 1" is something that can be obtained by analysing a pattern and optimising it, "feature request 2" cannot be done without proper support from the regex engine.

Also, simply activating an option instead of having to rely on another program just to do "implicit (*SKIP) " is a lot more convenient.

carenas commented 1 year ago

simply activating an option instead of having to rely on another program just to do "implicit (*SKIP) " is a lot more convenient.

not sure what you mean by "convenient" here without context, but if you are setting a pcre2 flag then you are most likely doing so with a program that calls the corresponding API call, and therefore you might as well do something similar with your regex.

also worth mentioning that the "regex language" is very simple and not meant to rely on an optimizer compiler, so the expression is usually meant to be optimized before sending it to pcre2_compile(), so Zoltan's suggestion was on that line and might mean that the tool that uses repan might not even be ever linked into your final program.

eitherway, looking forward to see some code to back this feature request.

iwino commented 1 year ago

not sure what you mean by "convenient" here without context, but if you are setting a pcre2 flag then you are most likely doing so with a program that calls the corresponding API call, and therefore you might as well do something similar with your regex.

I think you've possibly misunderstood me. In PCRE2, we can specify options from within the pattern itself, using syntactical features such as (?x), (?J), etc. So, I'm not really asking for changes to the API, and I'm certainly not talking about passing flags into API calls. I'm not even talking about API at all. I'm asking for some new additions to the PCRE2 language syntax itself, e.g. the (?AS) and (?ES) made-up options that I mentioned.

The "optimisation" that I'm interested in is the sort of optimisation that can be performed manually when writing a PCRE2 pattern, without having to rely on any program at all. Just like a user can potentially optimise a pattern by employing backtracking controls such as (*SKIP), (*COMMIT), (*THEN), etc, I just want new options that can either implicitly add (*SKIP) to the pattern or implicitly trigger (*SKIP) at any position in the subject string wherever an atomic sub-pattern fails. Really, what I'm requesting is to take (*SKIP) and crank it up to 11.

zherczeg commented 1 year ago

Using an independent tool has several advantages. You can do it at compilation time, and use the generated pattern later. You can use it with other engines. And pcre2 remain simple, focusing on its actual job.

Anyway it seems you need a generator for nfa anyway, so it can add the control verbs as well.

PhilipHazel commented 1 year ago

Adding (*SKIP) as you suggest changes the meaning of the pattern, so I do not think it can be called an "optimisation". Consider this example:

PCRE2 version 10.42 2022-12-11 /(?>.{0,4})(?>\d+)/ ABCDE123XYZ 0: BCDE123

/(?>.{0,4})(*SKIP)(?>\d+)/ ABCDE123XYZ No match

iwino commented 1 year ago

@PhilipHazel Of course, all backtracking controls have to be judicially applied as appropriate as they can affect correctness. This also applies to anything that is atomic. Likewise, "implicit (*SKIP)" would not be different.

One use case where (*SKIP) is useful is to find a tag in a very bloated XML-like file or in multiple bloated XML-like files. (I'm talking about XML's that can be as big as tens of MB per file with hundreds of thousands of tags if not millions).

Consider,

<(?>span\b(?>[^>]*?\bid=)"foo"|[^<]*+(*SKIP)(*F))

This exploits the fact that literal < can only be used as a tag opener (because for any other usage, the character must be encoded as &#60;).

It's such a wasteful and meaningless computation when the left alternation fails and the position in the subject string is moved all the way back to just after the < in the subject string before re-consuming the very same tag by [^<]*+ before (*SKIP) triggers.

"Feature request 2" would be really nice right here:

(?ES)[^<]*+<span\b(?>[^>]*?\bid=)"foo"

There's no equivalent PCRE2 pattern for "feature request 2", (?ES) but, say, the atomic group (?>[^>]*?\bid=) fails, and what (?ES) would do is that, instead of moving the position in the subject string all the way back to where the current match attempt started, it would not move the position at all, which in this case would be to the left of the > character that closes the tag, and then the next match attempt would begin at THAT position.

This is just a simple use case. I've had to do much more complicated stuff.