Implement the new set-based character classes

NWilson commented 1 month ago

This is a mostly-complete implementation!

It needs some tidy-up, in places where I've left "XXX" comments. But I've hammered out the code and I'll start testing it.

I'm afraid I used recursive descent in the operator-precedence parser, but it only goes down to a maximum depth of 15 levels, so it should be OK. I just wanted to prototype quickly. I don't know "how much stack is too much" in PCRE2 - it's only pushing ~hundreds of bytes on the stack at most.

The two big bits of logic that need to be shared with existing code, are the OP_XCLASS interpreter code, and the compiler code to build the OP_CLASS/NCLASS/XCLASS code. The current code is inlined into the one place where it's used.

Anyway, this is a draft of where it's going.

zherczeg commented 1 month ago

Pattern (bracket) parsing is already recursive. Although you can also a use a simple array for simulating recursion. I suspect a pointer in the META code is enough to track the status.

PhilipHazel commented 1 month ago

I do know that there are applications running PCRE2 in a large number of threads, each of which has very limited stack, but I don't know how limited. One would hope a couple of hundred bytes would be OK. I suppose we do need a new option to turn this on because changes the meaning of things, but it seems a shame to use one of the only two remaining main options bits. If/when the Perl syntax is implemented, it shouldn't need to be enabled.

I got a compile error when trying to build your branch:

src/pcre2_compile.c:6110:28: error: passing argument 1 of 'compile_class_nested' from incompatible pointer type [-Wincompatible-pointer-types]
 6110 |       compile_class_nested(&pptr, &code);
      |                            ^~~~~
      |                            |
      |                            uint32_t ** {aka unsigned int **}
src/pcre2_compile.c:2669:39: note: expected 'const uint32_t **' {aka 'const unsigned int **'} but argument is of type 'uint32_t **' {aka 'unsigned int **'}
 2669 | compile_class_nested(const uint32_t **pptr, PCRE2_UCHAR **pcode)

I fixed this by inserting the obvious cast, but is that right? Haven't done any testing yet. I see the CI tests are failing; probably the same thing.

NWilson commented 1 month ago

I'm expecting that the Perl syntax will be able to reuse the same interpreter support and OPcodes.

I won't tackle implementing the parser & compiler for the Perl classes in this PR though, there's enough in it already. I agree it should be OK to add in a follow-on PR, and it wouldn't need any flags.

I imagined that it might be worth spending a "normal" parse bit on this flag? This flag implements a behaviour that's in JavaScript/Python/Rust and other engines, so it's more mainstream than all the other minor things I've recently added to the extra options. But you're quite possibly right, those bits are too precious now to spend on non-Perl-related syntax extensions.

PhilipHazel commented 1 month ago

How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:

PCRPCRE2 version 10.45-DEV 2024-06-09 (8-bit)
/[[a-c] && [b-d]]/set_class
Failed: error 189 at offset 16: internal error: unknown code in parsed pattern

This happens at line 6441 in pcre2_compile.c.

PhilipHazel commented 1 month ago

Oops sorry, I told a lie. It's around line 8682 (didn't look for multiple occurrences of this error).

PhilipHazel commented 1 month ago

Thinking about the name PCRE2_SET_CLASS ... I assume that some time in the future, the Perl syntax will always be recognized, so what this option is enabling is an alternative syntax for what Perl calls "extended bracketed character classes". So perhaps a name like PCRE2_ALT_EXTENDED_CLASS is more descriptive. This is the same style as other PCRE2_ALT_xxx options.

NWilson commented 1 month ago

How ready is this for testing? I added the obvious to pcre2test to set the flag, but my first test failed:

Not quite ready. It's just for getting feedback, if anyone's interested in looking (thanks for trying!).

I'm happy to rename from "set[-based] class" to "extended class". That will be easy enough to do.

NWilson commented 1 month ago

Simple testcases now pass, such as [abc--b].

The one thing that's broken (that I know of) is that you can only include single characters like 'a', not ranges like 'a-z' nor properties like \p{..} or \d. I still need to pull out the delicate code for doing that, into a reusable chunk I can call.

NWilson commented 1 month ago

I've added docs.

I'm down to these tasks:

[x] JIT support - is that needed for this PR, or can it come in a future one?
[x] Need a lot more tests, I'm currently writing them. Will be done in a day or two
[x] One chunk of code (marked by a big "XXX") needs to be pulled out into a separate function.
[x] I'll move all the code into pcre2_compile_class.c as requested

zherczeg commented 1 month ago

Some random comments.

I don't see a negate operation (can be implemented as ^ 0x1). For example, it is needed for [^\PL-\x{100}-\x{200}].

There is no point to duplicate bitsets (for character 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.

The parser can be implemented with using a simple stack in the parser context (e.g. 32 uint32_t values, which contains the offsets of the currently active operators). This way the recursion does not require any recursion.

There should be tests for reaching the maximum allowed depth. They should throw parse errors.

Is [&&] allowed? Is it simply matches to nothing? Does [[]&&[]] depends on allow empty classes flag? Btw how "allow empty classes" affects the parsing?

There should be optimizations for ||. [[[\pS]||]||[[[\pL]]]] is just [\pS\pL]. This is easy to do during parsing.

Personally I would just implement [] and || at the first stage, do all the optimizations, parse errors, discuss all the minor details. This only requires parser modifications, so you don't need to worry about matching.

NWilson commented 1 month ago

Thank you @zherczeg for reviewing!

I don't see a negate operation (can be implemented as ^ 0x1). For example, it is needed for [^\PL-\x{100}-\x{200}].

It was there in the PR, it's OP_ECLASS_NOT. There isn't a META for it, because it's encoded as META_CLASS / META_CLASS_NOT distinction. But I had implemented the interpreter support.

There is no point to duplicate bitsets (for character 0..255). If one xclass has a bitset, the full bitset should be computed for the entire class. It is faster to evaluate and requires less space.

True. I haven't done this optimisation.

The parser can be implemented with using a simple stack in the parser context (e.g. 32 uint32_t values, which contains the offsets of the currently active operators). This way the recursion does not require any recursion.

Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting-yard thing... but it would be more code than recursive descent.

Note that the META parser (the one that consumes characters and produces META) is just a simple for-loop, with an int depth state. It's the parser that consumes the META and produces OPs which has to parse the recursive grammar.

The grammar is certainly recursive - the question is simply whether (given that it has a fixed nesting depth of 15) we need to reduce the stack consumption, for embedded devices.

I can check how much stack space it actually uses. Currently it looks like it chews up around 100 bytes per recursion, in a 64-bit Release build, so that comes out to a couple of KiB of stack space consumed. The shunting-yarn approach would cut that down a bit, but I'm really not sure it's worth it.

There should be tests for reaching the maximum allowed depth. They should throw parse errors.

Added. Lots more tests added.

Is [&&] allowed? Is it simply matches to nothing? Does [[]&&[]] depends on allow empty classes flag? Btw how "allow empty classes" affects the parsing?

Good question. [&&] is not allowed: an operator must have something non-empty on either side.

The [[]&&[]] case does indeed depend on allow_empty_class. That flag affects not just the outermost [...], but all of the nested classes opened by a '[' character. The allow_empty_class flag affects the inner nested classes in the same way as an outermost class: if the first literal is a ']', then the flag determines whether it's a literal or a class close ']'.

There should be optimizations for ||. [[[\pS]||]||[[[\pL]]]] is just [\pS\pL]. This is easy to do during parsing.

That would be nice... but I'm so far totally focussed on completeness and correctness. I'd rather get the PR merged, than hold it up for edge-case optimisations.

Personally I would just implement [] and || at the first stage, do all the optimizations, parse errors, discuss all the minor details. This only requires parser modifications, so you don't need to worry about matching.

Even with just ||, you need changes to the matcher: [a[^b]] is pretty simple, but requires runtime op support.

NWilson commented 1 month ago

Updated task list:

Could add some more tests, there's a handful of missing cases. But it's somewhat thoroughly exercised already.
Still no JIT support. Unsure whether I need to do this to get it merged.
And certainly no (?[...]) support. That can come in a follow-up PR, soon after
I haven't implemented the special-cased support for [ab]+ (following quantifier) that XCLASS has. I guess I could do that for ECLASS too, although it works fine without.
Still a question of whether it should be parsable without any recursion, to save a KiB of stack space

NWilson commented 1 month ago

Also - regarding Zoltan's example of [[[\pS]||]||[[[\pL]]]]. The [thing||] syntax is not legal. And the extra brackets [[[thing]]] syntax is already optimised away (compiles to the same OPs as [thing] automatically).

PhilipHazel commented 1 month ago

Still no JIT support. Unsure whether I need to do this to get it merged.

Merging without JIT support should be possible. Because you have added new opcodes, a call to pcre2_jit_compile() will fail, and matching should fall back to the interpreter. There are still, I think, one or two other constructions that JIT does not support.

zherczeg commented 1 month ago

Hmm. Can it really? We have a grammar with two operators, with different precedence levels. I can code up a simple shunting- yard thing... but it would be more code than recursive descent.

It is usually easier for optimizations, since it is easy to drop operators, and rewrite the META stream. This is hard for actual recursion, since you cannot drop functions form the function stack. Not in C at least.

Example: [[[a-f]]||[i-k||m-o]]

PUSH [ -> add OP_???CLASS to meta stream, push a pointer to the token onto the pointer (or unsigned offset) stack PUSH [ -> similar to before, adds data to the META stream, and a pointer to it in the stack PUSH [ COPY a-f -> just adds data to the META stream, pushes nothing Now a ] comes, we need to find what happens. Since we have an actual class on the stack ([a-f), we terminate it. PUSH ], and drop '[' from the stack. Now another ] comes, and we see, that a [ on the stack, which is followed by a class in the META stream. Since that class is NOT pushed onto the stack, it is a full class literal. Then we delete the OP???_CLASS form the META stream by a memove to the offset of it (it is known in the stack). We don't add anything to the META stream, and pop the top element of the stack. Essentially [[a-f]] is converted to [a-f]

Then a || comes, and we check what is on the top of the stack. It is a '[', and it followed by a class which is not pushed onto the stack. We again delete the [] part with a memove, and the pattern is basically turned to an [a-f||...in the META stream. PUSH || -> we don't know how the pattern will turn out, so we need to push the operator. PUSH [ COPY i-k PUSH || COPY m-o

And now the interesting part comes, since we encounter an ']'. The top item is ||, and if we go back one item in the stack, and it is || or '[', then we can delete the || from the stack, which turns [i-k||m-o] to a i-km-o. We don't push anything to the stack.

Then another ']' comes, and the [a-f||i-km-o] is turned to [a-fi-km-o]. The topmost [] is never removed.

Adding precedence to operators is not hard, and you need to handle ^ as well, since it inverts a sub group. Tracking the current stack depth is also possible (cannot be > 32). This is why I said that supporting [ and || should be the first step, and then you have a strong foundation to build things further. If you do it in the opposite way, you might need to rewrite the entire algorithm, and that is a lot of effort.

The challenging part is that the META stream size is limited, and perl has single character operators, so we need to encode everything into one uint32_t value.

Btw [a&&bc] is [a&&[b]c] or [a&&[bc]]?

NWilson commented 1 month ago

Well, as I said, "I can code up a simple shunting-yard thing... but it would be more code than recursive descent."

You are describing the classic shunting-yard thing... and it really is more code, and more complexity.

At the moment, I have implemented everything (^, ||, &&, and --), and it's both robust, and correct. So I don't see it as a bad foundation at all. I could rewrite the second parsing phase, but in my opinion the only reason to do so would be to save a tiny bit of stack space.

I have implemented most of the optimisations you describe. I haven't implemented merging [[a-f][i-k]] into a single class, which would be a bit fiddly, but I guess could be possible.

Note also that the || is very unlikely to be used in practice. It's simply there for completeness. JavaScript chose not to add it, but Rust and Python regex module did. I don't think it's worth doing any special optimisations for it.

Btw [a&&bc] is [a&&[b]c] or [a&&[bc]]?

[a&&bc] is [a&&[bc]] (also [a&&[b]c] is [a&&[ [b]c ]]). The ^ character is not an operator, it's a modifier on nested classes to make them into "yes-classes" or "not-classes". The implicit-union operator (juxtaposition) has tightest precedence (implicit union is any juxtaposition of atoms like single characters or ranges, and also nested classes; for example, [[a-b][^c-e]]). The explicit operators && || -- have looser precedence (and undefined precedence relative to each other - you aren't allowed to mix them).

I'm not at all opposed to changing the parser. I just want to be clear on the goal of the improvement.

PhilipHazel commented 1 month ago

If it's all implemented and working, I think we (that is, you :-) should proceed and get on with what is left to be done. Internal algorithms can always be changed later if that proves to be beneficial.

NWilson commented 4 weeks ago

Great! I am pretty-much done here.

I can see why Zoltan wants more optimisations - that's his role in life, adding a JIT after all!

I'm warming up to the idea of doing some extra work to satisfy that.

The branch is ready for review; I'll only be adding some quite local changes in the parser if I do that extra work.

The stuff I don't understand well enough:

changes in pcre2_study, and quantification, and the interpreter...
particularly auto-possessification
the "reqcu" flags
and other PCRE-specific stuff, where I just don't have the background, and I may have goofed.

The I think I do understand:

I'm confident with my parsing and matching functions. They're pure, side-effect-free things and just do what it says on the tin.

PhilipHazel commented 4 weeks ago

I see there is still one failing CI test.

NWilson commented 4 weeks ago

For me these should work like a usual expression evaluator, with unary/binary operators and precedence. The primary expressions should be boolean values, [] can be used for grouping.

That is exactly how it works, I think. I didn't do anything "unusual" on purpose. There are primary expressions (atoms) which are characters/ranges/POSIX-props/Unicode-props, and there are binary operators (no unary), and [] can be used for grouping. Just as you say.

Very classic recursive descent with precedence, producing a RPN sequence of operators.

I see there is still one failing CI test.

Yes, it's weird. Moving some code around triggers the Werror=maybe-uninitialized warning in alpine's version of GCC, although I didn't really touch that code. It's full of gotos; I guess the detector is a bit sensitive.

PhilipHazel commented 4 weeks ago

Is this a bug?

/[abc -- b]+/alt_extended_class
    acacbac
 0: ac

I expected it to match "acac".

PhilipHazel commented 4 weeks ago

Re my previous comment: under DFA matching it does match "acac".

PhilipHazel commented 4 weeks ago

Another example where DFA does what I expect, but non-DFA doesn't. May be same bug as before.

/[\pL || [:digit:]]+/alt_extended_class
    +a03+
No match
    +a03+\=dfa
 0: a03

NWilson commented 4 weeks ago

Yes, it is a bug. I found a similar case earlier this afternoon while adding more tests of my own.

Very odd, something do with the fact that I changed OP_ECLASS to take part in auto-possessification, in the same way that OP_XCLASS does. I introduced the regression at that point.

It turns out to be a small one-line copy-and-paste error, when I copied a bit of code from XCLASS to the ECLASS case, but changed a break into a return (oops!).

(Anyway, all of Philip's cases are now fixed.)

NWilson commented 4 weeks ago

I'm done with this!!

Whew. I'm not aware of anything broken, and it's feature-complete, as far as the ALT_EXTENDED_CLASS flag goes.

I've filed tickets for the follow-up work, so it's not forgotten (optimizations; and (?[...]) support), and I'll commit to doing it beginning next week.

I've either addressed all of Zoltan and Carlo's feedback so far, or asked for the points raised to be deferred.

Many, many thanks for all the help reviewing and testing this!

PhilipHazel commented 4 weeks ago

Concerning the documentation: Many thanks for what you have done so far, Nick. Once the Perl syntax is also implemented, I was thinking that it might be helpful to have an overall section called "Extended character classes" that explains that there is Perl syntax and everybody else's syntax and then have two separate sections, one for Perl and one for the other. I am happy to work on the documentation when the time comes.

Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ?

carenas commented 3 weeks ago

do we need to add (*ALT_EXTENDED_CLASS) ?

I actually had a patch implementing (?X), but didn't send it because the "incompatibilities" between the old syntax and the Unicode one are so minimal that I think it would be better not to have one, and wait to see what happens after it gets released.

NWilson commented 3 weeks ago

Incidentally, do we need to add (*ALT_EXTENDED_CLASS) ?

I think not Philip. We are currently missing start-of-pattern, and in-pattern, controls for all of the following: ALLOW_EMPTY_CLASS, ALT_BSUX, DOLLAR_ENDONLY, MATCH_UNSET_BACKREF, ALT_CIRCUMFLEX, and others.

These are all flags which affect the regex dialect in some way, rather than fundamentally change what is or isn't expressible.

I don't think we need a start-of-pattern flag for any of these, nor for ALT_EXTENDED_CLASS: if a pattern author knows to add the flag, they can just as easily edit the pattern itself to match PCRE's dialect. (Admittedly, that would be fiddly for a flag like ALT_CIRCUMFLEX, but certainly possible to rewrite it in terms of ^ and \z. Similarly, anyone who wants ALT_EXTENDED_CLASS can just rewrite to use the Perl extended syntax.)

NWilson commented 3 weeks ago

I have addressed Carlo and Zoltan's latest review comments: more TODOs; Carlo's minor improvements.

PhilipHazel commented 3 weeks ago

I am not away, but I have been letting you guys get on with this patch. I have not looked at it in detail.Please go ahead and merge when you are all happy. (I am also distracted getting a new desktop PC up and configured because my old one is over 10 years old and I'm not sure how long it will go on.)

NWilson commented 3 weeks ago

Thank you Philip! I've squashed again, and added the TODO comment in printint.c requested by Zoltan.

I'm ready to merge, since it's fully-functional (as far as I'm aware), and I'm eager to crack on with the follow-up PRs.

zherczeg commented 3 weeks ago

Ok, patch is landed. I except a lot of follow up works, but it is the nature of these large features.

NWilson commented 3 weeks ago

Great, thank you! My next PR should follow later this week.

PCRE2Project / pcre2

Implement the new set-based character classes #523