Closed zherczeg closed 1 month ago
I am also thinking about the side effects of OP_FAIL, since the character is not read. It is not a problem with invalid utf support, since it fails regardless. But it might be a problem with hard partial matching.
Perhaps we should introduce a match nothing opcode. Since it is very rare corner case (not worth much optimization), we could simply generate an empty bitset for it.
Yes, I like your idea of an empty bitset. That should make the behaviour consistent. I must have thought that *FAIL was an optimization, but you have discovered the flaws in my thinking.
@zherczeg I've seen you're doing quite a few changes to character-class processing.
Are those changes nearly done, after this PR, or do you have further work planned? I've started on prototyping the "extended character classes" feature, and I'd ideally like to avoid too many conflicts, so it would be helpful to know when you're done with the code (for the moment).
The primary weakness of the code that it still does not do logarithmic search. I still haven't figured out how it should be done. Adding a list with 32 bit characters consumes too much space. Also there is a problem of ranges / characters, representing characters as ranges also consumes too much space. A 1 bit information could represent if a value represent itslef or the lower value of a range. But that decreases the available bits by one. We could have two sets: one for characters, one for ranges. Too hard to decide.
There is also a question how to do the matching engine. Probably we should mark the beginning of a character range with set operations with an opcode, which allocates a 32 bit number for a stack. The subranges (kind of primary expressions) would push a 1 bit constant onto this stack, and the binary operators use the two lower bits of the stack. This method does not require xclass nesting. The 32 byte bitset for the first 256 characters could be pre-computed for the entire operation.
@zherczeg I think you've convinced me, to do the set operations at match-time, rather than precompute the character set at compile-time.
The downside is, there's no way I'll be able to do the JIT implementation (or rather, it would take me waaay too long). If I did the implementation for the standard matcher, would you be willing to help on the JIT one?
I will help with the jit code. I also would like to discuss the byte code representation before you make the code.
Some ideas:
Let A,B,C,... represent any character sets, which includes characters, ranges, properties
Then, from the [A && B || [^C] && [D -- [^E]]]
class we should generate something like this:
OP_CLASS_EXPR
- Creates an empty 32 bit bitset
(a simple 32 bit value). The class block runs until OP_CLASS_END
, which is an opcode we already have (I hope I correctly remember its name)
OP_XCLASS
for A - checks that the character matches to A. It pushes its result to the stack:
bitset = (bitset << 1) | value
OP_XCLASS
for B - same as above, bitset = (bitset << 1) | value
OP_AND
: bitset = ((bitset & 0x1) | (~0 << 1) ) & (bitset >> 1)
OP_XCLASS
for C - same as above, bitset = (bitset << 1) | value
OP_XCLASS
for D - same as above, bitset = (bitset << 1) | value
OP_XCLASS
for E - same as above, bitset = (bitset << 1) | value
OP_SUB
: bitset = ((~bitset & 0x1) | (~0 << 1) ) & (bitset >> 1)
OP_AND
: bitset = ((bitset & 0x1) | (~0 << 1) ) & (bitset >> 1)
OP_CLASS_END
In this concept, XCLASS is unchanged. There is no operation nested inside XCLASS. We might need negate operation for [^A&&B]
, but that is simple: bitset ^= 1
. The UCD record for the current value can be cached in a local variable. Then evaluating properties would be trivial.
The stack depth can be computed at compile time. If the depth is more than 32, we reject the pattern. I expect that most patterns will be very simple, which only use 1-2 operations.
We can pre-compute the bitset for the first 256 characters. The bitset can be stored in OP_CLASS_EXPR
.
Another alternative: use assertions
Turn [A && B]
to [A](?<=[B])
I don't like this direction though. Evaluating assertions are more costly than the code above.
We can't use assertions, you're right. They would become horrendously complicated for the case you gave: [A && B || [^C] && [D -- [^E]]]
.
I agree we want to re-use the existing OP_XCLASS/NCLASS/CLASS
for the "leaves" of the character set expression. A leaf may have [...]
brackets or may not, for example, in [a-z0-9 -- 5]
the "leaves" are the (existing) OPcode representation for [a-z0-9]
and [5]
. So we don't need any new code for encoding characters, ranges, properties...
We'll need OP_AND / &&
, OP_SUB / --
, OP_OR / || or implicit
, OP_NOT / ^
.
Some examples: [ [ A && B ] [ C && D ] ]
This will require 4 "leave" OPcodes for A/B/C/D, and we'll need to use the OP_OR to combine the two intersections.
Another one: [A -- [^ B && C ] ]
. We'll need the OP_NOT here, to negate the top of the stack after B && C
.
We won't need opcodes for grouping [...]
brackets, as you note. The opcodes are just a reverse-polish sequence.
So the pseudocode would be:
switch (*op_ptr) {
case OP_NEW_CLASS_EXPR:
++op_ptr;
uint32_t stack = 0;
while (*op_ptr != OP_CLASS_END) {
switch (*op_ptr) {
... handle OP_AND etc
case OP_CLASS:
case OP_NCLASS:
case OP_XCLASS:
// This code surely could use a function call to match the xclass, and share the code with top-level
// XCLASS handling. Or use an inline function. Let's not have some goto-based thing to
// share the XCLASS with CLASS_EXPR...
leaf_matched = match_xclass(&op_ptr, subject_ch);
stack = (stack << 1) | leaf_matched;
break;
}
}
matched = (stack & 0x1) != 0;
I have found an interesting issue:
The problem is, that
[]*
should match to an empty string similar to:My idea: when
OP_FAIL
is repeated, and 0 count repeat is possible, we should remove theOP_FAIL
.If you agree, I want to fix this and improve some code after #508