PCRE2Project / pcre2

PCRE2 development is now based here.
Other
918 stars 191 forks source link

Change xclass character min/max detection #566

Closed zherczeg closed 2 days ago

zherczeg commented 6 days ago

The new code is not necessary, but it will be useful for a future eclass implementation. It is a bit of a slowdown actually, although the code looks nicer.

@NWilson I don't plan any jit support for eclass until the bitset from xclasses moves to a single bitset in eclass, because that changes the bytecode layout.

Ideally, the eclass should only contain an optional bitset for the first 256 characters, and a list from xclass/allany/eclass_operation items. When an xclass is empty (e.g. its bitset handles all of its elements), the empty xclass should be represented by an OP_ALLANY+OP_ECLASS_NOT byte code pair, which can be detected by jit (or even the interpreter).

NWilson commented 6 days ago

OK, feel free to make whatever changes you like to the JIT code.

I'll need to take a break from PCRE2 for a week, I'm very busy this next week. But then I'll move onto the optimisations PR.

carenas commented 5 days ago

I don't plan any jit support for eclass until the bitset from xclasses moves to a single bitset in eclass

This might be the first release we do where a major feature isn't supported by JIT though.

PhilipHazel commented 5 days ago

For PCRE2, yes, possibly, but there was over a decade of PCRE1 before JIT was even invented. :-) And the first releases with JIT didn't support everything ... it was gradually extended.

zherczeg commented 5 days ago

Is this the plan? Am I missed something?

carenas commented 4 days ago

at least git will be affected, because it "mistakenly" uses the JIT fast path and therefore will behave erratically.

i can produce a patch to fix it, but the release process takes about 1 month there, so it might be a constrain for how long we have to keep the RC around, or even if we might need to do an RC2

NWilson commented 4 days ago

Are you referring to this code in git?

https://github.com/git/git/blob/090d24e9af6e9f59c3f7bee97c42bb1ae3c7f559/grep.c#L356

It seems to ...mandate that PCRE2 regexes are supported by the JIT. It dies if you provide a regex that can't be jitted! Hmm. The error message even says, "Couldn't JIT the PCRE2 pattern '%.*s'%s, got '%d'%s" ... "\nPerhaps prefix (*NO_JIT) to your pattern?". So apparently, if you use a new regex like (?[...]) then git will tell you to add (*NO_JIT), rather than silently falling back to the interpreter.

That's a bit harsh. Anyway, I thought that perhaps Zoltan was intending to add JIT support for ECLASS, but he's waiting until I do the required optimisations. (That seems sensible to me - why write code, when you know you'll need to change it later.)

The ECLASS changes in pcre2_match.c are really very short & sweet.

zherczeg commented 4 days ago

Me thought in the same way. For me, merging the bitsest is an important optimization for both speed and pattern size. It will change the byte-code layout, and OP_CLASS/OP_NCLASS will not possible in eclass anymore. It is pointless to work on jit support before that. The non-recursive parser will not change the byte code layout (hopefully), and less important.

@NWilson plans to do the optimizations as we discussed, so I just wait. I think it is too much to expect him to do the jit support, so I will do it. The new code will likely not be too much, but a lot of knowledge is needed, and it will have some tricky parts.

zherczeg commented 3 days ago

Does anybody plans to review this patch?

carenas commented 3 days ago

Does anybody plans to review this patch?

I didn't see anything that worried me when I did originally, but didn't "approve" to give others a chance.

could take a closer look but might take a while and at best would be about some missing "const", most likely.

The commit message might be better though by explaining it as a preparatory patch to implement jit support in the future.

zherczeg commented 2 days ago

I have improved the commit message.