PCRE2Project / pcre2

PCRE2 development is now based here.
Other
917 stars 191 forks source link

Proposal for simplifying case-folding (remove PT_CLIST) #514

Open NWilson opened 1 month ago

NWilson commented 1 month ago

One thing that bothers me is the way PCRE2 doesn't handle the ASCII range consistently.

In "normal" (UTF/UCP) mode, the characters /k/i and /s/i are handled differently to /a/i (or all the other characters), with a PT_CLIST opcode, rather than an ordinary OP_CHARI. This means that those two characters miss out on all optimisations and special opcodes like OP_STARI.

It just pains me a little that it's so non-uniform.


I'm about halfway through a PR to get rid of the PT_CLIST technique. But it occurred to me, I should check you like the idea before I spend more time on it!

Current strategy:

New strategy:

This new strategy should be no slower - we're literally doing exactly the same amount of work, both at compile-time and match-time.

I've done enough of the PR to be fairly sure it'll work. It's certainly "correct". It simplifies the matcher, marginally, and makes the character handling more uniform, which is always a bonus.

I can't make any promises, but I don't expect performance to suffer, because it's the same number of operations, theoretically at least. We'd need to measure to be sure, of course.


So - thoughts?

PhilipHazel commented 1 month ago

Interesting, let's see the PR. I cannot remember why I did it the way I did, but probably I wasn't bright enough to think of the case folding approach. Do note, if you haven't already, that the PCRE2_EXTRA_CASELESS_RESTRICT option affects the handling of k and s.

NWilson commented 1 month ago

Yes, I'm aware of PCRE2_EXTRA_CASELESS_RESTRICT. I'm also planning to add a "PCRE2_EXTRA_TURKISH_I" to allow selecting the Turkish case-folding rules.

Very annoyingly... Unicode has "Turkish casefolding" and "every other language in the whole world". That's just for the "simple casefolding" dataset (1-to-1 character mappings only).

Turkey is the most populous European country, and one of the Tier 1 supported languages in Excel. It's not their fault they ended with a bothersome alphabet.

PhilipHazel commented 1 month ago

I have always thought of PCRE as being language agnostic. That is, it deals with characters without regard to any language-specific rules (such as choosing which lower case Greek S to use and matching SS with a German Eszett is in a caselessly matching pattern). I suppose your Turkish proposal isn't really language-specific, rather script specific, and that does seem similar to all the rules about script extensions and script runs and grapheme clusters. I'd call it PCRE2_EXTRA_TURKISH_CASING, however.

NWilson commented 1 month ago

Yes, it's annoying that they succeeded in making the CaseFolding.txt data locale-invariant, except for the Turkish special-casing (which is described at the top of that file). Otherwise there indeed wouldn't be any language-specific behaviour.

For what it's worth - the two Greek lowercase sigmas, and the German Eszett, are not actually language-specific rules. No matter what language you set your computer to, the Greek alphabet, and the Eszett, will behave the same. Those characters have special-cased behaviour, but it's not dependent on what the current locale is, it's just that the required support is classed as "level 2 Unicode support" rather than "level 1" (one-to-one character mappings).

zherczeg commented 1 month ago

I may not understand well, but the new approach is the same as the old except it has an ASCII fast path in the interpreter? JIT has many optimizations for these cases, but most of those would be too complicated for the interpreter (we would need to add many OP_CHARwhatever variants to single character iterators).

NWilson commented 1 month ago

I'm not sure exactly how the JIT works. I haven't had time to poke my head in there yet!

Does it read the "META" codes from the parser, rather than the "OP" codes from the compiler? And then, it builds custom machine code which is optimised for the character that it's matching.

If you are using the "OP" codes, then you'll need to adjust the JIT for the new behaviour of OP_CHARI.

It looks like there's a little room for improvement in compile_ref_matchingpath, which uses the caseset field, and that shouldn't be necessary (although it is correct, but just not optimal). When I open the PR, you'll the see the code-change in the interpreter.

You do have some JIT code that accesses PT_CLIST, which could go away.

zherczeg commented 1 month ago

JIT uses the final byte code (OP_* codes) for code generation, since it needs to work with pre-loaded byte codes. PCRE2 supports saving/reloading byte code.

It tries to optimize the current opcode. For example, the [b-fB-F] range (which is normally a bitset) is converted to (unsigned)((c) | 0x20) <= ('f'-'b'), and generates machine code accordingly. Furthermore, since only one byte characters can match to this, it reads a single byte even when utf is enabled. Technically, JIT has a read_char(min,max) function, which generates code that only reads characters between min,max precisely. All other characters returns with a random value, which must be outside of min,max range.

Whoever does the jit update, the final patch needs to contain it, since otherwise the CI will fail.

The jit has an optimization for unicode categories, which could be moved to interpreter level. Unicode has 30 categories, so they fit nicely into a 32 bit value as a bitset. During the XCLASS analysis compilation step, all classes are collected, and a 32 bit (immedate in case of JIT) value is created. During matching all category check is done by a single (((1 << category(c)) & category_bitset)) != 0 check.

PhilipHazel commented 1 month ago

JIT uses the final byte code (OP_* codes) for code generation, since it needs to work with pre-loaded byte codes. PCRE2 supports saving/reloading byte code.

... and JIT was originally implemented long before the META encoding was invented! Until release 10.23 the final OP_* code generation was done directly from the input pattern. I finally realized that a lot of issues (many concerned with forward references) could be solved by adding an extra pass.

NWilson commented 1 month ago

Thank you for explaining!

I'll park this issue then - it's not urgent. I'd better come back to it when I have more experience with the JIT.