
Some questions about UCP/UTF etc #507

Open NWilson opened 3 hours ago

NWilson commented 3 hours ago

I want to check whether I've understood the following correctly:

  1. PCRE2_UCP/(*UCP) and PCRE2_UTF/(*UTF) are independent (see the compile-option sketch after this list).
  2. UTF determines the meaning of bytes: how to decode the uint8_t/uint16_t/uint32_t code units into an integer character value. If it's off, then there is one code unit per character; if it's on, UTF decoding reads one or more code units into that integer.
  3. UCP determines the meaning of that character value. If UCP is off, then the meaning is determined by the tables provided (with values ≥ 256 treated simply as opaque, unassigned codes). If it's on, then the tables provided are ignored entirely, and the meaning of each character is determined by the bundled Unicode data. So if an application supplies some interesting tables, they won't be used in UCP mode, which forces the values 0-127 to be ASCII. I'm assuming that basically everyone expects the characters 1-127 in those tables to be ASCII, but the values 128-255 could well be user-customised, to indicate whether they're working with Latin-1 or ISO-8859-5 (the Cyrillic 8-bit extension of ASCII).
  4. The EBCDIC support is interleaved with these somewhat. EBCDIC special-casing is not done when UCP is enabled. When UCP is not enabled, some special codepaths can be built in to compensate for the fact that A-Z may not have consecutive codepoints. When PCRE2 is compiled for EBCDIC, we assume that the compiler is also using EBCDIC, so that a char literal 'Z' != 'A' + 25 (and, conversely, when EBCDIC is not enabled we rely on the fact that 'Z' == 'A' + 25).

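As a minimal sketch of what I mean by "independent": the two options are separate bits passed to pcre2_compile() and can be combined in any of the four ways (the pattern here is just illustrative, and error handling is omitted):

```c
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

/* Compile an illustrative pattern with a given combination of options. */
static pcre2_code *compile_with(uint32_t options)
{
  int errorcode;
  PCRE2_SIZE erroroffset;
  return pcre2_compile((PCRE2_SPTR)"\\w+", PCRE2_ZERO_TERMINATED,
                       options, &errorcode, &erroroffset, NULL);
}

int main(void)
{
  pcre2_code *a = compile_with(0);                      /* -UTF, -UCP */
  pcre2_code *b = compile_with(PCRE2_UTF);              /* +UTF, -UCP */
  pcre2_code *c = compile_with(PCRE2_UCP);              /* -UTF, +UCP */
  pcre2_code *d = compile_with(PCRE2_UTF | PCRE2_UCP);  /* +UTF, +UCP */
  pcre2_code_free(a); pcre2_code_free(b);
  pcre2_code_free(c); pcre2_code_free(d);
  return 0;
}
```
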
I'm also curious about the way you've chosen to encode the CaseFolding.txt data. That data is described as a table of mappings char → folded char, but you ship it in an equivalent but different form: char → OTHER_CASE, plus a handful of equivalence classes of 3-4 characters that are case-insensitively equal (ucd_caseless_sets). The "case-closed-sets" representation is important for non-simple case folding, with things like ß→SS, but PCRE2 doesn't support that. The case-folding algorithm was designed so that you wouldn't need to ship the ucd_caseless_sets data.

Indeed, the OP_CHAR and OP_CHARI code, which could (potentially) benefit from pre-computing the OTHER_CASE data, in fact just looks it up from the tables anyway. So, I'm looking into whether it even makes sense to ship the case-folding data in this way.

For my character classes work, I want to be able to handle complex character class operations without ever having to iterate over all the characters in a range. For example, we need to be able to do /[\x{1000}-\x{10000}]/i efficiently, but also /[\x{1000}-\x{10000} && \p{Script=foo}]/i. Storing the case data in separate compressed tables would make this simpler.

Basically, I am hoping it would be OK to just ship CaseFolding.txt without as much pre-processing, bundling it as a run-length-encoded table of mappings for the character class intersections to use.
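
To illustrate what I have in mind (the layout below is purely hypothetical, not anything PCRE2 currently ships): contiguous stretches of CaseFolding.txt whose mappings share the same offset collapse into single runs, so whole ranges fold to ranges without visiting individual characters.

```c
#include <stdint.h>

/* Hypothetical run-length-encoded simple case folding: every codepoint in
 * [from, to] folds by adding the same signed delta. */
typedef struct {
  uint32_t from;
  uint32_t to;
  int32_t  delta;
} fold_run;

/* A tiny excerpt; the real table would be generated from CaseFolding.txt.
 * Folding a whole class interval is then a run-by-run operation that emits
 * at most a few output intervals. */
static const fold_run fold_runs[] = {
  { 0x0041, 0x005A, 32 },   /* A-Z (ASCII)    -> a-z */
  { 0x00C0, 0x00D6, 32 },   /* À-Ö (Latin-1)  -> à-ö */
  { 0x0391, 0x03A1, 32 },   /* Α-Ρ (Greek)    -> α-ρ */
  { 0x0410, 0x042F, 32 },   /* А-Я (Cyrillic) -> а-я */
};
```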

I'm currently reviewing all the Unicode support and trying to check whether any other pieces of it could be simplified.

NWilson commented 2 hours ago

Oh, I think I've completely misunderstood above what PCRE2_UTF and PCRE2_UCP are intended to do.

So they're really not as orthogonal as I'd imagined. There are various combinations of the PCRE2 behaviours that can't be described in terms of these two options. Oh well.

~The UTF and UCP behaviours would ideally be a bit more cleanly (and flexibly) separated.~ I guess it doesn't matter. The combinations of behaviours that are useful are expressible.

| | -UTF, -UCP | +UTF, -UCP | -UTF, +UCP | +UTF, +UCP |
| --- | --- | --- | --- | --- |
| input code units | one-by-one | multi-unit | one-by-one | multi-unit |
| **meaning of decoded codepoints** | | | | |
| -- for case-insensitivity | uses tables | always Unicode | always Unicode | always Unicode |
| **meaning of pattern elements** | | | | |
| -- \d, etc. | uses tables (eg Latin-1); only matches <256 | ASCII (but uses user-provided tables, a bit weirdly, rather than a hard-coded ASCII table) | Unicode | Unicode |
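
To make the +UTF columns concrete, here's how I'd expect \d to behave against U+0663 ARABIC-INDIC DIGIT THREE, assuming I've read the table right (a sketch, with error handling omitted):

```c
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

/* U+0663 ARABIC-INDIC DIGIT THREE, UTF-8 encoded. */
static const PCRE2_UCHAR subject[] = { 0xD9, 0xA3, 0 };

static int matches_digit(uint32_t options)
{
  int errorcode, rc;
  PCRE2_SIZE erroroffset;
  pcre2_code *re = pcre2_compile((PCRE2_SPTR)"\\d", PCRE2_ZERO_TERMINATED,
                                 options, &errorcode, &erroroffset, NULL);
  pcre2_match_data *md = pcre2_match_data_create_from_pattern(re, NULL);
  rc = pcre2_match(re, subject, 2, 0, 0, md, NULL);
  pcre2_match_data_free(md);
  pcre2_code_free(re);
  return rc > 0;
}

int main(void)
{
  printf("+UTF,-UCP: %d\n", matches_digit(PCRE2_UTF));             /* expect 0: ASCII-only \d */
  printf("+UTF,+UCP: %d\n", matches_digit(PCRE2_UTF | PCRE2_UCP)); /* expect 1: Unicode Nd */
  return 0;
}
```
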
zherczeg commented 1 hour ago

The table looks valid. Most of the combinations are legacy. E.g. when UTF support was added, it worked in the way that most people want. Then a few adjustments were needed, and UCP was created. It is not possible to change it without breaking software.

I have no idea how EBCDIC and UTF work together; I suspect nobody uses that combination.

Caseless computation: https://github.com/PCRE2Project/pcre2/blob/master/maint/GenerateUcd.py#L676

The ß -> SS mapping causes side effects which are not resolvable. If a caseless set had [sß], it would not be possible to decide whether an SS matches s or ß, and there is no backtracking into classes. The characters in a set are equal; there is no precedence among them.

What kind of operations do you need? I hope you are not planning to compute the ranges; that would increase the class sizes (memory consumption) a lot. It would be better to make a full expression evaluator.

PhilipHazel commented 1 hour ago

You will probably not be surprised to learn that the current state is the result of a couple of decades of modification. UTF was originally intended just to specify how to interpret a string of {bytes, halfwords, words} as 32-bit characters. UCP was then added to invoke Unicode properties for things like \d and to enable \p. You are correct in thinking that the state -UTF,+UCP was intended for those who want UCS-2 or just plain 16-bit or 32-bit code points to have access to Unicode properties. The ability for users to provide their own tables was an early addition, before UTF, for the benefit of those who were using the locale features that preceded Unicode. Is it still used? Who knows? There's probably some legacy program somewhere.

And yes, they are not entirely orthogonal. Back in the PCRE1 days, UTF support and UCP support were independently configurable at build time (because they were added at different times, UTF in 2000 and UCP in 2010). I've been trawling the PCRE1 ChangeLog file, but I haven't yet found a statement about caselessness in the absence of UCP.

EBCDIC is another whole story. Originally there was just some support for running on EBCDIC systems. Then somebody wanted a version that could interpret UTF-8 code points when running on an EBCDIC system. So there are, in theory, two libraries you can build on such a system: one that interprets input as EBCDIC, the other that interprets it as UTF-8. I have no idea whether anybody still uses the latter.

NWilson commented 1 hour ago

> What kind of operations do you need? I hope you are not planning to compute the ranges; that would increase the class sizes (memory consumption) a lot. It would be better to make a full expression evaluator.

Interesting, I was planning to compute the ranges at compile-time, as mentioned in an email last week. I had assumed that front-loading as much work as possible into the compile-time would be preferable, and result in a smaller matcher (and indeed, potentially not require any changes at all in the matcher).

Currently, it appears that patterns like /[\x{1000}-\x{10000}]/i actually iterate over each and every character in the range (at compile-time), to compute the matching set, unless I'm missing something (quite possible).

With a char-class representation of "list of intervals", the set operations can be computed efficiently, and storing the casing data as "list of intervals" similarly enables efficient caseless matching over large ranges.

You could be right, though; it may be better in the end to do work in both the interpreter and the JIT to add an expression matcher. But I'd prefer not to touch both if I can avoid it, and it's not clear it would be any faster than doing one lookup (binary search) against a precomputed list of intervals.
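
For what it's worth, the membership test I have in mind is a single binary search over a sorted, disjoint list of intervals; a rough sketch (names and layout are mine, not existing PCRE2 code):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical compiled form of a character class: sorted, non-overlapping
 * codepoint intervals. */
typedef struct { uint32_t lo, hi; } cp_interval;

/* Returns 1 if codepoint c falls inside any interval, using one binary
 * search. Intersection/union of two such lists is a linear merge over the
 * endpoints, so nothing ever enumerates individual codepoints. */
static int class_contains(const cp_interval *set, size_t n, uint32_t c)
{
  size_t lo = 0, hi = n;
  while (lo < hi)
  {
    size_t mid = lo + (hi - lo) / 2;
    if (c < set[mid].lo) hi = mid;
    else if (c > set[mid].hi) lo = mid + 1;
    else return 1;
  }
  return 0;
}
```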