PCRE2Project / pcre2

PCRE2 development is now based here.
Other
917 stars 191 forks source link

Add substitute case callout function #512

Closed NWilson closed 1 month ago

NWilson commented 1 month ago

This arises from the discussion we had a few weeks ago on the Excel call.

We discussed improving the case-handling of the pcre2_substitute function, but in general, Philip seemed not overly-enthusiastic, simply because correct locale-aware handling of user-visible strings is hard.

I agree. This PR adds a callout (callback) function to allow a third-party Unicode engine to be used for user-visible string processing. This can be used by applications to do locale-aware case transformations.

It's still a one-char-to-one-char mapping, which is simplistic, but allows support for more locales than the current system.

Aside: PCRE2's current Unicode handling for pattern matching (using CaseFolding.txt data) is really rather good. This should not be locale-aware, since case-equivalence of characters is defined in a locale-independent manner. The uppercasing/lowercasing performed by pcre2_substitute really is a special case.

zherczeg commented 1 month ago

Just a question. The substitute is not a complex code, would not be a better idea to duplicate it for your use case? I would probably do this to reduce the dependence form a generic code.

NWilson commented 1 month ago

Haha! Yes, I totally agree. I have duplicated the code internally, twice, to make this change, but each time the PR into the Excel codebase has been rejected.

The Excel managers really believe that PCRE2 and all its functions like pcre2_substitute() are perfect, and that any change I make is wrong. That's because "the only regex engine customers know about is PCRE2; if there's the slightest deviation from it then it will break customer expectations and be considered a bug". We are going to exactly snapshot the behaviour of PCRE 10.45, and make it our job to ship that for the next forty years.

So unfortunately, if I want to make any improvements for our application (like better Unicode support) then I have to do that in the official code.

PhilipHazel commented 1 month ago

PCRE2's current current Unicode handling for pattern matching (using CaseFolding.txt data) is really rather good.

I'm flattered. :-)

zherczeg commented 1 month ago

I never thought that PCRE2 is that important from PR perspective. Does the "ship that for the next forty years" is sarcasm, or the actual plan? I am still curious about your longer term collaboration plans with us. This can be discussed in private emails.

PhilipHazel commented 1 month ago

I have done some very minor updates to the documentation updates, including updating the dates and PCRE2 version number (note that for many doc files the date is both at the top and the bottom).

carenas commented 1 month ago

I have done some very minor updates

Don't forget we need to also update the generated pcre2.h.generic file or risk breaking the build for NON-AUTOTOOLS-BUILD users of HEAD.

PhilipHazel commented 1 month ago

Sigh. Yes, of course. Done.

NWilson commented 1 month ago

Does the "ship that for the next forty years" is sarcasm, or the actual plan?

That's not really sarcasm, it is basically the plan.

Excel is an "end-user programming language" in academic jargon. People don't configure it (like apached or exim), people actually write "applications" inside it. And there are billions of users, so any backwards-incompatible change has to be managed very carefully. The regex feature is going into Excel's formula language, which is Excel's "standard library".

The level of caution is on the same kind of level as .NET or Java: never make a backwards-incompatible change, because customers depend on stability. But Excel carries this a level further - even if it's a bugfix, that's regarded as "backwards-incompatible", so behaviours are updated very, very slowly.

Excel is 41 years old currently, and has billions of users (literally, according to public estimates). I think Microsoft is intending it to stay in business for another 40 years.

The short answer is: we will be updating our version of PCRE2 rather rarely, and very cautiously.

I have done some very minor updates to the documentation updates

Thank you, I'm very grateful!

carenas commented 1 month ago

even if it's a bugfix, that's regarded as "backwards-incompatible"

my concern (and I could be wrong since I had only skimmed over the PR and hadn't seen the new API being used by an application) is how are you planning to handle the "obvious" bug that will be coming because of the 1 to 1 character limitation with for ex: Maße —> MASSE

to clarify, I am not objecting to it, but just think it would need to be eventually extended anyway, so it might be better if it works in multiple characters to begin with (at least for its output), specially considering the long term commitment.

NWilson commented 1 month ago

That's a good question, it is a "bug". However, Perl has the same bug if you try "aßc" =~ s/(.*)/\U$1/ (it gives AßC rather than ASSC).

In general, regex engines have poor or non-existent support for multi-character sequences, so it's consistent with everything else in PCRE2 that we don't handle these.

Supporting it would be a substantial effort, with quite a major code change to pcre2_substitute(). The reason is that, as well as multi-character sequences, case transformations are also contextual, and need to be able to do lookahead and lookbehind. So if we wanted to be really fancy... we'd have to save the casing transformations into a buffer, and apply them at the end.

I decided our goal should be the same as PCRE2's case-folding support: to function correctly for the "simple" (one-to-one) character mappings, and not support the multi-character mappings. That's really more than anyone expects from a regex engine, as measured against other engines at least.

zherczeg commented 1 month ago

We need to add that ß is also a greek letter as well, and its uppercase is B. I don't know if unicode has two different letters for them.

NWilson commented 1 month ago

Greek beta is a completely different character to German Esszet (although they do look extremely similar). They have no Unicode properties in common.

Similarly, the Greek uppercase Beta is completely different to uppercase Latin "B" (although fonts often use exactly the same glyph). Same goes for the Cyrillic characters with identical appearance to Latin.

PhilipHazel commented 1 month ago

It does. U+00DF is German "Eszett", U+03B2 is Greek lower case beta.

carenas commented 1 month ago

Perl has the same bug

$ perl -Mutf8 -e '$v="aßc"; $v =~ s/(.*)/\U$1/u; print "$v\n"'
ASSC
NWilson commented 1 month ago

Oh you're right, I'm so sorry. I didn't have utf8 enabled in my test. I get the same result as you.

carenas commented 1 month ago

I decided our goal should be the same as PCRE2's case-folding support: to function correctly for the "simple" (one-to-one) character mappings, and not support the multi-character mappings.

BTW, I don't disagree with the goal, I am just concerned that this API will be baked in the next release and when that goal changes, we will need ANOTHER API to support it.

I see a few options:

1) change the API to accept a size and a struct pointer, instead of the function pointer. The implementation will have a version and the original pointer to the function in a field called "simple". If you feel confident (as I kind of do) that we will only need to support simple and "full" can get rid of the version number and add a second field that will be NULL (but I think the first option is better), obviously we will need to add some error codes for when we get passed a version of the struct that is not supported, but will come useful in the long run.

2) add another parameter to the current API for "full", kind of like 1B but less difficult to use, albeit less flexible and would also force us to define the interface for that function that I don't think we really know.

3) rename the API so that it is clear from the name that it is only used for "simple" case transformations and we will have to add another API (or several more) when the inevitable bug report comes, but at least doing so wouldn't require an API change or a sore looking API name.