dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Regex incorrectly identifies surrogate pair Unicode category #16922

Open MaceWindu opened 8 years ago

MaceWindu commented 8 years ago

This defect makes regex unusable when you need to match a string using Unicode categories and the string could contain surrogate pairs:

        [Test]
        public void ClassifySurrogates_Test()
        {
            var value = "𦕒";
            var regexLetters = new Regex(@"^\p{Lo}$");

            var regexSurrogates = new Regex(@"^\p{Cs}{2}$");

            Assert.AreEqual(UnicodeCategory.OtherLetter, char.GetUnicodeCategory(value, 0));
            Assert.AreEqual(UnicodeCategory.OtherLetter, CharUnicodeInfo.GetUnicodeCategory(value, 0));
            Assert.True(regexSurrogates.IsMatch(value));

            // Fails here
            Assert.True(regexLetters.IsMatch(value));
        }
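For context, a minimal sketch of why the last assertion fails (assuming the engine classifies one UTF-16 code unit at a time, which later comments in this thread confirm): each half of a surrogate pair is UnicodeCategory.Surrogate on its own, and only the string-based APIs see the whole code point. The example uses U+10300, an astral-plane letter, rather than the exact character above.

```csharp
using System;
using System.Globalization;

class SurrogateCategorySketch
{
    static void Main()
    {
        // U+10300 OLD ITALIC LETTER A, an astral-plane letter (category Lo)
        string value = char.ConvertFromUtf32(0x10300);

        // Classifying code units individually: both halves are Cs.
        Console.WriteLine(char.GetUnicodeCategory(value[0])); // Surrogate
        Console.WriteLine(char.GetUnicodeCategory(value[1])); // Surrogate

        // Classifying the code point at index 0 gives the real category.
        Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory(value, 0)); // OtherLetter
    }
}
```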
joshfree commented 8 years ago

I suspect this issue exists in desktop and isn't unique to core

tarekgh commented 8 years ago

@Priya91 The Unicode category APIs look fine here, so the issue appears to be in regex.

MaceWindu commented 8 years ago

@joshfree, you are right. I didn't dig into the source, but I suspect the code responsible for this uses GetUnicodeCategory(char) instead of GetUnicodeCategory(string, charIndex).

peltco commented 6 years ago

I'll take a stab at this as part of the hackathon

danmoseley commented 4 years ago

@peltco I'll unassign you unless you still have an interest 2 years later 😄

wzchua commented 3 years ago

Regex doesn't take surrogate pairs into consideration. RegexCharClass operates on char: public static bool CharInClass(char ch, string set, ref int[]? asciiResultCache). Maybe this is where Rune comes in?
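For what it's worth, a sketch of that Rune-based view (not a proposal for how RegexCharClass itself would change): enumerating Runes walks whole code points, so categories come out right for astral characters without any surrogate special-casing.

```csharp
using System;
using System.Text;

class RuneView
{
    static void Main()
    {
        // 'a' followed by U+10300, an astral-plane letter (category Lo)
        string s = "a" + char.ConvertFromUtf32(0x10300);

        // EnumerateRunes yields one Rune per code point, merging surrogate pairs.
        foreach (Rune r in s.EnumerateRunes())
            Console.WriteLine($"U+{r.Value:X4} {Rune.GetUnicodeCategory(r)}");
        // U+0061 LowercaseLetter
        // U+10300 OtherLetter
    }
}
```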

ghost commented 1 year ago

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

tjpalmer commented 1 year ago

This absolutely is a huge weakness in dotnet regex, and an issue that I don't have in JS, Python, or Java, for example.

joperezr commented 1 year ago

This absolutely is a huge weakness in dotnet regex.

Can you elaborate a bit more on that? We haven't seen many requests for adding surrogate support in Regex other than this issue, which got closed as not planned, given that we haven't seen a lot of people trying to perform searches on text that has surrogates in it. It would be good to understand your scenario (more specifically, why you are searching for surrogates in text) in order to be able to re-evaluate our focus.

tjpalmer commented 1 year ago

I'm working on a system (not public right now) where we expect to promote regex for text parsing, including of natural text, and lacking proper Unicode support inhibits compatibility across different regex engines.

Current workaround plan is either (1) skip compatibility or (2) expand to giant charsets at runtime based on Unicode tables.

joperezr commented 1 year ago

I see, thanks for sharing. Our implementation of Regex does assume we are operating over single characters in a lot of places, and no one has really researched what it would take to add support for surrogates (but I do expect it would likely be a lot of work). In order to move forward, we would need someone to design it and to ensure that we can do it in a way that won't regress any existing case (as we really expect surrogates to be less than the 1% case). For anyone interested in coming up with this proposal, I would be happy to review it, but this is not one of our current priorities.

tjpalmer commented 1 year ago

Makes sense. Thanks for the feedback. Once we get to that point on our project, we can discuss option (3) contribute surrogate pair support to dotnet. I'm not sure if this is likely, but if we think it seems like a good idea, I'll get back in touch here. Feel free to do with this issue as you see fit meanwhile.

mikesamuel commented 1 year ago

Even ignoring natural language text, regular expressions are often used for splitting source code into tokens. Those tokens include identifiers, and there are extant human languages with letter and number code-points outside the basic plane.

For example, https://en.wikipedia.org/wiki/Miao_(Unicode_block)

danmoseley commented 1 year ago

It might be interesting to know more about how JS/Python/Java (or PCRE and Rust) handle this. Is it user-visible? Our engine has a pretty firm assumption of 2-byte characters. I know PCRE has a 16-bit character mode and an 8-bit character mode. What would it look like to support variable-width characters?

tjpalmer commented 1 year ago

For Python, I should clarify that it emphasizes 32-bit code points for all strings, including for regex needs, such as . meaning a full code point. But its re module doesn't have Unicode property names built in; for that, it recommends the third-party regex module. And I haven't looked much at the implementation details of the engines in question.

xanatos commented 1 year ago

In JavaScript you have a u flag that enables Unicode mode (/foo/u, or new RegExp('foo', 'u')). It enables \u{12345} code point escapes, makes . match surrogate pairs (and even lone unpaired surrogates), and enables \p{characterclass} and \P{characterclass}. Note that, for example, /[^x]/u will match any surrogate pair (plus any character that isn't an x).

mikesamuel commented 1 year ago

Besides enabling extra RegExp syntax as @xanatos notes, the JS u flag affects how the matcher sees string chunks.

https://tc39.es/ecma262/#sec-regexpbuiltinexec

22.2.7.2 RegExpBuiltinExec ( R, S )

...

  1. If flags contains "u", let fullUnicode be true; else let fullUnicode be false. ...
  2. If fullUnicode is true, let input be StringToCodePoints(S). Otherwise, let input be a List whose elements are the code units that are the elements of S.
mikesamuel commented 1 year ago

That view, computing the code-points ahead of matching, is a useful abstraction.

In practice, if your underlying representation of a string is a list of bytes known to be valid UTF-8, you can iterate forward and backwards over code-points.

I see there's a RightToLeft mode which, IIUC, has semantic significance for which group in /(a+)(a+)/someflags ends up not being length 1, so you need to iterate in reverse.

Is there anything in the current implementation that necessitates random access to characters in the input?

stephentoub commented 1 year ago

if your underlying representation of a string is a list of bytes known to be valid UTF-8

The representation is a UTF-16 ReadOnlySpan<char>. The implementation today looks at individual char (16-bit) values.

Is there anything in the current implementation that necessitates random access to characters in the input?

Backtracking requires the ability to randomly access data in the input, but only to locations previously seen. There are also optimizations that utilize random access to jump forwards, such as jumping to some offset from the end of the input in some cases involving end anchors. Etc.

That view, computing the code-points ahead of matching, is a useful abstraction.

If we were to do anything here, we'd want to compute as much as possible as part of the construction/codegen phase, e.g. as part of the Regex ctor / parsing / tree optimization and/or as part of the source generator. It would also very likely need to be opt-in, such as with a new RegexOptions flag, both for compatibility and for performance.

We already use RegexOptions to impact how parsing / tree optimization is performed, e.g. RegexOptions.IgnoreCase causes individual characters in the pattern to be translated at construction/parsing time into the corresponding case-insensitive sets. We'd likely do some amount of similar transformation as part of constructing the tree. We'd also probably need a new representation for known categories like \w, which would then allow matching and code generation to continue using current optimizations for such sets while making the more expensive Unicode matching pay-for-play.

joperezr commented 1 year ago

And just to add one more thing, we would also need to consider the case of someone opting into this mode while also using RegexOptions.IgnoreCase, since IIRC there are various upper/lower case mappings between surrogate pairs and 16-bit Unicode characters, so those would need to be added to our logic for case-insensitive comparisons.

xanatos commented 1 year ago

Note that considering the string as codepoints makes a big difference. Example in Javascript:

const rx1 = /\uDE00..$/;
const rx2 = /\uDE00..$/u;
console.log(rx1.test('\uD83D\uDE00xx')); // true
console.log(rx2.test('\uD83D\uDE00xx')); // false: \uD83D\uDE00 is a single codepoint, it can't be partially matched by the RegExp
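For comparison, a sketch of the same pattern in .NET today, which, like JavaScript without the u flag, operates on code units, so the match can start in the middle of the surrogate pair:

```csharp
using System;
using System.Text.RegularExpressions;

class MidPairMatch
{
    static void Main()
    {
        // A lone low surrogate in the pattern, matched against a full pair.
        var rx = new Regex("\uDE00..$");

        // The engine happily starts matching at the low surrogate code unit.
        Console.WriteLine(rx.IsMatch("\uD83D\uDE00xx")); // True
    }
}
```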
mikesamuel commented 1 year ago

The representation is a UTF-16 ReadOnlySpan<char>. The implementation today looks at individual char (16-bit) values.

Ah, right. In that case, reverse iteration over code-points is pretty straightforward.

Backtracking requires the ability to randomly access data in the input, but only to locations previously seen.

Ah. An int known to be at a code-point boundary should be just as good then.

There are also optimizations that utilize random access to jump forwards, such as jumping to some offset from the end of the input in some cases involving end anchors. Etc.

Ah, so in /^(.{100})x$/, you offset from end to look for x but don't jump forward by 100 based on a precomputation of the UTF-16 width of (.{100})?

We'd also probably need a new representation for known categories like \w

I think https://unicode.org/reports/tr18/#Compatibility_Properties has recommendations on what goes in \w.

IIRC, the ICU4J people did a lot of experimentation with representations of Unicode sets for UTF-16 strings.

I think their representation of UnicodeSets that didn't fit in a boolean[256] was a sorted set of ints that they could binsearch into. So /[\u0000-\uD7FF\uE000-\uFFFF]/ would be represented as [0, 0xD800, 0xE000, 0x10000], which means [0, 0xD800) ∪ [0xE000, 0x10000).
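That sorted-boundary ("inversion list") idea can be sketched as follows (a hypothetical helper, not ICU4J's or .NET's actual implementation): boundaries mark where membership flips, so a code point is in the set iff the index of the last boundary at or below it is even.

```csharp
using System;

static class InversionList
{
    // [0, 0xD800) ∪ [0xE000, 0x10000): all BMP code points except surrogates.
    // Even indices open a range; odd indices open a gap.
    static readonly int[] Boundaries = { 0x0000, 0xD800, 0xE000, 0x10000 };

    public static bool Contains(int codePoint)
    {
        int i = Array.BinarySearch(Boundaries, codePoint);
        if (i < 0) i = ~i - 1;        // index of the largest boundary <= codePoint
        return i >= 0 && i % 2 == 0;  // even index => inside a [start, end) range
    }
}
```

With this layout, Contains('A') is true and Contains(0xD800) is false, and membership is O(log n) in the number of range boundaries rather than O(n) in the size of the set.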

mikesamuel commented 1 year ago

And just to add one more thing, we would also need to consider the case of someone opting into this mode as well as using RegexOptions.IgnoreCase

https://unicode.org/reports/tr18/#Default_Loose_Matches recommends semantics for that.

Their suggestion of default case folding means that matching is locale-insensitive: /<script>/ui matches "<SCRIPT>" but not "<SCRİPT>" (note the dotted I) regardless of whether there's a current locale that suggests additional Turkish/Azeri case folding.

stephentoub commented 1 year ago

Ah, so in /^(.{100})x$/, you offset from end to look for x but don't jump forward by 100 based on a precomputation of the UTF-16 width of (.{100})?

The $ anchor is a little complicated because it can match either the end or a \n at the end, and the beginning anchor in your example complicates things as that also impacts what we do, but tweaking your example to be .{100}x\z to get at the heart of your question, the implementation will jump to 101 chars from the end, validate that the next 100 aren't \n, and validate that the ending character is x.

mikesamuel commented 1 year ago

The $ anchor is a little complicated

Quite right. Thanks for reading past that. And instead of . I should have said [\s\S] (or [^] if that's a thing in your syntax).

tjpalmer commented 1 year ago

On this topic, what's the right forum for review of proposed semantics for any work in this area? And does it make sense to conform to Unicode TR18 as much as possible? It looks like the ICU regex implementation (which claims TR18 conformance) has some test data, which might be handy.

And yeah, for dotnet, requiring an opt-in flag makes lots of sense, for sure.

Anyway, I'm bringing this up in case someone has the chance to work on this. At my employer, we have some interest in this for the future, though no immediate resources to work on it. But whenever someone does get the chance, it would be nice to have some approximate agreement on appropriate semantics.

stephentoub commented 1 year ago

Presumably this issue. As noted, no one on the .NET team itself is intending to work on this any time soon.

tjpalmer commented 1 year ago

Understood. Thanks for the feedback!

danmoseley commented 1 year ago

@tjpalmer is pinvoke to PCRE feasible? I'm not sure of the level of support for surrogates in PCRE, but their docs suggest some flavor of support.

tjpalmer commented 1 year ago

We can consider that option as well. Thanks!

danmoseley commented 1 year ago

Cool. If you end up doing that, it would be interesting to share back here how it went.

ltrzesniewski commented 1 year ago

Sorry for the ad, but no need to use pinvoke, I made a PCRE2 wrapper: PCRE.NET 🙂

I just tried the code from the OP, and replaced Regex with PcreRegex: it handles surrogate pairs, since regexLetters.IsMatch is true, but regexSurrogates.IsMatch is false.

BurntSushi commented 1 year ago

@danmoseley

It might be interesting to know more about how JS/python/Java (or PCRE and Rust) handle this. Is it user-visible?

I'm the author of the Rust regex engine and I'd be happy to answer questions. I'm not sure my experience will be terribly relevant here because the Rust regex engine doesn't consider UTF-16 at all. It works on arbitrary byte sequences that are conventionally valid UTF-8. When Unicode mode is enabled (the u flag, the default), then it treats the codepoint, not the code unit, as the fundamental atom of matching. Now internally, it still does its search one byte at a time, and achieves the "match one codepoint at a time" property by constructing UTF-8 automata. (Which I imagine are far more complex than what you'd need for UTF-16.) But you might not need automata at all. It might "just" require moving your internal engines from working one code unit at a time to one codepoint at a time. Of course, that is likely very much easier said than done.

Anyway, I'll stop there, but I'd be happy to brainstorm.

BurntSushi commented 1 year ago

As an addendum to my previous comment, it looks like Java might handle these cases directly, despite, I believe, having a very similar string representation. Just as one example, 𝛅 does not match \p{L} in .NET, but it does in Java. I mention this because whatever implementation strategy Java uses is far more likely to be relevant to .NET than anything the Rust regex crate does.
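A quick repro of the .NET side of that comparison (a sketch; 𝛅 is U+1D6C5 MATHEMATICAL BOLD SMALL DELTA, a surrogate pair in UTF-16):

```csharp
using System;
using System.Text.RegularExpressions;

class AstralLetterRepro
{
    static void Main()
    {
        string s = char.ConvertFromUtf32(0x1D6C5); // 𝛅

        // .NET classifies each UTF-16 code unit, so the letter class misses...
        Console.WriteLine(Regex.IsMatch(s, @"^\p{L}$"));     // False

        // ...while two consecutive surrogate code units match instead.
        Console.WriteLine(Regex.IsMatch(s, @"^\p{Cs}{2}$")); // True
    }
}
```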

cabo commented 1 year ago
            var regexSurrogates = new Regex(@"^\p{Cs}{2}$");

Any discussion that assumes \p{Cs} has a meaning does not apply to regexes based on Unicode characters (code points). Those are what languages with a modern text string type, such as Rust, use, so I don't think we can learn much from Rust about the various dotnet legacy string modes here.

gfody commented 1 year ago

This bug was the root cause of a discrepancy in the C# version of a function that strips non-alphanumeric characters, e.g.:

string StripNonAlphaNumeric(string s) => Regex.Replace(s, @"[^\p{L}\p{N}]", "");

For the input เสื้อยืด (the Thai word for "t-shirt"), the C# version strips the sara uees, resulting in เสอยด (Thai for the verb "crowd", according to Google Translate), whereas the other languages (Ruby and JavaScript) correctly strip nothing.

The non-regex workaround is somewhat involved:

string StripNonAlphaNumeric(string s)
{
    var result = new List<char>(s.Length);
    bool last = false, none = true;

    for (int i = 0; i < s.Length; i++)
    {
        if (char.IsLetterOrDigit(s, i) || (last && char.IsSurrogatePair(s, i - 1)))
        {
            result.Add(s[i]);
            last = true;
        }
        else
        {
            last = false;
            none = false;
        }
    }

    return none ? s : new string(result.ToArray());
}
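For completeness, a Rune-based sketch of the same workaround (StripNonAlphaNumericRunes is a hypothetical name): it walks code points instead of code units, so surrogate pairs need no special casing at all.

```csharp
using System.Text;

static class StripHelper
{
    // Hypothetical Rune-based equivalent of the char-based loop above.
    public static string StripNonAlphaNumericRunes(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (Rune r in s.EnumerateRunes())   // one Rune per code point
            if (Rune.IsLetterOrDigit(r))
                sb.Append(r.ToString());
        return sb.ToString();
    }
}
```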