MaceWindu opened 8 years ago
I suspect this issue exists in desktop and isn't unique to core
@Priya91 Unicode categories look fine here, so the issue appears to be with Regex.
@joshfree, you are right. I didn't dig into the source, but I suspect that the code responsible for this uses GetUnicodeCategory(char) instead of GetUnicodeCategory(string, charIndex)
I'll take a stab at this as part of the hackathon
@peltco I'll unassign you unless you still have an interest 2 years later 😄
Regex doesn't take surrogate pairs into consideration. RegexCharClass operates on char:
public static bool CharInClass(char ch, string set, ref int[]? asciiResultCache)
Maybe this is where Rune comes in?
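To see why operating on char breaks down, here's the same per-code-unit vs. per-code-point distinction demonstrated in JavaScript (a minimal sketch; any astral character such as U+1F600 shows it):

```javascript
// U+1F600 is stored as the surrogate pair \uD83D\uDE00 in UTF-16.
const s = '\u{1F600}';
console.log(s.length);        // 2: two UTF-16 code units
console.log(/^..$/.test(s));  // true: each dot matches one code unit
console.log(/^.$/u.test(s));  // true: with the u flag, dot matches the code point
console.log(/^.$/.test(s));   // false: without u, one dot sees only half the pair
```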
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of our issue cleanup automation.
This absolutely is a huge weakness in dotnet regex. An issue that I don't have in JS, Python, or Java, for example.
This absolutely is a huge weakness in dotnet regex.
Can you elaborate a bit more on that? We haven't seen many requests for adding surrogate support in Regex other than this issue, which got closed as not planned given that we haven't seen a lot of people trying to perform searches on text that has surrogates in it. It would be good to understand your scenario (more specifically, why you are searching for surrogates in text) in order to be able to re-evaluate our focus.
I'm working on a system (not public right now) where we expect to promote regex for text parsing, including of natural text, and lacking proper unicode support inhibits compatibility across different regex engines.
Current workaround plan is either (1) skip compatibility or (2) expand to giant charsets at runtime based on unicode tables.
I see, thanks for sharing. Our implementation of Regex does assume we are operating over single characters in a lot of places, and no one has really researched on what it would take to add support for surrogates (but I do expect it would likely be a lot of work). In order to move forward, we would need someone to design it and to ensure that we can do it in a way that it won't regress any existing case (as we only really expect surrogates to be less than the 1% case). For anyone interested in coming up with this proposal, I would be happy to review it, but this is not one of our current priorities.
Makes sense. Thanks for the feedback. Once we get to that point on our project, we can discuss option (3) contribute surrogate pair support to dotnet. I'm not sure if this is likely, but if we think it seems like a good idea, I'll get back in touch here. Feel free to do with this issue as you see fit meanwhile.
Even ignoring natural language text, regular expressions are often used for splitting source code into tokens. Those tokens include identifiers, and there are extant human languages with letter and number code-points outside the basic plane.
For example, https://en.wikipedia.org/wiki/Miao_(Unicode_block)
It might be interesting to know more about how JS/Python/Java (or PCRE and Rust) handle this. Is it user-visible? Our engine has a pretty firm assumption of 2-byte characters. I know PCRE has a 16-bit character mode and an 8-bit character mode. What does it look like to support varying-width characters?
For Python, I should clarify that it emphasizes 32-bit code points for all strings, including for regex needs, such as . meaning a full code point. But its re module doesn't have Unicode property names built in. For that, it recommends the third-party regex module. And I haven't looked at implementation details much for the engines in question.
In JavaScript you have a u flag that enables Unicode mode (/foo/u, or new RegExp('foo', 'u')). It enables \u{12345} code points, it makes . match surrogate pairs (and it even matches unpaired low/high surrogates), and it enables \p{characterclass} and \P{characterclass}. Note that, for example, /[^x]/u will match any surrogate pair (plus any character that isn't an x).
Besides enabling extra RegExp syntax as @xanatos notes, the JS u flag affects how the matcher sees string chunks.
https://tc39.es/ecma262/#sec-regexpbuiltinexec
22.2.7.2 RegExpBuiltinExec ( R, S )
...
- If flags contains "u", let fullUnicode be true; else let fullUnicode be false. ...
- If fullUnicode is true, let input be StringToCodePoints(S). Otherwise, let input be a List whose elements are the code units that are the elements of S.
That view, computing the code-points ahead of matching, is a useful abstraction.
In practice, if your underlying representation of a string is a list of bytes known to be valid UTF-8, you can iterate forward and backwards over code-points.
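That ahead-of-time view is easy to see in JavaScript, whose string iterator already walks code points rather than code units; a small illustration (the sample string is arbitrary):

```javascript
const s = 'a\u{1F600}b';  // 'a', U+1F600 (a surrogate pair), 'b'
// Code-unit view: what a per-char matcher sees.
const units = Array.from({ length: s.length }, (_, i) => s.charCodeAt(i));
// Code-point view: what StringToCodePoints produces.
const points = Array.from(s, c => c.codePointAt(0));
console.log(units.length);   // 4 code units
console.log(points.length);  // 3 code points
```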
I see there's a RightToLeft mode which, iiuc, has semantic significance for which group in /(a+)(a+)/someflags captures what when the input is not length 1, so you need to be able to iterate in reverse.
Is there anything in the current implementation that necessitates random access to characters in the input?
if your underlying representation of a string is a list of bytes known to be valid UTF-8
The representation is a UTF-16 ReadOnlySpan<char>. The implementation today looks at individual char (16-bit) values.
Is there anything in the current implementation that necessitates random access to characters in the input?
Backtracking requires the ability to randomly access data in the input, but only to locations previously seen. There are also optimizations that utilize random access to jump forwards, such as jumping to some offset from the end of the input in some cases involving end anchors. Etc.
That view, computing the code-points ahead of matching, is a useful abstraction.
If we were to do anything here, we'd want to compute as much as possible as part of the construction/codegen phase, e.g. as part of the Regex ctor / parsing / tree optimization and/or as part of the source generator. It would also very likely need to be opt-in, such as with a new RegexOptions flag, both for compatibility and for performance. We already use RegexOptions to impact how parsing / tree optimization is performed, e.g. RegexOptions.IgnoreCase causes individual characters in the pattern to be translated at construction/parsing time into the corresponding case-insensitive sets. We'd likely do some amount of similar transformation as part of constructing the tree. We'd also probably need a new representation for known categories like \w, which would then allow matching and code generation to continue current optimizations for such sets and make the more expensive Unicode matching pay-for-play.
And just to add one more thing, we would also need to consider the case of someone opting into this mode as well as using RegexOptions.IgnoreCase, since IIRC there are various cases for upper or lower case mappings between surrogates and 16-bit unicode characters, so those would need to get added to our logic for case-insensitive comparisons.
Note that considering the string as codepoints makes a big difference. Example in Javascript:
const rx1 = /\uDE00..$/;
const rx2 = /\uDE00..$/u;
console.log(rx1.test('\uD83D\uDE00xx')); // true
console.log(rx2.test('\uD83D\uDE00xx')); // false: \uD83D\uDE00 is a single codepoint, it can't be partially matched by the RegExp
The representation is a UTF-16 ReadOnlySpan<char>. The implementation today looks at individual char (16-bit) values.
Ah, right. In that case, reverse iteration over code-points is pretty straightforward.
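A sketch of that reverse step over UTF-16, in JavaScript for brevity (prevCodePoint is a made-up helper for illustration, not an API of any engine discussed here):

```javascript
// Decode the code point that ends at (exclusive) index `end` of a UTF-16
// string, returning [codePoint, startIndex] so a matcher can keep stepping back.
function prevCodePoint(s, end) {
  const lo = s.charCodeAt(end - 1);
  if (lo >= 0xDC00 && lo <= 0xDFFF && end >= 2) {
    const hi = s.charCodeAt(end - 2);
    if (hi >= 0xD800 && hi <= 0xDBFF) {
      return [(hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000, end - 2];
    }
  }
  // Lone surrogates fall through and are treated as single units.
  return [lo, end - 1];
}
```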
Backtracking requires the ability to randomly access data in the input, but only to locations previously seen.
Ah. An int known to be at a code-point boundary should be just as good then.
There are also optimizations that utilize random access to jump forwards, such as jumping to some offset from the end of the input in some cases involving end anchors. Etc.
Ah, so in /^(.{100})x$/, you offset from the end to look for x but don't jump forward by 100 based on a precomputation of the UTF-16 width of (.{100})?
We'd also probably need a new representation for known categories like \w
I think https://unicode.org/reports/tr18/#Compatibility_Properties has recommendations on what goes in \w.
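For reference, TR18's recommended \w can be written out with property escapes; here's a hedged sketch using JavaScript's u-mode syntax (the property set is TR18's recommendation, not anything .NET currently does):

```javascript
// TR18 "word" = Alphabetic + marks + decimal digits + connector punctuation
// + Join_Control (ZWJ/ZWNJ).
const word = /^[\p{Alphabetic}\p{M}\p{Nd}\p{Pc}\p{Join_Control}]+$/u;
console.log(word.test('caf\u00E9'));   // true
console.log(word.test('\u{1D6C5}'));   // true: an astral letter counts
console.log(word.test('a b'));         // false: space is not a word character
```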
iirc, the ICU4J people did a lot of experimentation with representations of Unicode sets for UTF-16 strings. I think their representation of UnicodeSets that didn't fit in a boolean[256] was a sorted set of ints that they could binary-search into. So /[\u0000-\uD7FF\uE000-\uFFFF]/ would be represented as [0, 0xD800, 0xE000, 0x10000], which means [0, 0xD800) ∪ [0xE000, 0x10000).
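That boundary representation makes membership a single binary search over code points; a minimal sketch of the idea (inSet is illustrative, not ICU4J's actual API):

```javascript
// `boundaries` is a sorted list of half-open range starts/ends:
// [0, 0xD800, 0xE000, 0x10000] encodes [0, 0xD800) ∪ [0xE000, 0x10000).
function inSet(boundaries, cp) {
  let lo = 0, hi = boundaries.length;
  while (lo < hi) {             // count boundaries <= cp via binary search
    const mid = (lo + hi) >> 1;
    if (boundaries[mid] <= cp) lo = mid + 1;
    else hi = mid;
  }
  return (lo & 1) === 1;        // inside iff an odd number of boundaries passed
}
```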
And just to add one more thing, we would also need to consider the case of someone opting into this mode as well as using RegexOptions.IgnoreCase
https://unicode.org/reports/tr18/#Default_Loose_Matches recommends semantics for that.
Their suggestion of default case folding means that matching is locale-insensitive: /<script>/ui matches "<SCRIPT>" but not "<SCRİPT>" (note the dotted I), regardless of whether there's a current locale that suggests additional Turkish/Azeri case folding.
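JavaScript's u-mode already behaves this way, which makes the recommendation easy to demonstrate (U+0130 has no simple case folding to i, so it stays unmatched):

```javascript
const rx = /<script>/iu;
console.log(rx.test('<SCRIPT>'));       // true: simple case folding maps I to i
console.log(rx.test('<SCR\u0130PT>'));  // false: dotted İ does not fold to i
```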
Ah, so in /^(.{100})x$/, you offset from end to look for x but don't jump forward by 100 based on a precomputation of the UTF-16 width of (.{100})?
The $ anchor is a little complicated because it can match either the end or a \n at the end, and the beginning anchor in your example complicates things as that also impacts what we do, but tweaking your example to be .{100}x\z to get at the heart of your question, the implementation will jump to 101 chars from the end, validate that the next 100 aren't \n, and validate that the ending character is x.
The $ anchor is a little complicated
Quite right. Thanks for reading past that. And instead of . I should have said [\s\S] (or [^] if that's a thing in your syntax).
On this topic, what's the right forum for reviewing proposed semantics for any work in this area? And does it make sense to conform to Unicode TR18 as much as possible? It looks like the ICU regex implementation (which claims TR18 conformance) has some test data, which might be handy.
And yeah, for dotnet, requiring an opt-in flag makes lots of sense, for sure.
Anyway, bringing this up in case someone has the chance to work on this. At my employer, we have some interest in this in the future though not immediate resources to work it. But at whatever time someone gets the chance to work this, it might be nice to have some approximate agreement on appropriate semantics.
Presumably this issue. As noted, no one on the .NET team itself is intending to work on this any time soon.
Understood. Thanks for the feedback!
@tjpalmer is pinvoke to PCRE feasible? I'm not sure of the level of support for surrogates in PCRE but their docs suggest a flavor of support.
We can consider that option as well. Thanks!
Cool. If you end up doing that, it would be interesting to share back here how it went.
Sorry for the ad, but no need to use pinvoke, I made a PCRE2 wrapper: PCRE.NET 🙂
I just tried the code from the OP and replaced Regex with PcreRegex: it handles surrogate pairs, since regexLetters.IsMatch is true, but regexSurrogates.IsMatch is false.
@danmoseley
It might be interesting to know more about how JS/python/Java (or PCRE and Rust) handle this. Is it user-visible?
I'm the author of the Rust regex engine and I'd be happy to answer questions. I'm not sure my experience will be terribly relevant here because the Rust regex engine doesn't consider UTF-16 at all. It works on arbitrary byte sequences that are conventionally valid UTF-8. When Unicode mode is enabled (the u flag, the default), then it treats the codepoint, not the code unit, as the fundamental atom of matching. Now internally, it still does its search one byte at a time, and achieves the "match one codepoint at a time" property by constructing UTF-8 automata. (Which I imagine are far more complex than what you'd need for UTF-16.) But you might not need automata at all. It might "just" require moving your internal engines from working one code unit at a time to one codepoint at a time. Of course, that is likely very much easier said than done.
Anyway, I'll stop there, but I'd be happy to brainstorm.
As an addendum to my previous comment, it looks like Java might handle these cases directly, despite (I believe) having a very similar representation for strings. Just as one example, 𝛅 does not match \p{L} in .NET, but it does in Java. I mention this because whatever implementation strategy Java uses is far more likely to be relevant to .NET than anything the Rust regex crate does.
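JavaScript's u-mode gives the same answer as Java here, for what it's worth:

```javascript
// \p{L} with the u flag classifies the whole code point, not its surrogate halves.
console.log(/^\p{L}$/u.test('\u{1D6C5}')); // true: 𝛅 is a letter
console.log(/^\p{L}$/u.test('\uD835'));    // false: a lone surrogate is category Cs
```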
var regexSurrogates = new Regex(@"^\p{Cs}{2}$");
Any discussion that assumes \p{Cs} has a meaning does not apply to regexps based on Unicode characters. These are used in languages with a modern text string type such as Rust, so I don't think we can learn much from Rust about various dotnet legacy string modes here.
this bug was the root cause of a discrepancy in the C# version of a function that strips non-alphanumeric characters, e.g.:
string StripNonAlphaNumeric(string s) => Regex.Replace(s, @"[^\p{L}\p{N}]", "");
for the input เสื้อยืด (Thai word for "t-shirt") the C# version strips the sara uees, resulting in เสอยด (Thai for the verb "crowd" according to Google Translate), where the other languages (Ruby and JavaScript) correctly strip nothing.
the non-regex workaround is sort of complicated:
string StripNonAlphaNumeric(string s)
{
    var result = new List<char>(s.Length);
    bool last = false, none = true;
    for (int i = 0; i < s.Length; i++)
    {
        // IsLetterOrDigit(string, int) classifies the full code point when s[i]
        // is a high surrogate; the second condition keeps the trailing low
        // surrogate of a pair whose high surrogate was just kept.
        if (char.IsLetterOrDigit(s, i) || (last && char.IsSurrogatePair(s, i - 1)))
        {
            result.Add(s[i]);
            last = true;
        }
        else
        {
            last = false;
            none = false;
        }
    }
    // If nothing was stripped, return the original string to avoid an allocation.
    return none ? s : new string(result.ToArray());
}
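For comparison, the same one-liner in a codepoint-aware engine; a JS sketch (using an astral letter rather than the Thai example, since combining marks raise a separate question about \p{M}):

```javascript
// JS equivalent of the C# one-liner; with the u flag, \p{L} sees whole
// code points, so an astral letter like U+1D6C5 survives intact.
const stripNonAlphaNumeric = s => s.replace(/[^\p{L}\p{N}]/gu, '');
console.log(stripNonAlphaNumeric('\u{1D6C5}x!')); // "𝛅x"
```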
This defect makes regex useless when you need to match a string using Unicode categories and the string could contain surrogate pairs: