Open jjdseattle opened 6 years ago
cc @krwq, @tarekgh, @joperezr, @Anipik
Here are the existing properties on StringComparer: https://github.com/dotnet/corefx/blob/fcb0efcd92035c22c93bd7442adbb5a5a99cdffb/src/System.Runtime.Extensions/ref/System.Runtime.Extensions.cs#L947-L952
@grabyourpitchforks
Just wondering why we need the 32-bit code points comparer too? Is it commonly used? I think Utf8 is a higher priority here.
Also, StringComparer has three methods (Equals, Compare, and GetHashCode) that take string parameters, and I don't think those make sense for Utf8 (or Utf32). We need to think about that.
For well-formed UTF8, ordinal collation based on UTF8 code-units produces identical results to ordinal collation based on the represented 32-bit code-points. (This property is one of the design features of the UTF8 encoding.) So I'm asking for only one string-comparer. Its name can be chosen to suggest that it performs UTF8-based comparison, or that it does comparison based on 32-bit code points. The behavior in 8-bit space and 32-bit space should be identical.
As noted in the original request, from a collation point of view, one interesting question is what to do in the presence of a lonely surrogate in a System.String. If we consider the lonely surrogate as representing a code-point in the range of U+D800 to U+DFFF (rather than treating it as being replaced with the Unicode replacement character U+FFFD), then Equals and GetHashCode could share behavior with StringComparer.Ordinal.Equals and StringComparer.Ordinal.GetHashCode. Only StringComparer.Compare() would differ. (However, note that three-byte representations of U+D800 through U+DFFF are not permitted in well-formed UTF8; this fact might be reason enough to name this comparer to suggest that it operates in the space of 32-bit code-points rather than 8-bit UTF8 code-units.)
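The two lonely-surrogate options can be illustrated outside .NET. The following is a minimal sketch in Python, used here only as a stand-in because its codecs expose both behaviors; the variable names are hypothetical and nothing in it is part of the proposed API.

```python
# Illustration only (Python, not .NET): two ways to treat a lonely surrogate.
lonely = "\ud800"  # a lone high surrogate, not a valid Unicode scalar value

# Strict UTF-8 refuses it: U+D800..U+DFFF have no well-formed UTF-8 form.
try:
    lonely.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Option 1: treat it as code point U+D800 (three-byte form, ill-formed UTF-8).
as_code_point = lonely.encode("utf-8", errors="surrogatepass")

# Option 2: substitute U+FFFD before comparing.
as_replacement = "\ufffd".encode("utf-8")

e000 = "\ue000".encode("utf-8")  # EE 80 80

print(strict_ok)                   # False
print(as_code_point.hex())         # eda080
print(as_replacement.hex())        # efbfbd
print(as_code_point < e000)        # True: option 1 sorts before U+E000
print(as_replacement < e000)       # False: option 2 sorts after U+E000
```

Under option 2, every lonely surrogate collapses to the same U+FFFD, so two strings that StringComparer.Ordinal considers different would compare as equal. That is why sharing Equals and GetHashCode with the ordinal comparer only works under option 1.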
If we expose a comparer, we have to support non-ordinal operations too. That means UTF32 comparisons will not behave the same as UTF8 comparisons.
Just to be clear, when you say ordinal comparison by code points, you're asking for something like this?
UTF-16 string [ 1234 E000 D800 DFFF ]
-- the string above should be sorted before the string below --
UTF-16 string [ 1234 D800 DFFF ]
Right.
I proposed a StringComparer whose results would be identical to a comparison based on 32-bit (UCS4) code points. In 32-bit code points, we see that
U+01234 U+0E000 U+103FF
sorts before
U+01234 U+103FF
Converting UCS4 code-points to UTF8 code-units happens to be order preserving. So one could also think of this comparison as occurring in UTF8 space:
E1 88 B4 EE 80 80 F0 90 8F BF
sorts before
E1 88 B4 F0 90 8F BF
(I think that producing results identical to a comparison in UTF8 code-unit space is the more compelling motivation for the additional comparer. But UTF8 code-unit space and UCS4 code-point space produce identical results.)
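The orderings in this example can be checked mechanically. Below is a minimal sketch in Python, used only for illustration (it is not .NET code). It assumes that lexicographic comparison of UTF-16 code units stands in for StringComparer.Ordinal, and relies on Python comparing `str` values by code point; `utf16_units` is a hypothetical helper.

```python
# Illustration only: Python stands in for the three comparison spaces.
a = "\u1234\ue000\U000103ff"   # UTF-16: 1234 E000 D800 DFFF
b = "\u1234\U000103ff"         # UTF-16: 1234 D800 DFFF

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

# Code-point order (Python compares str by code point): a sorts before b.
code_point_order = a < b
# UTF-8 code-unit order (bytes compare lexicographically): a sorts before b.
utf8_order = a.encode("utf-8") < b.encode("utf-8")
# UTF-16 code-unit order: E000 > D800, so a sorts AFTER b.
utf16_order = utf16_units(a) < utf16_units(b)

print(code_point_order, utf8_order, utf16_order)  # True True False
```

The code-point and UTF-8 results agree, and the UTF-16 code-unit result is the odd one out, which is the discrepancy the requested comparer is meant to remove.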
Collation is normally performed in a linguistic-aware fashion. Existing linguistic-aware comparers should already handle this case correctly by decoding multi-code unit sequences for both UTF-8 and UTF-16.
Most people use linguistic-unaware collation as a glorified memcmp operation, which isn't quite meaningful when comparing two texts with different encodings. What scenario do you have in mind when performing such a collation against mixed UTF-16 and UTF-8 data?
I’m concerned with building systems that must collate in a manner compatible with non-CLR code that is, in turn, implemented on top of linguistic-unaware UTF8 comparison.
Admittedly, as CoreFx support for UTF8 encoding grows, the need for such a compatible collation of UTF16-encoded strings is reduced. Still, it would be valuable to be able to do a UTF8-compatible ordinal collation without the performance (including memory) overhead of actually converting to UTF8 or UCS4 in order to do that collation.
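One way such a conversion-free comparison could work is a surrogate "fix-up" applied only at the first differing UTF-16 code unit. The following is a minimal sketch in Python, for illustration only and not a proposed implementation. It assumes well-formed UTF-16 input, and `fixup`, `compare_code_point_order`, and `utf16_units` are hypothetical names.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (test helper)."""
    raw = s.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def fixup(unit):
    """Remap a UTF-16 code unit so surrogates sort above the rest of the BMP."""
    if unit >= 0xE000:
        return unit - 0x800      # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000     # D800..DFFF -> F800..FFFF
    return unit

def compare_code_point_order(x, y):
    """Compare two sequences of UTF-16 code units in code-point order.

    Returns a negative, zero, or positive int. Assumes well-formed UTF-16;
    no UTF-8 or UCS4 buffer is allocated.
    """
    for ux, uy in zip(x, y):
        if ux != uy:
            return fixup(ux) - fixup(uy)
    return len(x) - len(y)

a = utf16_units("\u1234\ue000\U000103ff")   # 1234 E000 D800 DFFF
b = utf16_units("\u1234\U000103ff")         # 1234 D800 DFFF
print(compare_code_point_order(a, b) < 0)   # True: a sorts before b
```

Note that this sketch makes a lonely surrogate sort above every other BMP code unit rather than between U+D7FF and U+E000, so it does not settle the lonely-surrogate question; that behavior would still need to be specified.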
It is increasingly common for .NET programmers to write code that interops with systems dealing with text encoded in UTF8. It is non-obvious from its name alone that StringComparer.Ordinal can produce different results from an ordinal comparison based on either 8-bit UTF8 code-units or 32-bit code-points. For well-formed UTF8, comparison based on UTF8 code-units matches comparison based on 32-bit code-points. But if a System.String contains a surrogate pair encoding a single code-point, then StringComparer.Ordinal can produce different results.
Please add a new static property to StringComparer for a comparer that does ordinal comparison based on either 32-bit code-points or 8-bit UTF8 code-units. Possibilities for its name are StringComparer.Ordinal32, StringComparer.OrdinalUtf8, StringComparer.Utf8Ordinal, and many others.
(One interesting spec detail is deciding the appropriate behavior when an input System.String contains a "lonely surrogate" and is therefore ill-formed UTF16.)