dotnet / runtime

Add a StringComparer static property for an ordinal comparer based on 32-bit code-points or 8-bit UTF8 code-units #26999

Open jjdseattle opened 6 years ago

jjdseattle commented 6 years ago

It is increasingly common for .NET programmers to write code that interops with systems dealing with text encoded in UTF8. It is non-obvious from its name alone that StringComparer.Ordinal can produce different results from an ordinal comparison based on either 8-bit UTF8 code-units or 32-bit code-points. For well-formed UTF8, comparison based on UTF8 code-units matches comparison based on 32-bit code-points. But if a System.String contains a surrogate pair encoding a single code-point, then StringComparer.Ordinal, which compares 16-bit UTF16 code-units, can produce different results.
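As a minimal illustration of that difference (a sketch only; the class and member names are hypothetical and chosen for this example), compare U+E000, which is a single UTF16 code-unit, with U+10000, which is stored as the surrogate pair D800 DC00:

```csharp
using System;

public static class OrdinalOrderDemo
{
    // U+E000 occupies one UTF-16 code unit (0xE000).
    public const string PrivateUse = "\uE000";

    // U+10000 is stored as the surrogate pair D800 DC00.
    public const string Supplementary = "\U00010000";

    // Sign of the existing ordinal comparison, which works on UTF-16 code units.
    // Yields 1 here, because 0xE000 > 0xD800.
    public static int CodeUnitOrder() =>
        Math.Sign(StringComparer.Ordinal.Compare(PrivateUse, Supplementary));

    // Sign of a comparison of the decoded code points.
    // Yields -1 here, because 0xE000 < 0x10000.
    public static int CodePointOrder() =>
        Math.Sign(char.ConvertToUtf32(PrivateUse, 0)
            .CompareTo(char.ConvertToUtf32(Supplementary, 0)));
}
```

The two methods disagree in sign for this pair, which is exactly the case the request is about.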

Please add a new static property to StringComparer for a comparer that does ordinal comparison based on either 32-bit code-points or 8-bit UTF8 code-units. Possibilities for its name are StringComparer.Ordinal32, StringComparer.OrdinalUtf8, StringComparer.Utf8Ordinal, and many others.

(One interesting spec detail is in deciding the appropriate behavior when an input System.String contains a "lonely surrogate" and is therefore ill-formed UTF16.)

ahsonkhan commented 6 years ago

cc @krwq, @tarekgh, @joperezr, @Anipik

ahsonkhan commented 6 years ago

Here are the existing properties on StringComparer: https://github.com/dotnet/corefx/blob/fcb0efcd92035c22c93bd7442adbb5a5a99cdffb/src/System.Runtime.Extensions/ref/System.Runtime.Extensions.cs#L947-L952

danmoseley commented 6 years ago

@grabyourpitchforks

tarekgh commented 6 years ago

Just wondering why we need the 32-bit code points comparer too? Is it commonly used? I think Utf8 is a higher priority here.

Also, StringComparer has 3 methods (Equals, Compare, and GetHashCode) that take string parameters, and I don't think it makes sense to have these for Utf8 (or Utf32). We need to think about that.

jjdseattle commented 6 years ago

For well-formed UTF8, ordinal collation based on UTF8 code-units produces identical results to ordinal collation based on the represented 32-bit code-points. (This property is one of the design features of UTF8 encoding.) So I'm asking for only one string-comparer. Its name can be chosen to suggest that it performs UTF8-based comparison, or that it does comparison based on 32-bit code points. The behavior in 8-bit space and 32-bit space should be identical.

As noted in the original request, from a collation point of view, one interesting question is what to do in the presence of a lonely surrogate in a System.String. If we consider the lonely surrogate as representing a code-point in the range of U+D800 to U+DFFF (rather than treating it as being replaced with the Unicode replacement character U+FFFD), then Equals and GetHashCode could share behavior with StringComparer.Ordinal.Equals and StringComparer.Ordinal.GetHashCode. Only StringComparer.Compare() would differ. (However, note that three-byte representations of U+D800 through U+DFFF are not permitted in well-formed UTF8; this fact might be reason enough to name this comparer to suggest that it operates in the space of 32-bit code-points rather than 8-bit UTF8 code-units.)

tarekgh commented 6 years ago

> For well-formed UTF8, ordinal collation based on UTF8 code-units produces identical results to ordinal collation based on the represented 32-bit code-points. (This property is one of the design features of UTF8 encoding.) So I'm asking for only one string-comparer. Its name can be chosen to suggest that it performs UTF8-based comparison, or that it does comparison based on 32-bit code points. The behavior in 8-bit space and 32-bit space should be identical.

If we expose a comparer, we have to support non-ordinal operations too. That means UTF32 comparisons will not work the same way as UTF8 comparisons.

GrabYourPitchforks commented 4 years ago

Just to be clear, when you say ordinal comparison by code points, you're asking for something like this?

UTF-16 string [ 1234 E000 D800 DFFF ]
-- the string above should be sorted before the string below --
UTF-16 string [ 1234 D800 DFFF ]

jjdseattle commented 4 years ago

Right.

I proposed a StringComparer with results that would be identical to a comparison based on 32-bit (UCS4) code points. In 32-bit code points, we see that

U+01234 U+0E000 U+103FF

sorts before

U+01234 U+103FF

.

Converting UCS4 code-points to UTF8 code-units happens to be order preserving. So one could also think of this comparison as occurring in UTF8 space:

E1 88 B4 EE 80 80 F0 90 8F BF

sorts before

E1 88 B4 F0 90 8F BF

.

(I think that producing results identical to a comparison in UTF8 code-unit space is the more compelling motivation for the additional comparer. But UTF8 code-unit space and UCS4 code-point space produce identical results.)
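The example above can be checked with a short sketch. This is an illustration under assumptions, not a proposed API: the class and member names are hypothetical, and it simply encodes both strings with Encoding.UTF8 and compares the resulting bytes as unsigned values.

```csharp
using System;
using System.Text;

public static class Utf8OrderCheck
{
    // The two strings from the example: U+1234 U+E000 U+103FF and U+1234 U+103FF.
    public const string First = "\u1234\uE000\uD800\uDFFF";
    public const string Second = "\u1234\uD800\uDFFF";

    // Unsigned, memcmp-style comparison of two byte arrays.
    // When one array is a prefix of the other, the shorter one sorts first.
    public static int CompareBytes(byte[] x, byte[] y)
    {
        int n = Math.Min(x.Length, y.Length);
        for (int i = 0; i < n; i++)
        {
            if (x[i] != y[i]) return x[i] < y[i] ? -1 : 1;
        }
        return x.Length.CompareTo(y.Length);
    }

    // Encodes both strings as UTF-8 (allocating two arrays) and compares the bytes.
    public static int CompareAsUtf8(string a, string b) =>
        CompareBytes(Encoding.UTF8.GetBytes(a), Encoding.UTF8.GetBytes(b));
}
```

For these two strings, CompareAsUtf8(First, Second) is negative (EE sorts before F0), while StringComparer.Ordinal.Compare(First, Second) is positive (E000 sorts after D800).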

GrabYourPitchforks commented 4 years ago

Collation is normally performed in a linguistic-aware fashion. Existing linguistic-aware comparers should already handle this case correctly by decoding multi-code unit sequences for both UTF-8 and UTF-16.

Most people use linguistic-unaware collation as a glorified memcmp operation, which isn't quite meaningful when comparing two texts with different encoding. What scenario do you have in mind when performing such a collation against mixed UTF-16 and UTF-8 data?

jjdseattle commented 4 years ago

I’m concerned with building systems that must collate in a manner compatible with non-CLR code that in turn is implemented based on linguistic-unaware UTF8 comparison.

Admittedly, as CoreFx support for UTF8 encoding grows, the need for such a compatible collation of UTF16-encoded strings is reduced. Still, it would be valuable to be able to do a UTF8-compatible ordinal collation without the performance (including memory) overhead of actually converting to UTF8 or UCS4 in order to do that collation.
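One way such an allocation-free comparison could look is sketched below. This is only an illustration, not an existing or proposed .NET API: the names CodePointOrdinal and Key are hypothetical. It relies on a well-known fix-up that remaps the first differing UTF16 code-unit so that surrogates sort above U+E000..U+FFFF. One assumption to note: in this sketch a lonely surrogate sorts above every non-surrogate BMP code-unit, which differs from treating it as a code-point in U+D800..U+DFFF.

```csharp
using System;

public static class CodePointOrdinal
{
    // Maps a UTF-16 code unit to a key whose numeric order matches
    // code-point order for well-formed strings.
    private static int Key(char c)
    {
        if (c < 0xD800) return c;          // below the surrogate range: unchanged
        if (c < 0xE000) return c + 0x2000; // surrogates D800..DFFF -> F800..FFFF
        return c - 0x0800;                 // E000..FFFF -> D800..F7FF
    }

    // Compares two strings without converting them to UTF-8 or UCS4.
    // Only the first differing code unit needs the remapping.
    public static int Compare(string a, string b)
    {
        int n = Math.Min(a.Length, b.Length);
        for (int i = 0; i < n; i++)
        {
            if (a[i] != b[i])
                return Key(a[i]).CompareTo(Key(b[i]));
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return a.Length.CompareTo(b.Length);
    }
}
```

Because Compare matches the Comparison<string> delegate shape, it can be passed directly to Array.Sort, for example `Array.Sort(keys, CodePointOrdinal.Compare)`.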
