dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.9k stars 4.63k forks source link

SString::AppendUTF8 converts 'this' to UTF16 if it is non-ascii #76581

Open jakobbotsch opened 1 year ago

jakobbotsch commented 1 year ago

I was trying to use some of the UTF8 string handling over in #76505 and noticed that AppendUTF8 in turn calls the Append(SString&) overload. This overload calls End() that converts non-ascii UTF8 back to Unicode. This behavior seems quite odd, I would expect appending an UTF8 string to an UTF8 string to stay in UTF8.

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.

Issue Details
I was trying to use some of the UTF8 string handling over in #76505 and noticed that `AppendUTF8` in turn calls the `Append(SString&)` overload. This overload calls `End()` that converts non-ascii UTF8 back to Unicode. This behavior seems quite odd, I would expect appending an UTF8 string to an UTF8 string to stay in UTF8.
Author: jakobbotsch
Assignees: -
Labels: `area-System.Text.Json`, `untriaged`
Milestone: -
AaronRobinsonMSFT commented 1 year ago

This is by-design. The preferred encoding for SString is UTF-16 and it is sticky if I recall. Starting with UTF8 is what you likely want here. Also, at the end you can always convert the string fully to UTF8 by calling GetUTF8().

Can you provide an example on precisely what you want and how the SString was instantiated?

jkotas commented 1 year ago

The preferred UTF-16 encoding for SString should come into play only when there is encoding mismatch. For example, when one is trying to append UTF-16 string to UTF-8 string.

When I have UTF-8 string and trying to append another UTF-8 string to it, the result should stay as UTF-8. There is no reason for the result to get converted to UTF-16 (what is happening here).

jakobbotsch commented 1 year ago

Can you provide an example on precisely what you want and how the SString was instantiated?

I was trying to switch CEEInfo::appendClassName to UTF8, and my initial straightforward/naive implementations slowed down the JIT by an unexpected amount in checked builds (~25% more time taken to JIT SPC with PMI.dll on x86). When I stepped in I noticed this particular behavior for the UTF8-UTF8 string case, but I was not hitting this, I don't think there is an issue if the UTF8 strings stay within ASCII.

These are the versions I tried:

Version 1 just used TypeString and SString::GetUTF8(). The GetUTF8() here was very hot, 25% of the time in the profile I captured was spent inside that.

For version 2 I tried to avoid TypeString after discussing with @jkotas, since we thought it was related to the fact that it builds the string as UTF16 and thus incurs some unnecessary conversions. As you can see, I just instantiated it with StackSString ss;, maybe I should have used something else here?

For version 3 I avoided SString and used an append lambda function to copy directly into the buffer argument. This should avoid (at least) one copy, so it should be faster.

Calling these three versions of appendClassName in a loop from JIT 100000 times with the System.Runtime.CompilerServices.CastHelpers class as an argument gives the following results on x86: Version 1: 2391 ms Version 2: 1656 ms Version 3: 172 ms

It is quite unexpected to me that the first two variants are so much more expensive. The contracts add some overhead, but it also looks like the AppendUTF8 path ends up creating quite a few temporary copies.

AaronRobinsonMSFT commented 1 year ago

The preferred UTF-16 encoding for SString should come into play only when there is encoding mismatch. For example, when one is trying to append UTF-16 string to UTF-8 string.

I agree with this. If I recall the issue is that SString, by default, starts out as "Unicode". By this, I mean if the default constructor is used it will be in "Unicode" mode and when calling an AppendUTF8, it is appending to "Unicode" and therefore by design. If the constructor for specifying UTF8 is used, then the AppendUTF8 should behave as described. This is what I recall seeing. The pattern I was thinking of is SString str{SSttring::Utf8, ""};

It is quite unexpected to me that the first two variants are so much more expensive. The contracts add some overhead, but it also looks like the AppendUTF8 path ends up creating quite a few temporary copies.

On this we completely agree. I very much dislike the SString as is and it is one of my longer term goals to slowly improve the API so it is more performant and intuitive.

jkotas commented 1 year ago

I agree with this. If I recall the issue is that SString, by default, starts out as "Unicode".

SString starts as REPRESENTATION_EMPTY without specific encoding.

jakobbotsch commented 1 year ago

If the constructor for specifying UTF8 is used, then the AppendUTF8 should behave as described.

Well, this is what I was reporting -- it doesn't seem to. It calls Append(SString&) and that function calls End() which converts multi-byte 'this' into UTF16. For example:

SString str(SString::Utf8, "Planlægnings");
str.AppendUTF8("periode"); // 'str' is converted to UTF16 inside this call, before starting to append "periode"
AaronRobinsonMSFT commented 1 year ago

@jakobbotsch I see the issue now. This is clearly unintuitive and I agree with your expectations here. Based on how this was implemented I get the why, but it shouldn't be this way. I'll add it to my list of things to fix up.

mangod9 commented 1 year ago

@AaronRobinsonMSFT can this be closed based on #76571, or is there more work required here?

AaronRobinsonMSFT commented 1 year ago

I need to get back to this. I don't believe it is fixed yet.