Open jakobbotsch opened 2 years ago
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis See info in area-owners.md if you want to be subscribed.
Author: | jakobbotsch |
---|---|
Assignees: | - |
Labels: | `area-System.Text.Json`, `untriaged` |
Milestone: | - |
This is by-design. The preferred encoding for SString
is UTF-16 and it is sticky if I recall. Starting with UTF8 is what you likely want here. Also, at the end you can always convert the string fully to UTF8 by calling GetUTF8()
.
Can you provide an example on precisely what you want and how the SString
was instantiated?
The preferred UTF-16 encoding for SString should come into play only when there is encoding mismatch. For example, when one is trying to append UTF-16 string to UTF-8 string.
When I have UTF-8 string and trying to append another UTF-8 string to it, the result should stay as UTF-8. There is no reason for the result to get converted to UTF-16 (what is happening here).
Can you provide an example on precisely what you want and how the
SString
was instantiated?
I was trying to switch CEEInfo::appendClassName
to UTF8, and my initial straightforward/naive implementations slowed down the JIT by an unexpected amount in checked builds (~25% more time taken to JIT SPC with PMI.dll on x86). When I stepped in I noticed this particular behavior for the UTF8-UTF8 string case, but I was not hitting this, I don't think there is an issue if the UTF8 strings stay within ASCII.
These are the versions I tried:
Version 1 just used TypeString
and SString::GetUTF8()
. The GetUTF8()
here was very hot, 25% of the time in the profile I captured was spent inside that.
For version 2 I tried to avoid TypeString
after discussing with @jkotas, since we thought it was related to the fact that it builds the string as UTF16 and thus incurs some unnecessary conversions. As you can see, I just instantiated it with StackSString ss;
, maybe I should have used something else here?
For version 3 I avoided SString
and used an append lambda function to copy directly into the buffer argument. This should avoid (at least) one copy, so it should be faster.
Calling these three versions of appendClassName
in a loop from JIT 100000 times with the System.Runtime.CompilerServices.CastHelpers
class as an argument gives the following results on x86:
Version 1: 2391 ms
Version 2: 1656 ms
Version 3: 172 ms
It is quite unexpected to me that the first two variants are so much more expensive. The contracts add some overhead, but it also looks like the AppendUTF8
path ends up creating quite a few temporary copies.
The preferred UTF-16 encoding for SString should come into play only when there is encoding mismatch. For example, when one is trying to append UTF-16 string to UTF-8 string.
I agree with this. If I recall the issue is that SString
, by default, starts out as "Unicode". By this, I mean if the default constructor is used it will be in "Unicode" mode and when calling an AppendUTF8
, it is appending to "Unicode" and therefore by design. If the constructor for specifying UTF8 is used, then the AppendUTF8
should behave as described. This is what I recall seeing. The pattern I was thinking of is SString str{SSttring::Utf8, ""};
It is quite unexpected to me that the first two variants are so much more expensive. The contracts add some overhead, but it also looks like the AppendUTF8 path ends up creating quite a few temporary copies.
On this we completely agree. I very much dislike the SString
as is and it is one of my longer term goals to slowly improve the API so it is more performant and intuitive.
I agree with this. If I recall the issue is that SString, by default, starts out as "Unicode".
SString starts as REPRESENTATION_EMPTY
without specific encoding.
If the constructor for specifying UTF8 is used, then the
AppendUTF8
should behave as described.
Well, this is what I was reporting -- it doesn't seem to. It calls Append(SString&)
and that function calls End()
which converts multi-byte 'this' into UTF16. For example:
SString str(SString::Utf8, "Planlægnings");
str.AppendUTF8("periode"); // 'str' is converted to UTF16 inside this call, before starting to append "periode"
@jakobbotsch I see the issue now. This is clearly unintuitive and I agree with your expectations here. Based on how this was implemented I get the why, but it shouldn't be this way. I'll add it to my list of things to fix up.
@AaronRobinsonMSFT can this be closed based on #76571, or is there more work required here?
I need to get back to this. I don't believe it is fixed yet.
I was trying to use some of the UTF8 string handling over in #76505 and noticed that
AppendUTF8
in turn calls theAppend(SString&)
overload. This overload callsEnd()
that converts non-ascii UTF8 back to Unicode. This behavior seems quite odd, I would expect appending an UTF8 string to an UTF8 string to stay in UTF8.