GrabYourPitchforks opened 6 years ago
Notes for our initial review are here.
The indexer of `Utf8String` is not consumable in VB:

```csharp
public ref readonly Utf8Char this[int index] => throw null;
```

Please add the following member to solve this problem:

```csharp
[System.Runtime.CompilerServices.SpecialName]
public Utf8Char get_Chars(int index) => throw null;
```

This is what it looks like in VB:

```vb
Public ReadOnly Property Chars(index As Integer) As Utf8Char
```
Has there been any update on this, and what other considerations or design changes have happened since the initial review?
The current implementation in the NuGet package is vastly different from the proposed API.
The NuGet package generally follows the proposal in https://github.com/dotnet/corefxlab/issues/2350, which is where most of the discussion has taken place. It's a bit aggravating that the discussion is split across so many different forums, I know. :(
The next steps on this are to:

- Collect the list of scenarios where the Utf8String type is desired from our partner teams
- Identify the acceptance criteria for the scenario

We may also want to evaluate an alternative that does not introduce the Utf8String type at all. We had a good discussion about it in https://github.com/dotnet/corefxlab/issues/2350 recently.
I noticed https://github.com/dotnet/corefxlab/issues/2350 just got closed. Did the discussion about first-class UTF-8 support move somewhere else?
@ceztko The corefxlab repo was archived, so open issues were closed to support that effort. That thread also got so large that it was difficult to follow. @krwq is working on restructuring the conversation so that we can continue the discussion in a better forum.
> Collect the list of scenarios where the Utf8String type is desired from our partner teams

I'm currently (pre-)processing multi-TB data sets in C#. I have to match and join millions of strings, which are taking up A LOT of memory (100+ GB). Because my machine only has 64 GB of memory, I had to switch to a more efficient string representation: a `byte[]`-backed string type, with hash code, comparisons, etc. It would've saved me days of work if a UTF-8 string representation were a runtime configuration switch.
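For illustration, a minimal sketch of the kind of `byte[]`-backed type described above; the `Utf8Text` name and the FNV-1a hash are my assumptions, not the commenter's actual code:

```csharp
using System;
using System.Text;

// Hypothetical sketch: stores the UTF-8 bytes once, precomputes a hash,
// and compares bytes ordinally, so it can serve as a Dictionary key in
// place of string at roughly half the memory for ASCII-heavy data.
sealed class Utf8Text : IEquatable<Utf8Text>
{
    private readonly byte[] _bytes;
    private readonly int _hash;

    public Utf8Text(string s)
    {
        _bytes = Encoding.UTF8.GetBytes(s);
        _hash = ComputeFnv1a(_bytes);
    }

    public bool Equals(Utf8Text other) =>
        other is not null && _bytes.AsSpan().SequenceEqual(other._bytes);

    public override bool Equals(object obj) => Equals(obj as Utf8Text);
    public override int GetHashCode() => _hash;
    public override string ToString() => Encoding.UTF8.GetString(_bytes);

    // FNV-1a: a simple, fast non-cryptographic hash over the raw bytes.
    private static int ComputeFnv1a(byte[] bytes)
    {
        unchecked
        {
            int hash = (int)2166136261;
            foreach (byte b in bytes)
                hash = (hash ^ b) * 16777619;
            return hash;
        }
    }
}
```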
> Identify the acceptance criteria for the scenario

Main goal: Only use 50% of the memory with equal performance in managed code. I don't care about marshalling performance. Having to change the type in all existing code would be doable, but not ideal.
> We may also want to evaluate an alternative that does not introduce the Utf8String type at all. We had a good discussion about it in dotnet/corefxlab#2350 recently.

Personally, I'd strongly prefer this approach.
> Main goal: Only use 50% of the memory with equal performance in managed code.
Can you define what "equal performance" means? What operations are you commonly performing against this data?
As a concrete example, checking whether two strings are case-insensitive equal is fairly fast given UTF-16 representation, and you can even perform early exits like "if the two strings are different lengths, they can't possibly be equal under a case-insensitive comparer." With UTF-8, these assumptions are not valid, so operations like checking for case-insensitive equality are more complex.
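To make the length-early-exit point concrete, here is an illustrative example of mine (not from the thread; exact behavior depends on the runtime's globalization mode): the Kelvin sign U+212A case-folds to `k`, so the two strings have equal UTF-16 lengths but different UTF-8 lengths.

```csharp
using System;
using System.Text;

// Both strings are one UTF-16 code unit long, so an equal-length early
// exit is valid for UTF-16 OrdinalIgnoreCase. In UTF-8 they occupy
// 3 bytes vs. 1 byte, so a byte-length check would wrongly rule out a match.
class LengthEarlyExitDemo
{
    static void Main()
    {
        string kelvin = "\u212A"; // KELVIN SIGN
        string latinK = "k";

        // True on ICU-based .NET (simple case folding maps U+212A to 'k').
        Console.WriteLine(string.Equals(kelvin, latinK, StringComparison.OrdinalIgnoreCase));

        Console.WriteLine(kelvin.Length);                      // 1 UTF-16 code unit
        Console.WriteLine(latinK.Length);                      // 1 UTF-16 code unit
        Console.WriteLine(Encoding.UTF8.GetByteCount(kelvin)); // 3 UTF-8 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount(latinK)); // 1 UTF-8 byte
    }
}
```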
If you know that your data is ASCII (not UTF-8), we're considering a bunch of APIs that can accelerate common operations over such data. Check out the work being done over at https://github.com/dotnet/runtime/issues/28230 and see if that might benefit your scenarios.
> Main goal: Only use 50% of the memory with equal performance in managed code.
>
> Can you define what "equal performance" means? What operations are you commonly performing against this data?
Sure, here are some things we use:

- `Dictionary` with `string` keys
- Replacing `"` with `""`
- `string[]` and `List<string>`
- Comparisons (Ordinal, case-sensitive)
- Reading `\n`-terminated lines from compressed data streams

> As a concrete example, checking whether two strings are case-insensitive equal is fairly fast given UTF-16 representation, and you can even perform early exits like "if the two strings are different lengths, they can't possibly be equal under a case-insensitive comparer." With UTF-8, these assumptions are not valid, so operations like checking for case-insensitive equality are more complex.
Good point. However, I don't think we're affected: We don't ever change casing, and all comparisons are case-sensitive (and ordinal). All of our strings are ASCII anyway.
> If you know that your data is ASCII (not UTF-8), we're considering a bunch of APIs that can accelerate common operations over such data. Check out the work being done over at #28230 and see if that might benefit your scenarios.
Thanks for the pointer. Interesting proposal, but as far as I can tell it doesn't contain any functionality that would be useful for our purposes.
Maybe we need to step up: if we keep talking about an API for a couple of years, it's going to end up as slow and inefficient as the C++ standard. Obviously, `Utf8String` saves memory to some extent and speeds up specific scenarios without spending CPU and additional memory on conversion.

Make `Utf8String` an option for developers.
@sgf We are currently discussing options here. We need to be really careful about what we do with UTF-8 strings: we do not want to duplicate all String APIs with `Utf8String` overloads, but at the same time we do want UTF-8 strings. Just swapping the internals of `string` would break lots of apps at the moment, as plenty of them rely on things like `fixed (char* foo = someString)`, which might cause really bad bugs. Once we analyze all the options we will figure out what the next step should be, and based on that we will make a call on whether we do this work soon, gradually, or not at all. With the experimental features available now there is a chance something will show up next release, but we do not have a firm decision either way.
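For context, an illustrative example of mine showing the pattern being referred to; it works because `string` today guarantees a pinnable UTF-16 `char` buffer (requires `AllowUnsafeBlocks`):

```csharp
using System;

class FixedStringDemo
{
    static unsafe void Main()
    {
        string s = "hello";

        // Pins the string and exposes its backing buffer as a char*.
        // Code like this silently bakes in the 2-bytes-per-char UTF-16
        // layout; swapping string's internals to UTF-8 would break it.
        fixed (char* p = s)
        {
            for (int i = 0; i < s.Length; i++)
                Console.Write((int)p[i] + " "); // 104 101 108 108 111
        }
    }
}
```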
What are the chances of getting UTF8 strings into the 7.0.0 milestone?
Sorry for the low-effort comment, I have not read the whole thread, but my opinion is that an external Utf8String is not convenient to use with regard to discoverability and readability (it explicitly exposes an implementation detail where the developer's intent is just to mean a string, which increases cognitive overhead), and for those reasons usage will be niche. A better solution would be to make strings UTF-8 by default, like Java 18 (https://news.ycombinator.com/item?id=31024255), or at the very least to provide a global setter (callable at program initialization, or a compiler flag) that sets the representation mode for any string constructed afterwards. That way, people would opt in to UTF-8 by default in their projects and get better performance and lower memory for most usages. You could still introduce an explicit Utf8 type (useful when a program mixes multiple representations), but the proposed global opt-in flag would fit the mainstream use case that most people have desired since the dawn of time.
Does the internal representation of String remain UTF-16 or Latin-1 (compact strings) in Java 18? Perhaps only the default encoding was changed to UTF-8.
Yes, UTF-8 is not the default internal representation as of now, although they use Latin-1 for strings whose characters fit in the ASCII table.
If roles become a thing, would it make sense to have `Utf8String` be a role for `ReadOnlySpan<byte>`? It wouldn't need to be a full type and would work with existing UTF-8 string usage.
The primary problem is that all of today's useful API is defined on `string`, not on `Utf8String`. Requiring conversions for all existing functions defined on `string` would cost too much performance to be useful, in my use cases.

However, one could augment all the existing types and methods with `Utf8String` overloads. Because that seems like a lot of work which will never be complete, I favor a runtime switch instead, where `string`'s internal backing memory can be switched to UTF-8. This way few things in managed code would have to be implemented twice. Vectorized string code and runtime functions would have to be adapted, and conversions become a necessary cost when using Win32 UTF-16 APIs.
I understand from https://github.com/dotnet/corefxlab/issues/2350 that the `Utf8String` proposal has been either dismissed or, at best, put on hold. If the former, I recommend closing this issue as well. I ask again whether further public discussion about first-class UTF-8 support (in whatever form it will actually be implemented) can be found elsewhere.
.NET does not need Utf8String and the rest of the bullshit mentioned here. Just give us variable-sized types (like string). The rest doesn't matter.
@timcassell Exactly.
Instead of introducing a `Utf8String` class, I think we should utilize the roles feature, which may ship in a future release of C#, to introduce a `Utf8String` role over `ROS<byte>`/`Span<byte>`/`ROM<byte>`/`Memory<byte>`/`byte[]`.

No breaking changes required, same functionality provided, more lightweight than classes.
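Roles don't exist yet, so as a rough approximation of the idea under today's language, one could imagine a thin `readonly ref struct` over `ReadOnlySpan<byte>`; everything here (the `Utf8View` name and its members) is hypothetical:

```csharp
using System;
using System.Text;

// Hypothetical sketch: a zero-allocation wrapper playing the part that a
// Utf8String "role" over ReadOnlySpan<byte> would play. Not a real BCL type.
readonly ref struct Utf8View
{
    private readonly ReadOnlySpan<byte> _bytes;

    public Utf8View(ReadOnlySpan<byte> bytes) => _bytes = bytes;

    public int Length => _bytes.Length;            // O(1), counts bytes
    public byte this[int index] => _bytes[index];  // O(1) byte access

    public bool EqualsOrdinal(Utf8View other) => _bytes.SequenceEqual(other._bytes);

    public override string ToString() => Encoding.UTF8.GetString(_bytes);
}

class Demo
{
    static void Main()
    {
        var view = new Utf8View("hello"u8); // u8 literal is a ReadOnlySpan<byte>
        Console.WriteLine(view.Length);     // 5
        Console.WriteLine(view.ToString()); // hello
    }
}
```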
Utf8String is required as an identifiable on-heap type, and it can't be substituted by `byte[]` by normal means. `byte[]` is what can already be used, and it is brutally incompatible with diagnostics/profiling. This is why VSTs (variable-sized types) are needed.
If we introduce a role type from the runtime, then the debugger/profiler can recognize the role type to provide diagnostics functionality.
@hez2010 I'm not against roles. I'm just pointing out that a named heap type (e.g. Utf8String) is useful for memory diagnostics. The two are not mutually exclusive.
With static interfaces, wouldn't it be possible to introduce an `IString<T>` interface like `INumber<T>` and implement it on `String` and `Utf8String`, so that string APIs could be written against an `IString<T>` type? Maybe an `IChar<T>` interface would also need to be introduced to handle char operations (`Utf8Char` and `Char` implementing `IChar<T>`).
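As a hedged sketch of what that shape might look like using today's static abstract interface members: all of these interface names are hypothetical, nothing like this exists in the BCL; the pattern simply mirrors `INumber<T>`.

```csharp
using System;

// Hypothetical sketch only: neither IChar<T> nor IString<T> exists in the BCL.
interface IChar<TSelf> where TSelf : IChar<TSelf>
{
    static abstract bool IsWhiteSpace(TSelf c);
}

interface IString<TSelf, TChar>
    where TSelf : IString<TSelf, TChar>
    where TChar : IChar<TChar>
{
    int Length { get; }
    TChar this[int index] { get; }
}

static class StringAlgorithms
{
    // Written once, usable with any (string, char) pair implementing the
    // interfaces, e.g. (String, Char) and (Utf8String, Utf8Char) if the
    // BCL adopted them.
    public static int CountNonWhiteSpace<TString, TChar>(TString s)
        where TString : IString<TString, TChar>
        where TChar : IChar<TChar>
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (!TChar.IsWhiteSpace(s[i]))
                count++;
        }
        return count;
    }
}
```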
Another option, instead of a runtime switch, is to add new `Utf8String` and `Utf8Char` types and add a compiler switch for C# to use these types instead of the old ones for the keywords `string` and `char`.
I find it hard to imagine a runtime switch changing the representation of `System.String` to UTF-8, because that would mean that indexing the string is now O(n) instead of O(1). Unless `char` is also changed to be 1 byte, but that would be a huge breaking change, and basically every existing app would be broken.
@Neme12 No it isn't. A UTF-8 string would not index into the code points, but into the raw bytes. It also would not change `System.String`, but rather re-alias `string` to a brand-new type. Indexing code points should not be in the BCL, but should be done through an extension method from an additional NuGet package, as 99% of people shouldn't be worrying about code points unless they're rendering, validating, or normalizing the string, none of which the BCL should be doing. Parsing things like CSV, HTTP, or source code is also fine; there it's safe to assume the string contains only 7-bit ASCII.

For cases of normalization, parsing, runes and such, it's always a cat-and-mouse game, with new "runes" added all the time, which is why I say it's not suited for the BCL itself.
@AshleighAdams Indexing a standard UTF-16 string doesn't give you code points; it gives you just 16-bit code units. A lot of Unicode characters (code points) take two 16-bit code units. UTF-16 is no different from UTF-8 on this matter. Only UTF-32 allows indexing code points directly.
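A quick self-contained illustration of the code-unit point, using standard BCL APIs:

```csharp
using System;

// "😀" (U+1F600) needs a surrogate pair in UTF-16, so the string indexer
// hands back two code units, neither of which is the code point itself.
class CodeUnitDemo
{
    static void Main()
    {
        string s = "😀";
        Console.WriteLine(s.Length);                  // 2 (UTF-16 code units)
        Console.WriteLine((int)s[0]);                 // 55357 (0xD83D, high surrogate)
        Console.WriteLine((int)s[1]);                 // 56832 (0xDE00, low surrogate)
        Console.WriteLine(char.ConvertToUtf32(s, 0)); // 128512 (U+1F600, the code point)
    }
}
```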
> No it isn't.

Which part were you responding to?
@vrubleg I never said the built-in string gives you code points. It just adds to my reasoning that a BCL UTF-8 string should not try to enumerate code points, giving only O(1) access to the raw bytes.
It gives O(1) access to code units, which is consistent with standard UTF-16 strings. Yes, in the case of UTF-8 the code units are raw bytes, and in the case of UTF-16 they are raw 16-bit words. So what?
@vrubleg I'm refuting @Neme12's assertion that indexing the string would be O(n)
Probably @Neme12 meant that it would be O(n) if just the internal representation of `System.String` were changed to UTF-8 while maintaining external visibility as a normal UTF-16 string. Hence the proposal to introduce a separate type `System.Utf8String` and a compile-time switch for `string` to represent either `System.String` or `System.Utf8String` (which could even be per file, like `nullable`).
@vrubleg Oooh my bad, I misinterpreted what they were saying. I apologize, and am in agreement then 🙈
> Hence the proposal to introduce a separate type `System.Utf8String` and a compile-time switch for `string` to represent either `System.String` or `System.Utf8String` (which could even be per file, like `nullable`).
@vrubleg If viable, this seems a great solution. It's not clear whether this is currently the official Microsoft stance, though: are there more references on this?
A Utf8String type would be a good addition. `""u8` giving you a `ReadOnlySpan<byte>` isn't great; having a first-class type whose span can basically be passed straight to a stream, where the encoding is very often UTF-8, would be a huge gain.

There is https://github.com/U8String/U8String, but .NET internals could likely also benefit from a native UTF-8 string type.
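For reference, the u8-literal path being discussed looks like this today, using only standard BCL APIs:

```csharp
using System;
using System.IO;

class U8LiteralDemo
{
    static void Main()
    {
        using var stream = new MemoryStream();

        // C# 11 u8 literals produce a ReadOnlySpan<byte> of UTF-8 bytes,
        // which Stream.Write accepts directly with no transcoding step.
        stream.Write("HTTP/1.1 200 OK\r\n"u8);

        Console.WriteLine(stream.Length); // 17 bytes written
    }
}
```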
@ramonsmits Interesting, though if a standard Utf8String type is added, I don't think any parsing like that library does should be added.

People using UTF-8 strings are in 99% of cases just moving the string around, or doing simple ASCII parsing. Very rarely are people actually rendering it themselves.

I think the type .NET gets (if it does get one) should have constant-time access to each `byte`/`Char8`, along with an `EnumerateCodepoints()` method, as the UTF-8 spec is fixed in how those are parsed, but trying to do runes in the language's runtime/BCL would be a mistake. String normalization, understanding how many code points form a "single visible character", is hard, error prone, and a lot of data to store. Instead I think those can be provided via an extension method in a NuGet package, and that package could then move fast without being tied to a specific .NET version or what not.
Edit: Actually no, I misunderstood, a rune is a Unicode codepoint, in which case that library is perfect I think, not over-extending into areas it shouldn't, nice!
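For what it's worth, the fixed UTF-8 decoding rules mentioned above are already exposed by the BCL's `Rune` type; a small sketch of code-point enumeration over raw bytes (the `EnumerateCodepoints()` method itself is hypothetical, but this is roughly what it would do):

```csharp
using System;
using System.Buffers;
using System.Text;

class Utf8CodepointDemo
{
    static void Main()
    {
        ReadOnlySpan<byte> utf8 = "héllo"u8; // 'é' takes 2 bytes in UTF-8

        // Rune.DecodeFromUtf8 applies the fixed decoding rules of the
        // UTF-8 spec; each iteration consumes exactly one code point.
        while (!utf8.IsEmpty)
        {
            OperationStatus status = Rune.DecodeFromUtf8(utf8, out Rune rune, out int consumed);
            if (status != OperationStatus.Done)
                break; // invalid or truncated sequence

            Console.WriteLine($"U+{rune.Value:X4} ({consumed} byte(s))");
            utf8 = utf8.Slice(consumed);
        }
    }
}
```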
AB#1117209

This is the API proposal for `Utf8String`, an immutable, heap-allocated representation of UTF-8 string data. See https://github.com/dotnet/corefxlab/issues/2368 for the scenarios and design philosophy behind this proposal. Also included are APIs to improve text processing across the framework as a whole, including changes to existing types like `String` and `CultureInfo`.

Edits:
Nov. 8-9, 2018 - Updated API proposals in preparation for upcoming review.