dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Utf8.TryWrite applies alignment by counting bytes instead of characters #109615

Open jdryden572 opened 2 weeks ago

jdryden572 commented 2 weeks ago

Description

When using Utf8.TryWrite to write an interpolated string as UTF8 bytes, passing an alignment value with any of the formatted values does not always result in the same amount of padding as string.Format or default string interpolation would produce. If the formatted value has any non-ASCII characters, then less padding will be added.

Reproduction Steps

using System;
using System.Text;
using System.Text.Unicode;

string[] examples = new[]
{
    "\u0108",       // Ĉ 1 char, 2 bytes UTF8
    "\u20ac",       // € 1 char, 3 bytes UTF8
    "\ud83d\ude00", // πŸ˜€ 2 chars, 4 bytes UTF8
};

foreach (string s in examples)
{
    Console.WriteLine($"utf16: [{s,4}]");
}
foreach (string s in examples)
{
    Span<byte> span = new byte[8];
    Utf8.TryWrite(span, $"[{s,4}]", out int written);
    Console.WriteLine("utf8:  " + Encoding.UTF8.GetString(span.Slice(0, written)));
}

Expected behavior

Formatting a value with an alignment in Utf8.TryWrite should produce the same amount of padding in UTF8 as is added by other .NET (UTF16) string formatting.

For the code snippet above, it should produce:

utf16: [   Ĉ]
utf16: [   €]
utf16: [  😀]
utf8:  [   Ĉ]
utf8:  [   €]
utf8:  [  😀]

Actual behavior

When the formatted value includes any characters that require more than 1 byte in UTF8 encoding, the alignment is incorrect and produces less padding in Utf8.TryWrite.

For the code snippet above, it produces:

utf16: [   Ĉ]
utf16: [   €]
utf16: [  πŸ˜€]
utf8:  [  Ĉ]
utf8:  [ €]
utf8:  [😀]

Regression?

This has been the behavior since the Utf8.TryWrite API was introduced in .NET 8, and it is also reproducible in .NET 9.

Known Workarounds

If the correct padding is really needed, default string interpolation or formatting can be used to format the value as UTF16 in a string or a Span<char>, and then that UTF16 can be encoded into the UTF8 output Span<byte> using Encoding.UTF8.GetBytes.

This loses the nice ergonomics of formatting directly into the UTF8 buffer, and either allocates (if making a string) or requires more buffer management to get a Span<char>.
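For illustration, a minimal sketch of that workaround (illustrative only; buffer sizes are chosen arbitrarily, and MemoryExtensions.TryWrite is used to format into the Span<char> first):

using System;
using System.Text;

// Format as UTF-16 first (which pads correctly), then transcode into
// the UTF-8 destination.
Span<char> chars = stackalloc char[32];
Span<byte> utf8 = new byte[32];
if (chars.TryWrite($"[{"\u20ac",4}]", out int charsWritten))
{
    int bytesWritten = Encoding.UTF8.GetBytes(chars.Slice(0, charsWritten), utf8);
    Console.WriteLine(Encoding.UTF8.GetString(utf8.Slice(0, bytesWritten))); // [   €]
}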

Configuration

Tested in .NET 8 & .NET 9 Preview I'm using Windows x64, but I'm pretty sure this is not platform/arch dependent.

Other information

Before I begin -- I am interested in trying to fix this and I'm happy to open a PR for it. It would be my first time contributing however, so I understand if you feel someone else should handle fixing it instead.

The issue is that the amount of required padding is being determined by counting how many bytes were written, even though we're working with UTF8 where many characters take more than one byte. Here's the culprit: https://github.com/dotnet/runtime/blob/e133fe4f5311c0397f8cc153bada693c48eb7a9f/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8.cs#L778

The simple (and maybe too naive) approach to fix this would be to use Encoding.UTF8.GetCharCount on the slice that was written, to measure how many chars the formatted text ended up writing. But this private method is called by multiple overloads of AppendFormatted, and for some of them we already know how many chars we wrote. For example, if the value being formatted is a ReadOnlySpan<char> or string, we know how many chars it had. Or if it was ISpanFormattable, we already formatted it into our own Span<char> buffer before writing and know how many chars there are.

So I think a better solution might be to find a way to have the overloads pass an optional int charsWritten if they know how many there were. If not, the alignment handling should call Encoding.UTF8.GetCharCount on the bytes we wrote so far to calculate how many chars it ended up being.
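To make the idea concrete, here is a rough sketch (the helper name and shape are hypothetical, not the actual private members in Utf8.cs) of alignment handling that prefers a caller-supplied char count and falls back to measuring the bytes written:

using System;
using System.Text;

// Hypothetical fallback: derive the padding for an alignment hole from the
// UTF-8 bytes already written, counting UTF-16 chars (to match string.Format)
// instead of bytes.
static int GetPaddingNeeded(ReadOnlySpan<byte> written, int alignment, int? charsWritten = null)
{
    // Overloads that already know the char count (string, ReadOnlySpan<char>,
    // ISpanFormattable) could pass it in; otherwise decode to count the chars.
    int chars = charsWritten ?? Encoding.UTF8.GetCharCount(written);
    return Math.Max(0, Math.Abs(alignment) - chars);
}

byte[] euro = Encoding.UTF8.GetBytes("\u20ac");
Console.WriteLine(GetPaddingNeeded(euro, 4)); // 3, matching string.Format; counting bytes gives 1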

dotnet-policy-service[bot] commented 2 weeks ago

Tagging subscribers to this area: @dotnet/area-system-text-encoding See info in area-owners.md if you want to be subscribed.

tarekgh commented 2 weeks ago

CC @stephentoub

tannergooding commented 2 weeks ago

This isn't actually any different from string.Format; you can see the same occur if you use any surrogate characters or combining marks.

Consider

Console.WriteLine($"{"A",5}");   // Length: 1
Console.WriteLine($"{"πŸ‘¨β€πŸ‘¦",5}");  // Length: 5

//    A
//πŸ‘¨β€πŸ‘¦
tannergooding commented 2 weeks ago

The main consideration is that it is easier to see such points of failure with UTF-8 because more things require multiple characters to represent.

Even with Rune where everything is functionally "1 character" you would have the same issue, because there is a distinction between "number of code points" and "amount of visual space taken". -- A code point is a singular Unicode value in the range 0 to 0x10FFFF, it may be represented by 1 or more characters depending on the encoding (UTF-8 is 1-4 characters per code point, UTF-16 is 1-2 characters per code point, UTF-32 (effectively Rune) is always 1 character per code point).

👨‍👦, for example, is 3 Unicode code points represented by man + zero-width joiner + boy ("👨" + "‍" + "👦", where the middle one there is not an empty string; it's an invisible code point that takes zero visual space).
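One way to see these distinctions in code (a small illustration, not from the thread):

using System;
using System.Text;

string s = "\U0001F468\u200D\U0001F466"; // 👨‍👦: man + ZWJ + boy
int codePoints = 0;
foreach (Rune r in s.EnumerateRunes())    // iterates scalar values
{
    Console.WriteLine($"U+{r.Value:X4}"); // U+1F468, U+200D, U+1F466
    codePoints++;
}
Console.WriteLine($"{codePoints} code points, {s.Length} UTF-16 chars, {Encoding.UTF8.GetByteCount(s)} UTF-8 bytes");
// 3 code points, 5 UTF-16 chars, 11 UTF-8 bytes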

jdryden572 commented 2 weeks ago

Thanks for the reply, @tannergooding!

I agree that the notion of "alignment" in format strings is more complicated than simply counting visual characters, and your example is a very good one. However, I disagree with this statement:

This isn't actually any different from string.Format ...

Because the problem (as I see it) is that they are in fact producing something different. It's more apparent if we add enough padding to see at least one padding character in each encoding:

Span<byte> span = new byte[16];
Utf8.TryWrite(span, $"[{"👨‍",8}]", out int written);
Console.WriteLine(Encoding.UTF8.GetString(span.Slice(0, written)));
Console.WriteLine($"[{"👨‍",8}]");

// [ 👨‍]
// [     👨‍]

My hope was that Utf8.TryWrite would produce the same amount of padding as string.Format, even if the latter has its own quirks regarding alignment on values that cannot be represented in one UTF16 character.

elgonzo commented 2 weeks ago

This issue is not just about surrogate pairs. Surrogate pairs are by definition not a single code point. But two of the character examples given in the issue report are single code points, not pairs of code points or pairs of characters with a combining mark or some such.

While I can (somewhat) understand the behavior regarding surrogate pairs and any sequence of two or more code points that renders as a single character, what exactly is the useful purpose of a padding algorithm whose behavior depends on the bit-length/number of the underlying code units (octets) for a single code point? It's string interpolation and not byte buffer interpolation, after all.

tannergooding commented 2 weeks ago

Edit: Adjusted to use the term "scalar value", which is more appropriate than code point.

Because the problem (as I see it) is that they are in fact producing something different

They are producing different outputs, but still behaving identically. There is a difference in the concepts.

The behavior for alignment is that it operates with respect to the number of characters. UTF-8, UTF-16, and UTF-32 use a different number of characters to represent the same data; thus an alignment of 5 for a UTF-8 string is often going to differ from an alignment of 5 for a UTF-16 string, because the length of each differs even if they appear visually identical.

Surrogate pairs are by definition not a single code point

This is incorrect; what you're thinking of is Unicode Characters. As per https://unicode.org/glossary/

The terminology that various programming languages use, and that occurs in common conversation, tends to be much looser, with things getting mixed up and used interchangeably. What is formally defined by the Unicode spec here, and the way .NET operates, is much more strict.

.NET strings have never worked on ~~code points~~ scalar values as fundamental units; they have always been a representation of individual characters. This in turn shows up as part of indexing, default comparisons, and so on. The use of UTF-16 by default, and the fact that most typical text does not require multiple characters or code points to represent a single visual glyph, has led to incorrect presumptions (and often bugs) when such text is encountered, particularly as emoji have become more commonplace.

tannergooding commented 2 weeks ago

There are in many cases overloads of functions, additional parameters you can pass in, or alternative APIs you can use that do try to operate in terms of more "visual" information or in terms of the abstract data that is represented; such that, for example, à (the individual code point 0x00E0) and à (the letter a, 0x0061, + the combining grave accent ◌̀, 0x0300) are treated as equal.
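For example (an illustrative snippet, not from the thread; both normalization and culture-sensitive comparison treat the two forms as equivalent):

using System;

string composed = "\u00E0";    // à as a single code point
string decomposed = "a\u0300"; // a + combining grave accent
Console.WriteLine(composed == decomposed);             // False (ordinal, per code unit)
Console.WriteLine(composed == decomposed.Normalize()); // True  (NFC normalization)
Console.WriteLine(string.Equals(composed, decomposed,
    StringComparison.InvariantCulture));               // True  (culture-aware comparison)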

But, such a scenario does not currently exist for string.Format and the alignment functionality that is available for it. Such a request would be new and would need to consider many things, including concepts like combining characters, zero width joiners, and similar.

elgonzo commented 2 weeks ago

Surrogate pairs are by definition not a single code point

This is incorrect

No, it's not incorrect. Excerpts from the same standard you linked to (https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3):

Section 3.2.1

C1 A process shall not interpret a high-surrogate code point or a low-surrogate code point as an abstract character. The high-surrogate and low-surrogate code points are designated for surrogate code units in the UTF-16 character encoding form. They are unassigned to any abstract character.

Section 3.4

D10a Code point type: Any of the seven fundamental classes of code points in the standard: Graphic, Format, Control, Private-Use, Surrogate, Noncharacter, Reserved.

Section 3.8

D71 High-surrogate code point: A Unicode code point in the range U+D800 to U+DBFF.

D73 Low-surrogate code point: A Unicode code point in the range U+DC00 to U+DFFF.

elgonzo commented 2 weeks ago

.NET strings have never worked on code points as fundamental units;

I don't know what you are trying to say, but .NET is disagreeing with you: https://dotnetfiddle.net/taXr3E

tannergooding commented 2 weeks ago

You're taking some data out of context there, with respect to how .NET (and most programming languages) operates.

Character can be a synonym for abstract character (definition 2); it can equally refer to the basic unit of encoding (definition 3). For UTF-16, a surrogate pair is represented by two basic units of encoding, which are System.Char in .NET.

.NET strings (for indexing, alignment, and many other operations) deal in terms of the basic units of encoding and thus a surrogate pair is "2 .NET characters".

We are not treating these as "abstract characters", but as characters which represent the basic unit of encoding.
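A quick illustration (not from the thread) of strings dealing in basic units of encoding:

using System;

string s = "\U0001F600"; // 😀, a single scalar value
Console.WriteLine(s.Length);                   // 2: two UTF-16 code units
Console.WriteLine($"0x{(int)s[0]:X4}");        // 0xD83D (high surrogate)
Console.WriteLine($"0x{(int)s[1]:X4}");        // 0xDE00 (low surrogate)
Console.WriteLine(char.IsSurrogatePair(s, 0)); // True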

tannergooding commented 2 weeks ago

I don't know what you are trying to say, but .NET is disagreeing with you: https://dotnetfiddle.net/taXr3E

😀 or the grinning face emoji is formally defined by the single Unicode code point: 0x1F600

In order to be encoded as UTF-16, it must be represented by two basic units of encoding defined as a surrogate pair: 0xD83D, 0xDE00

From a technical perspective, surrogate pairs are also code points; but they are unique/special and are only valid in UTF-16. They are not part of the broader Unicode encoding forms; they represent invalid code points for UTF-8 and UTF-32.

In other words, a Unicode scalar value is any code point except the surrogates, which are unique to UTF-16.
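System.Text.Rune encodes exactly this rule; a tiny illustration:

using System;
using System.Text;

Console.WriteLine(Rune.IsValid(0x1F600)); // True:  a scalar value
Console.WriteLine(Rune.IsValid(0xD83D));  // False: a high-surrogate code point
// new Rune(0xD83D) would throw ArgumentOutOfRangeException for the same reason.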

elgonzo commented 2 weeks ago

You're taking some data out of context there

I am not taking anything out of context here. You claimed something I wrote to be incorrect, despite it being defined in the specification (that you yourself linked to). Yet you claim that it's me who is taking things out of context here. Okay...

😀 or the grinning face emoji is formally defined by the single Unicode code point: 0x1F600

It's not about some other code point. Stop deflecting. I just showed you an example, using a character represented as a surrogate pair, demonstrating that .NET strings do indeed operate on the code-point level.

I don't even know anymore what you are trying to argue here. I am out; there is no basis for a reasonable discussion like this...

tannergooding commented 2 weeks ago

The general point, regardless of terminology, is that the above behavior is correct and by design.

If you would like for a different behavior to be possible, that would be a new feature request. It would need to determine whether alignment operates in terms of characters, code points, scalar values, visual glyphs, or one of the many other possible ways to interpret the data that represents a string.

elgonzo commented 2 weeks ago

The general point, regardless of terminology, is that the above behavior is correct and by design.

Yeah, and is it documented? No? Don't worry, no biggie...

tannergooding commented 2 weeks ago

There is fairly comprehensive documentation covering how strings work, how composite formats work, etc

https://learn.microsoft.com/en-us/dotnet/fundamentals/runtime-libraries/system-string, https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding-introduction, https://learn.microsoft.com/en-us/dotnet/standard/base-types/composite-formatting#alignment-component, etc

There are always opportunities to improve things further, cross-reference data in more places, etc. As is, this behavior is covered by the discussion of alignment operating with regard to the string length, and by the definition of a string's length being tied to the number of basic units of encoding (i.e. 8-bit components for UTF-8, and 16-bit components for UTF-16; hence why string indexing returns a char and why UTF-8 strings are represented as ReadOnlySpan<byte>).

Contributions are welcome

elgonzo commented 2 weeks ago

Nothing in the documentation links you have given explains or gives clues about how string interpolation alignment in Utf8.TryWrite depends on the number of code units. There is no point throwing documentation links around that do not contain any of the documentation I claimed to be missing... What the... is even going on here...???

This behavior is already in .NET 8, and .NET 9 will shortly be released. But when you move fast, I guess there's no time to document things...

Contributions are welcome

What is this? You design, write, and release an API, and then you ask the clueless users who just so happen to be surprised by its behavior to write its documentation? Because apparently surprised users are better experts at correctly explaining the API than the creators of the API themselves. Oh man...

tannergooding commented 2 weeks ago

The links there are to the broader conceptual docs that discuss how strings work in .NET. They reference the various Unicode specs, give examples of terminology, etc.

Utf8.TryWrite takes in a Utf8.TryWriteInterpolatedStringHandler, which documents its handling of alignment as:

Minimum number of characters that should be written for this value. If the value is negative, it indicates left-aligned and the required minimum is the absolute value.

There is potential for additional cross references, links, or remarks to be added if a user believes it to be confusing, but all the information does exist and is available.

elgonzo commented 2 weeks ago

Minimum number of characters

But that is, according to your own statement that the API works as intended, obviously not true. Because if this statement in the documentation were true, then the observed behavior should match more closely with the expected result given in the issue report above. So what is now the correct behavior? You guys know. Don't write it here in the issue report comments; write it into the documentation. Please.

huoyaoyuan commented 2 weeks ago

The current behavior of padding in interpolated string formatting is consistent for me, without appealing to anything else: both count the encoded length in the current UTF encoding, not the count of Unicode scalars. Counting in Unicode scalars would be a new behavior that may meet some expectations, but may break others.

then the observed behavior should match more closely with the expected result given in the issue report above

Expectation is very subjective. String length is always a complex thing with Unicode.

Don't write it here in the issue report comments, write it into the documentation. Please.

Agreed that documentation should be explicit about detailed behaviors.

colejohnson66 commented 2 weeks ago

Regardless of UTF-8 or UTF-16, alignment itself is a mixed bag: it's designed to align, but doesn't consider visual size because it can't. That's a property of the font used to render. The existing method of counting UTF-16 codepoints works great because it's likely that the majority of characters used are in the BMP, and the majority of those are single-width on terminals. So, everything working with UTF-16 is almost a coincidence. UTF-8 aligning by bytes (not codepoints), while an English-centric view, just exposes the problem of alignment spaces.

I agree with the author that alignment should probably consider codepoints, not bytes, but if that's done, why not do it as well for UTF-16 codepoints? Except we can't because that might break existing code. As such, the behavior, while counterintuitive, makes sense and is consistent.

jdryden572 commented 2 weeks ago

Thank you to everyone for the vigorous discussion! If it is the consensus here, I can accept that alignment in Utf8.TryWrite is working as intended. Obviously it is not what I expected -- but as @huoyaoyuan wisely mentioned, everyone brings their own expectations to the table. Consider this comment my final appeal before I drop the issue.

I would offer for your consideration that we could standardize the behavior of alignment around how it works for string, including and accepting the quirks due to that implementation counting UTF-16 codepoints. This has a strong precedent, considering how long it has been around for string.Format and default string interpolation, and is the behavior described in the composite formatting documentation which Utf8.TryWrite implicitly inherits since it is a string interpolation handler. An obvious downside to this would be that the existing behavior is already in the wild in Utf8.TryWrite today, and hypothetically someone could be relying on the byte-counted padding. I imagine such a change would require stronger justification than what I am presenting here, but I wanted to state my case.

One thing to note is that the existing tests for this API all employ a test strategy of formatting to UTF-8 and a StringBuilder at the same time, then calling Encoding.UTF8.GetString to assert that the resulting UTF-16 string matches the one produced by the StringBuilder. The way these tests are written strongly implies that the intention for this API was to match the behavior of default string-based interpolation and formatting. I interpreted these tests to mean that the absence of cases involving multi-byte UTF-8 characters was simply an oversight.
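A rough sketch of that test pattern (hypothetical code, not the actual tests), which would fail today for any non-ASCII value with alignment:

using System;
using System.Text;
using System.Text.Unicode;

// Format the same interpolated string both ways, then compare the decoded results.
Span<byte> utf8 = new byte[64];
var sb = new StringBuilder();
string value = "\u20ac";
Utf8.TryWrite(utf8, $"[{value,4}]", out int written);
sb.Append($"[{value,4}]");
Console.WriteLine(Encoding.UTF8.GetString(utf8.Slice(0, written)) == sb.ToString()); // False today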

If we want the current behavior to remain unchanged, I can work on proposing documentation updates to clarify that. Would it also be appropriate to add additional tests that enshrine this requirement? If so, I could potentially propose those updates too.

elgonzo commented 2 weeks ago

The existing method of counting UTF-16 codepoints works great because it's likely that the majority of characters used are in the BMP, and the majority of those are single-width on terminals.

And that's why System.String-based interpolation is useful in a practical and intuitive manner. What exactly is the practical use of alignment here for Utf8.TryWrite, if you can't get the actual text result you want without having to analyze the number of code units of the characters consumed by an interpolation expression with alignment, and then somehow adjust the alignment parameters dynamically on the fly?

but doesn't consider visual size because it can't

Bringing the visual size of glyphs into this discussion is a distraction. Alignment on a character (code-point) level is a very useful feature that is practical to use in System.String-based interpolations. Monospaced fonts are a thing, and not only in the console. Many .txt files in the world are formatted in a way meant to be rendered using monospaced fonts.

I agree with the author that alignment should probably consider codepoints, not bytes, but if that's done, why not do it as well for UTF-16 codepoints? Except we can't because that might break existing code. As such, the behavior, while counterintuitive [...]

I agree, it's very likely a case of what's done is done, implementation-wise. But precisely because it is counterintuitive and diverges from the precedent of System.String-based interpolation, it needs to be clearly and visibly documented.

People have been used to System.String-based interpolation and formatting for many years now; how it works is intuitively understandable, and it's practical to use. They are used to crafting text on a character level using string interpolation. From this, expectations are born. It's not a matter of whether you call such expectations subjective: it's a widely used feature (string interpolation) serving as precedent. The way alignment (and perhaps other interpolation features? I don't know...) works in UTF-8 interpolations diverges from this precedent and is in conflict with the expectations tied to it. The documentation needs to address this divergence in a clear and discoverable way. Which it currently, unfortunately, doesn't: by virtue of repeatedly using the term "string interpolation", it "borrows" from the existing knowledge about (and thus indirectly from the documentation for) System.String-based interpolation, which is counterproductive, as it reinforces very reasonable yet incorrect expectations.

tannergooding commented 2 weeks ago

There remains an incorrect interpretation of how it works for UTF-16 strings. -- Part of it likely stems from me incorrectly mixing up the terms code point and scalar value above, but also from other comments above using or mixing terms in the wrong way (such as treating code point and character as equivalent).

It's also worth calling out early that dealing with code points as suggested would itself not make UTF-8 produce the same result. A simple example is that 😀 (a single scalar value) can only be represented in UTF-16 by a surrogate pair (2x code points encoded as 2x basic units of encoding), while in UTF-8 surrogates are illegal/invalid and thus it can only be represented as 1x code point encoded as 4x basic units of encoding.

This would mean that $"{"😀",4}" would still fundamentally differ between the two even if the thing being asked for above was implemented. We would instead have to deal with scalar values (aka System.Text.Rune in .NET) for them to always be consistent.
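For instance (an illustrative snippet), counting scalar values gives the same answer for both encodings, unlike counting basic units of encoding:

using System;
using System.Buffers;
using System.Text;

string s = "\U0001F600"; // 😀
byte[] bytes = Encoding.UTF8.GetBytes(s);

int utf16Units = s.Length;    // 2 chars (a surrogate pair)
int utf8Units = bytes.Length; // 4 bytes

// Decode the UTF-8 bytes one scalar value at a time.
int scalars = 0;
ReadOnlySpan<byte> rest = bytes;
while (Rune.DecodeFromUtf8(rest, out _, out int consumed) == OperationStatus.Done)
{
    scalars++;
    rest = rest.Slice(consumed);
}
Console.WriteLine($"{utf16Units} vs {utf8Units} units, but {scalars} scalar value(s) either way");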


Unicode provides 4 possible definitions for character:

Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

Unicode has code points:

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.

Unicode has scalar values:

Unicode Scalar Value. Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)


In .NET, we have UTF-16 strings which are string or ReadOnlySpan<char> and UTF-8 strings which are ReadOnlySpan<byte>. In both cases, these strings are composed of a sequence of basic units of encoding which for many reasons ends up being referred to as characters (hence the type name System.Char). The Length of the string or ReadOnlySpan then tells you how many of these basic units of encoding exist.

While all basic units of encoding for UTF-16 strings happen to be code points, that is not what .NET conceptually has dealt with. This is likewise not what other languages and ecosystems typically deal with, as it adds hidden overhead/cost to other encodings (namely UTF-8) when performing indexing, slicing, or many other common/foundational operations.

The definition for how alignment works in composite formats (i.e. what is used by string interpolation) correspondingly discusses its operation simply in terms of the length of the string:

The optional alignment component is a signed integer indicating the preferred formatted field width. If the value of alignment is less than the length of the formatted string, alignment is ignored, and the length of the formatted string is used as the field width. The formatted data in the field is right-aligned if alignment is positive and left-aligned if alignment is negative. If padding is necessary, white space is used. The comma is required if alignment is specified.

This is relevant because it clearly covers the contract that alignment is compared to Length, and since Length is the number of basic units of encoding this means that the alignment needed to account for Ĉ between UTF-16 (where length is 1) and UTF-8 (where length is 2) differs.

The UTF-8 handling therefore is consistent with the designed behavior for UTF-16 (which is not the same as the behavior some are interpreting it to be above -- i.e. despite appearances, we deal with basic units of encoding, not code points, here; it is only because all basic units of encoding for UTF-16 happen to be code points that it appears the other way).
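Concretely (an illustrative snippet), both encodings apply the same padding = alignment - Length rule; only Length differs:

using System;
using System.Text;

string s = "\u0108"; // Ĉ
int alignment = 4;
Console.WriteLine(alignment - s.Length);                      // 3 spaces of padding in UTF-16
Console.WriteLine(alignment - Encoding.UTF8.GetByteCount(s)); // 2 spaces of padding in UTF-8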

I can agree this is unfortunate and likely unexpected for many devs who are switching to UTF-8. There are other cases where that is true as well, such as when indexing, slicing, or performing other operations (because things represented by multiple units of encoding are more rare in UTF-16, but incredibly common in UTF-8).


It might be beneficial for some additional API overloads to be added which do deal with other units.

It might be relevant to have operations that work in terms of:

- characters (the basic units of encoding: char for UTF-16, byte for UTF-8)
- code points
- scalar values (System.Text.Rune)
- grapheme clusters ("visual glyphs"; see the snippet below)

It might also be relevant to include additional cross references (see also links) in some docs to the existing conceptual docs so that the behavior is more clear; but it is likely going to confuse people in many cases regardless, due to this simply being a complex space with many nuances and considerations.
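As an illustration of the last of those units, .NET's closest notion of a "visual glyph" on modern .NET is the extended grapheme cluster, exposed via StringInfo:

using System;
using System.Globalization;

string s = "\U0001F468\u200D\U0001F466"; // 👨‍👦
Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 grapheme cluster
Console.WriteLine(s.Length);                               // 5 UTF-16 chars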

Contributions adding links to existing docs or adding remarks that clarify the behavior are welcome. This is all open source and there will always be things that one person thinks of that another doesn't, because every one of us thinks and understands things differently from others.

For new features/APIs, we need API proposals to be opened that follow the relevant template: https://github.com/dotnet/runtime/issues/new?template=02_api_proposal.yml

We cannot change the existing behavior because it would be too fundamentally breaking. It would also end up impacting downstream languages which use these APIs as part of their features (such as C# using it for string interpolation).