mattwarren opened this issue 8 years ago
It was updated to ASCII rather than ISO-8859-1 in RFC 7230 §3.2.4 (Field Parsing).
Latin-1 conflicts with UTF-8 whereas ASCII is compatible (so you can cross-detect between the two).
Interestingly the current string format does already have an IsAscii flag on it, though it considers ' and - to not be ASCII :frowning:
However changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to p/invoke etc).
ASCII strings are something that @evincarofautumn experimented with a lot for Mono: the proposal is here http://www.mono-project.com/docs/advanced/runtime/docs/ascii-strings/ (and there are a few more details in the mono-devel list thread about it).
I'm sure he can go into great detail about the pros/cons 😄
However changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to p/invoke etc)
Hmm, I didn't think about compatibility with code that uses fixed against strings, yeah that's a problem. Looks like the Mono experiment solved it by disabling fixed on strings.
Looks like the mono experiment solved it by disabling fixed on strings.
Could always stackalloc length * 2 bytes at fixed start, then copy and widen with _mm_unpacklo_epi8 & _mm_unpackhi_epi8, then compress back when the fixed block closes. Would be slow though...
Or allow both fixed byte* and fixed char*
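For illustration, here's a minimal sketch (not runtime code, and the helper name is made up) of the widen step being described, using System.Numerics.Vector in place of the raw _mm_unpacklo_epi8/_mm_unpackhi_epi8 intrinsics:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class AsciiWidening
{
    // Widen 8-bit (ASCII) data to 16-bit UTF-16 code units.
    // Assumes utf16.Length >= ascii.Length.
    public static void Widen(ReadOnlySpan<byte> ascii, Span<char> utf16)
    {
        Span<ushort> dest = MemoryMarshal.Cast<char, ushort>(utf16);
        int i = 0;
        for (; i + Vector<byte>.Count <= ascii.Length; i += Vector<byte>.Count)
        {
            var bytes = new Vector<byte>(ascii.Slice(i));
            // Equivalent to the unpack-lo/unpack-hi pair: interleave with zero bytes.
            Vector.Widen(bytes, out Vector<ushort> lo, out Vector<ushort> hi);
            lo.CopyTo(dest.Slice(i));
            hi.CopyTo(dest.Slice(i + Vector<ushort>.Count));
        }
        for (; i < ascii.Length; i++)   // scalar tail
            dest[i] = ascii[i];
    }
}
```

The reverse "compress back" step on leaving the fixed region would use Vector.Narrow, and would only be valid if no non-ASCII data had been written in the meantime.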
Why use ASCII instead of Latin-1 for the 1-byte version? Latin-1 would permit using one byte for a lot of the accented characters of Western Europe, such as 'è', 'ò', 'ù'...
Could it make sense to have a third type at this point, UTF-32, for strings containing characters that cannot be represented in UTF-16 (Chinese characters, Klingon, emoticons, ...) without resorting to the kludge of surrogate pairs (which are not really supported in .NET [1])? It probably would not accomplish much compression in this case, but the implementation could be more correct.
[1] De facto, .NET strings are really UCS-2, not UTF-16.
Why use ASCII instead of Latin 1
ASCII and UTF8 are compatible; Latin-1 and UTF8 are not, so conversions between the two are required.
UTF8 is just as compact for the non-accented characters and is currently the general interchange format, so using Latin-1 would add conversion load and allocation to and from UTF8, whereas ASCII requires no conversion.
Also, using Latin-1 in places where only ASCII is allowed (HTTP headers etc.) adds a validation step to and from. There are fast ASCII algorithms that don't apply to Latin-1, e.g. case-insensitive compare (ignore a single bit), upper and lower casing (flip a bit), etc.
For example, if a string is known to be ASCII (1 byte per char) it requires no validation steps and its bytes can be used directly; conversion to UTF8 is instant, as its bytes are already also valid UTF8.
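As a concrete illustration of those bit tricks (just a sketch, valid only for the ASCII letter range, not library code):

```csharp
static class AsciiTricks
{
    // For 'A'..'Z' / 'a'..'z' only bit 0x20 differs, so case operations are single-bit operations.
    public static byte ToUpper(byte b) =>
        (byte)(b >= (byte)'a' && b <= (byte)'z' ? b & ~0x20 : b);

    public static byte ToLower(byte b) =>
        (byte)(b >= (byte)'A' && b <= (byte)'Z' ? b | 0x20 : b);

    // Case-insensitive equality for two ASCII letters: ignore bit 0x20.
    public static bool LetterEqualsIgnoreCase(byte a, byte b) =>
        ((a ^ b) & ~0x20) == 0;
}
```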
The reason Mono has disabled fixed on strings is purely to help us locate the code paths that use fixed in as many places as possible and upgrade the code.
The idea is to perform a copy of the data on demand if necessary when a "fixed" that has not been upgraded is found, preserving compatibility at the expense of a string copy.
I once also tried to make the case that .NET was UCS-2 and not UTF-16 in front of the whole ECMA committee, only to find out that I was wrong :-)
@benaadams, I live in Europe and we use Latin-1. Consider the cultural impact of using Latin-1 from my European perspective. For the US, ASCII might seem like a slam dunk, but if you think about this with European eyes it's not.
Could a compiler switch be a good middle ground? Use ASCII as the default, but open up for Latin-1...
@HansOlavS I don't think Latin-1 would be great; I'm not thinking of it from a US perspective (I live in Europe also), more from the perspective of high-throughput interchange formats.
Latin-1 (ISO-8859-1) is also a superset of ASCII but requires conversion to and from String, UTF8 and Windows native. Only 6% of webpages use ISO-8859-1. It also doesn't actually cover all European languages; ISO-8859 has 16 variants for the fuller coverage.
Relatedly, you need to be using ISO-8859-7 (Latin/Greek), ISO-8859-15 (Latin-9) or ISO-8859-16 (Fantasy mix) to include the Euro symbol € as it's not included in Latin-1.
The Latin formats don't have the quick processing properties of ASCII, e.g. case-insensitive compare, uppercasing, lowercasing, and they require character-translation lookups; so at best they become a slightly more compressed format but with translation penalties.
It might be worth having a string8byte opaque intermediate format that supports string operations such as indexOf, split, etc., and then requires translation to String, UTF8 and ASCII (for subset validation).
However, I'm not sure that provides a great advantage over using Span<byte> or byte[] and the Encoder namespace?
Disclaimer: this is way out of my league... ;)
I get that ASCII aligns neatly with a lot of low-level protocols (like HTTP), but my concern is the memory footprint, and reducing it, when loading up my .NET app that contains a lot of Latin-1-compatible strings. It would be super-nice to have those stored as 8-bit chars in memory instead of the 16 bits .NET now uses.
If you're familiar with Delphi (Anders Hejlsberg's predecessor to C#), it had 'string' (which was an ANSI string, depending on your computer's localization settings or whatever) and 'widestring', which was Unicode (not sure of the actual encoding).
Actually, my very first reaction to C# and .NET (reading a C# book back in 2002) was that all strings were "unicode" and 16 bits. I thought that was a strange choice regarding memory consumption.
And regarding special cases like €, I don't care as that is an edge-case and I'm happy to have it stored in UTF16.
ASCII has other benefits: when marshaling out to UTF8 it requires no processing, it is a direct copy.
Anything that uses the top bit would prevent this optimization from working. That was one of the reasons for Mono's work to use ASCII.
I get the ASCII->UTF8 compatibility and marshaling and p/invoke and such. I guess I'm still a bit uneasy with .NET's decision to have all strings in memory as 16-bit UTF-16, and I miss the old days in Delphi where I could have an ANSI (Latin-1) string and a widestring and make a conscious decision on where to use what. But maybe this concern is totally beside the point in this thread and I misunderstood.
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.
The opened issue talks about compressing strings, and wasting half of the byte values just to be compatible with ASCII does not make sense to me. To check whether a Latin-1 character is valid ASCII, couldn't you simply check whether its byte is < 127?
Apart from Java, Python has done a similar thing to the one I was thinking of: http://legacy.python.org/dev/peps/pep-0393/
Characters that are not representable in UCS-2 would use UCS-4 (4 bytes).
If we were interested in being compatible with UTF-8 we could use UTF-8 directly, but it is not the most compact representation: many characters such as 'è' take 2 bytes, and the more complex ones could take up to 4 bytes!
I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?
Why not use a separate type?
@HansOlavS It would be possible to use Latin-1 as the compact encoding. Scanning would be just as fast because the second Unicode block (U+0080–U+00FF) corresponds to the upper range of Latin-1, as @fanol rightly pointed out. We could also do some clever hacks with the range U+0080–U+009F, which contains rarely used control characters; for example, encoding € as U+0091 (“Private Use 1”).
Some ASCII-only optimisations would no longer work. Marshalling would require an extra scan per string. And since Latin-1 has incomplete coverage for many languages, it might not make much of a difference; we should measure real-world apps.
I don’t like that this optimisation privileges some languages over others, but frankly it’s because the character encodings are already biased.
@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it's not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.
@zabulus
Why not use a separate type?
Sure you can use a separate type, that's what the UTF-8 string library is trying to achieve.
However this proposal was to see if something more wide-ranging and more deeply embedded in the CLR would be possible. With the side-effect that you don't have to make any code-changes to use it.
@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it’s not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.
I see one more concern about such strings besides unsafe/fixed: marshaling such strings with P/Invoke, especially in the case of using the new strings with the W(ide) versions of Win32 APIs.
As Latin-1 is a subset of UTF-16, the marshalling code, if it had to convert to a C/C++ wstring, would simply expand each byte by adding a zero high byte; one could probably make this faster using SSE intrinsics. If I'm not wrong, the algorithm described there for the ASCII part of a UTF-8 string would work to convert a Latin-1 string to UTF-16 too: https://woboq.com/blog/utf-8-processing-using-simd.html
It is obvious that in some cases this implementation will have a performance cost; it depends on whether it is more important to use less memory or to be faster...
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.
I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?
Characters 128–159 differ between the Windows-1252 variant and UTF-16.
To check whether a Latin-1 character is valid ASCII, couldn't you simply check whether its byte is < 127?
Then additionally in the fast-path algorithms you need to do a validation check per character, which makes them less fast-path... (you can't use a straight int, long or vector bitwise OR).
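To make the point concrete, here's a rough sketch (illustrative only, not framework code) of the kind of straight bitwise-OR fast path that works for ASCII validation but not for a per-character Latin-1 range check:

```csharp
using System;
using System.Runtime.InteropServices;

static class AsciiCheck
{
    public static bool IsAscii(ReadOnlySpan<byte> data)
    {
        ReadOnlySpan<ulong> words = MemoryMarshal.Cast<byte, ulong>(data);
        ulong bits = 0;
        foreach (ulong w in words)
            bits |= w;                              // accumulate, no branch per char

        // Handle the tail that didn't fill a whole ulong.
        for (int i = words.Length * sizeof(ulong); i < data.Length; i++)
            bits |= data[i];

        // Any top bit set in any byte => non-ASCII.
        return (bits & 0x8080_8080_8080_8080UL) == 0;
    }
}
```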
Sorry, I guess I didn't actually ask any questions when I proposed this. I guess I'm wondering: has changing the internal representation of String been considered before?
@fanol Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw). UTF-8 would be slightly more involved.
@mattwarren Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and the get_OffsetToStringData API used internally by fixed is less than ideal.
Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and the get_OffsetToStringData API used internally by fixed is less than ideal.
@evincarofautumn that's interesting to know, thanks
Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw).
Though not for the Windows-1252 variant due to the conflict in characters 128 - 159
OK, the Microsoft version of Latin-1, while more clever in its use of the only 256 character codes, is not suitable because it is not a subset of UTF-16, whereas the real Latin-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1) is, so it is the latter that should be used. Too bad for '€' and the other characters in the 128–159 block: strings containing them will not be compressible.
To answer @mattwarren's question on whether changing the internal representation of a string has been considered before, the short answer is YES. In fact it has been a pet desire of mine for probably over a decade now. What was clear then and has held true for quite some time is that:
Typical apps have 20% of their GC heap as strings. Most of the 16-bit characters have 0 in their upper byte. Thus you can save 10% of typical heaps by encoding in various ways that eliminate these pointless upper bytes.
Now the first issue you have is that while this saves GC heap size, it is not guaranteed to give a SPEED improvement. Let's just assume that the new representation is equally good at everything, but half the size. For things that fit in cache with the bigger size, you probably will get very little speed improvement. We know from work comparing 32 vs 64 bit applications that the GC heap expands by 1.6X when you go to 64 bit (pointers are bigger), and that results in a 10-15% performance loss. Thus we can compute that changing the representation of strings is likely to save 10% of the GC heap and thus is likely to save about 1.5 to 2.5% throughput. This is significant, but not dramatic. Keep in mind this assumes that all operations are equal.
Now it turns out that all operations are NOT equal, and thankfully the advantage is toward the NEW layout (if done well) most of the time. We know that the most common operations on strings include
Comparison, Copy/Substring, IndexOf, Hashing.
The good news is that all of these operations are SCANS, and thus if the encodings are SHORTER they are FASTER (fewer iterations). This bodes well.
What are the disadvantages? As you might suspect, compatibility is the big issue. There are two basic issues: the C# 'fixed' operator on strings, and interop paths that depend on the UTF-16 layout.
So the real thorny issue is this issue of 'fixed'. I believe the semantics of it are solvable (first, going forward you introduce a real operator, and for compat you simply recognize certain patterns that are emitted by our existing compilers and simply fail on the others).
However this only solves the correctness issue. The perf issue is still there. Most likely you solve this by introducing a string iterator, and ask people to move to it, and/or use JIT optimization to allow some patterns to simply work well (e.g. foreach over the string using integer indexes).
Because this is not perfect, we are likely to allow some compat flag where you can ask to keep things the old way (UTF16) for those who have performance issues because of it.
This probably requires a bit of an audit of where 'fixed' is being used and determining just how bad the proposed solution above is, so we can determine what the guidance will be for replacing 'fixed' in the important cases at least.
But you can see the broad outline above: changing strings will have a 10% improvement for most GC heap sizes and is likely to yield a 1.5-2.5% throughput improvement, and maybe more (because the code was heavy in string operations that benefit from the new format), but there will definitely be people who are impacted (those using fixed, and those with hot interop paths passing strings). The main challenge is dealing with fixed, but there are also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like StringBuilder, and all the uses of 'fixed' in corefx).
Thus it IS doable, but it is at least moderately expensive (man-months), and the payoff is non-trivial but not huge.
Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations (will we make a string iterator class? exactly what compiler magic will we have? ...). Also having microbenchmarks for important cases of 'fixed' (before and after optimization), as well as a microbenchmark for interop, would be useful.
Finally, I note that all of this just lets us CHOOSE what representation we want. We actually have a LOT of flexibility on exactly what the representation is. My recommendation is that we allow that to be flexible (it does not have to be UTF8 or ASCII with a type code). I am actually partial to a fixed Huffman compression applied at runtime (which allows it to work well on Chinese). However, let's set that aside for right now. We can start with one of the suggestions above and change to something else later with basically no impact.
I would love, love, love it if string were synonymous with IString<T> where T : byte or char or uint or similar, and not String; that would allow custom implementations of IString such as StringUtf8, StringAscii, StringUtf16, etc. Of course, use of p/invoke would require the signature to use the correct implementation of IString.
I for one would love to see UTF-8 everywhere.
The C# Fixed operator on strings
Um, developers should not assume that characters are the size of a char anyway, as UTF-16 supports characters made up of multiple char values. In fact, I'd say changing to the suggested model would help developers avoid a whole class of bugs due to very poor assumptions.
As for unsafe operations, the underlying array should remain accessible regardless of the impl, otherwise we will be throwing an immense amount of performance optimization out with the bath water.
Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations
Using wider types (long* on x64 and int* on x86) and Vector<ushort> over strings; these would also perform faster if the string were a more compact type.
Um, developers should not assume that characters are the size of a char anyway, as UTF-16 supports characters made up of multiple char values.
Normally a validate-to-ASCII step is performed, with a fallback to using the Encoding namespace (which accepts char*) when the data is not ASCII. Equally, there is pressure to keep ASCII as byte[] for as long as possible for performance, but then there is no interop with any function that accepts strings, so the type must be widened at some point and converted to string.
An example of widening with validation is the unrolled AsciiUtilities.TryGetAsciiString in Kestrel's internals.
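A simplified, non-unrolled sketch in the spirit of that helper (not the actual Kestrel code; the API shape here is assumed):

```csharp
using System;

static class AsciiDecode
{
    // Widen bytes to chars while validating; returns false so the caller can
    // fall back to the Encoding classes for non-ASCII input.
    public static bool TryGetAsciiString(ReadOnlySpan<byte> input, Span<char> output)
    {
        if (output.Length < input.Length)
            return false;

        byte seen = 0;
        for (int i = 0; i < input.Length; i++)
        {
            byte b = input[i];
            seen |= b;              // track whether any top bit was set
            output[i] = (char)b;    // widen
        }
        return (seen & 0x80) == 0;  // false => not ASCII, use an Encoding instead
    }
}
```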
@vancem thanks for taking the time to answer my query so comprehensively! There are a lot of trade-offs I hadn't even considered, particularly around the impact on fixed code.
It's nice to know that someone has been thinking about this and that it may, someday, make its way into the CLR.
but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of 'fixed' in corefx).
Yeah I can imagine; for fun I modified a version of the CoreCLR source so that String had the following layout:

```csharp
public sealed class String
{
    private int m_stringLength;
    private byte m_type;
    private char m_firstChar;
    ....
```

I struggled to get a "hello world" app to run; I got stuck fixing StringBuilder. The main problem is that when String is broken, nothing works; you don't even get meaningful error messages because they rely on String and/or StringBuilder working!
A systemic change like this has to be done very incrementally to retain any ability to debug the inevitable failures. It’s a difficult and disruptive feature to implement. But, if I might be so bold, it’s our responsibility as systems programmers and programming language implementors to do exactly this kind of hard work for the sake of others. I can’t name an industry in which an improvement of 10% on a highly valued metric would be considered marginal.
A systemic change like this has to be done very incrementally
Absolutely agree. A very good first step would be to design and implement an alternative expansion of the fixed statement for strings that is compatible with this scheme.
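For illustration, a hedged sketch of what one such expansion could look like (purely illustrative; Use is a placeholder, and this is the copy-on-pin strategy the Mono experiment and the comments above describe):

```csharp
static unsafe class FixedExpansionSketch
{
    // What `fixed (char* p = s) { Use(p, s.Length); }` might expand to
    // when the string is stored in a compact (8-bit) form.
    public static void Example(string s)
    {
        char[] temp = s.ToCharArray();   // materialise UTF-16 only when pinning is requested
        fixed (char* p = temp)
        {
            Use(p, s.Length);            // existing pointer-based code keeps working
        }
        // Strings are immutable, so no copy-back is needed.
    }

    static void Use(char* p, int length) { /* consume the pinned buffer */ }
}
```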
Yes, the first steps are to fix the string abstraction so that it is 'complete' (that is, there is only a modest amount of code that depends on its layout). That includes dealing with a scheme for handling 'fixed' on strings and fixing up runtime/class library code to use the new abstractions. Note that this work does NOT break anything and can be done incrementally. After that abstraction is in place, we are in a much better position to actually change the internal format.
A slight wrinkle with fixed expansion is: if an 8-bit string were changed to a 16-bit string during the time it was fixed (by writing to the high bytes), it would need reallocating as a 16-bit string when leaving the fixed statement.
My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that. The original data is never modified. You are not allowed to modify string data in using the pointer a 'fixed' statement gives you (thankfully).
You are not allowed to modify string data in using the pointer a 'fixed' statement gives you (thankfully).
You can currently; it's a pointer to the data with no copy, else fixed would be very slow (doesn't mean you should though).
Obviously trouble lies that way if the string has ever been used, as its flags such as isAscii (which are lazily evaluated) would go out of sync; or if it has been passed to anything, or is an embedded/interned string, when it will all go to hell...
While, as you say, preventing the mutation of strings via 'fixed' is not enforced, it is illegal (large parts of the system assume that strings are read-only, including the security system). If we found anyone doing this (within or outside Microsoft), we would tell them to fix their code.
There is also a limit to what we will do for compatibility's sake. Fixing legitimate uses of relatively rare but useful feature is one thing, making broken code work is another.
We should not spend effort trying to make the mutation of string case work. The problem is bad enough without it, and we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).
we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).
Seems fair :smile: And if you are doing it for speed you will likely change your code to work with the new way. Probably would want a string with a byte* ctor for stackalloc.
Will there be explicit conversions between types in the new string? string.AsAscii()? string.AsUtf8()? string.AsUtf16()?
Is there a way to know what type the string is? string.IsAscii()? string.IsUtf8()? string.IsUtf16()?
I don't think char can be one or two bytes depending on where it came from. Will there be a string.ToByteArray()?
@paulomorgado I think the idea here is to change the implementation details of string, not the public interface. I don't see why you would need any of those methods (except maybe for debugging, to ensure some specific string is compact).
And char is always two bytes, so for example, the indexer would be implemented something like this pseudocode:
```csharp
public char this[int index]
{
    get
    {
        if (type == LATIN1)
            return (char)bytePointer[index];
        else
            return charPointer[index];
    }
}
```
Converting string to a byte array can already be done using Encoding.ASCII.GetBytes(). I imagine that method would be optimized to just create a copy of the underlying byte array, when applicable.
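For example (usage stays as it is today; under the proposal this call could, in principle, become a straight copy of the backing bytes when the string is already stored as ASCII):

```csharp
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("HTTP/1.1 200 OK");
```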
@svick, I understood that and I think that's a good idea because it, as usual, will benefit already built applications.
However, if the change is made, future code might rely on knowing the difference.
Web APIs will almost exclusively use UTF-8 (and ASCII for some parts). On the other hand, Windows APIs will use UTF-16. What about mixed applications like Windows applications that use web APIs?
Let's take a feed reader Windows application as an example. All communications with the feed providers will use UTF-8, but all presentation will use UTF-16. Instead of relying on the runtime/framework to do the conversion, the Windows application developer might decide to handle all strings as UTF-16 (because that's what Windows uses) while the developer of the web API might choose to use UTF-8 strings (because that's what the web uses).
Is this such an odd or unusual scenario?
At least initially we would completely hide the internal details of the representation from the client of string (thus it will not know how the string is encoded). This proposal, at least initially, would NOT have multiple representations but would uniformly use UTF8 internally. Now maybe we would take that next step (of having multiple possible representations) but first things first. First we make it so that we don't HAVE to be UTF16.
No, the thread proposal was not to use UTF-8, which would take 3 bytes for a lot of characters that in UTF-16 are represented with 2, but to use ASCII or Latin-1 (1 byte) when possible and UTF-16 (2 bytes) for anything else.
If not, what do we achieve by using UTF-8?
@vancem You mention:
My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that.
I thought the whole point of the fixed statement was better performance, e.g. eliding range checks in hot paths. Copying the entire string to a new char array each time would seem to go against that.
I know I'm a little bit late to the party, but here are my thoughts as someone who has written code for String:
- The string(char[]) constructor would get a lot slower.
- Every operation would need an extra check on m_type. This may also prevent String methods from being inlined, or if they are it will introduce more jitted code bloat.
I generally agree, though, that the memory savings we can gain from UTF-8 strings are desirable. But perhaps we should take a look in the other direction: a dedicated Utf8String type, so you could write
var utf8 = u8"Hello, world!";
to get the memory benefits.
Apps that benefit significantly from using UTF-8 strings in string processing code can do so; the rest of the time they can stick to UTF-16.
@jamesqo
+1. And one more point: corefxlab has a low-memory solution for dynamic string construction, System.Text.Formatting. Note that this one covers not only memory consumption for the final strings but allocations and overall performance of operations over strings too (which is much more important for a generic web site/service app).
I don't think there's a reason to keep both.
@jamesqo I'm not sure using Utf8String would help much.
Regarding the number of conversions:
- If the input is char[], then you have to convert whether you're using a compact string or Utf8String.
- If the input is string, then you have to convert if you're using Utf8String, but not if you're using a compact string.
As for the amount of code, a lot of that additional code would still be required even with Utf8String, just maybe placed somewhere else.
Background
I came across this idea after seeing that Java has implemented it, see JEP 254 for the full details, plus this InfoQ article for some extra info.
I'm aware of the CoreFX Labs effort to implement a UTF-8 string library, but this proposal doesn't require using a different library, it changes the implementation of all strings in the CoreCLR (a bit ambitious I know!!)
Note: this optimisation works because many types of strings in web-based applications can actually be encoded as ASCII or ISO-8859-1 (Latin-1), for instance URLs, HTTP headers, etc. So even with internationalised applications there are still some possible savings.
Implementation
One way of implementing this is to change the internal string layout to the following, with an extra m_type field (sketched below). Assuming that there are no padding implications for adding a field like this?
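A sketch of such a layout (mirroring the experimental layout shown earlier in the thread; field names are illustrative):

```csharp
public sealed class String
{
    private int  m_stringLength;
    private byte m_type;        // e.g. LATIN1/ASCII vs UTF16
    private char m_firstChar;   // first element of the inlined character data
    // ...
}
```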
Another approach is to have an abstract String base class and then use either a UTF16 or Latin1 implementation via virtual dispatch, but this may have performance implications.
Then the (pseudo) code for the equals method becomes:
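A hedged sketch, in the same pseudocode style as the indexer example above (CompareBytes, CompareChars and CompareWidening are hypothetical helpers):

```csharp
private static bool EqualsHelper(String a, String b)
{
    if (a.m_stringLength != b.m_stringLength)
        return false;

    if (a.m_type == LATIN1 && b.m_type == LATIN1)
        return CompareBytes(a, b);      // 1 byte per character
    if (a.m_type == UTF16 && b.m_type == UTF16)
        return CompareChars(a, b);      // 2 bytes per character

    return CompareWidening(a, b);       // mixed: widen the compact operand while comparing
}
```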
Obviously I know this is not a small undertaking; it would require changes to the String and StringBuilder classes, plus several others. Also, the JIT or intrinsics would have to be involved to ensure that all the methods that require an extra if (type == LATIN1) check are as fast as possible (length, indexOf, etc).
Advantages
Disadvantages
For strings that can't be encoded as ASCII or ISO-8859-1 (Latin-1) this will have an overhead for no gain.
Research
This feature is not worth doing if it doesn't actually save space for common types of .NET applications. So that this can be measured, I wrote HeapStringAnalyser, which makes use of CLR MD (see Analysing .NET Memory Dumps with CLR MD for more info). @nickcraver from Stack Overflow was kind enough to run my tool against one of their memory dumps and it gave the following output:
In this case the String data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.
Update: there's a really thorough write-up by @evincarofautumn, who's actually implemented this suggestion against the Mono code-base.
data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.Update: there's a really thorough write-up by @evincarofautumn who's actually implemented this suggestion against the Mono code-base