mattwarren opened this issue 8 years ago
It was updated to ASCII rather than ISO-8859-1 in RFC 7230 §3.2.4 (Field Parsing).
Latin-1 conflicts with UTF-8 whereas ASCII is compatible (so you can cross-detect between the two).
Interestingly the current string format does already have an IsAscii flag on it, though it considers ' and - to not be ASCII :frowning:
However changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to p/invoke etc).
ASCII strings are something that @evincarofautumn experimented with a lot for Mono: the proposal is here http://www.mono-project.com/docs/advanced/runtime/docs/ascii-strings/ (and there are a few more details in the mono-devel list thread about it).
I'm sure he can go into great detail about the pros/cons 😄
However changing the string to be 8-bit rather than 16-bit based on this flag would likely break fixed statements on strings without special handling (and passing to p/invoke etc)
Hmm, I didn't think about compatibility with code that uses fixed against strings, yeah that's a problem. Looks like the Mono experiment solved it by disabling fixed on strings.
Looks like the mono experiment solved it by disabling fixed on strings.
Could always stackalloc length * 2 bytes at fixed start, then copy and widen with _mm_unpacklo_epi8 & _mm_unpackhi_epi8, then compress back when the fixed block closes. Would be slow though...
Or allow both fixed byte* and fixed char*
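For illustration, here's a minimal sketch (not runtime code, and the helper name is made up) of the widen step being described, using System.Numerics.Vector in place of the raw _mm_unpacklo_epi8/_mm_unpackhi_epi8 intrinsics:

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

static class AsciiWidening
{
    // Widen 8-bit (ASCII) data to 16-bit UTF-16 code units.
    // Assumes utf16.Length >= ascii.Length.
    public static void Widen(ReadOnlySpan<byte> ascii, Span<char> utf16)
    {
        Span<ushort> dest = MemoryMarshal.Cast<char, ushort>(utf16);
        int i = 0;
        for (; i + Vector<byte>.Count <= ascii.Length; i += Vector<byte>.Count)
        {
            var bytes = new Vector<byte>(ascii.Slice(i));
            // Equivalent to the unpack-lo/unpack-hi pair: interleave with zero bytes.
            Vector.Widen(bytes, out Vector<ushort> lo, out Vector<ushort> hi);
            lo.CopyTo(dest.Slice(i));
            hi.CopyTo(dest.Slice(i + Vector<ushort>.Count));
        }
        for (; i < ascii.Length; i++)   // scalar tail
            dest[i] = ascii[i];
    }
}
```

The reverse "compress back" step on leaving the fixed region would use Vector.Narrow, and would only be valid if no non-ASCII data had been written in the meantime.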
Why use ASCII instead of Latin-1 for the 1-byte version? Latin-1 would permit using one byte for a lot of the accented characters of Western Europe, such as 'è', 'ò', 'ù'...
Could it make sense to have a third type at this point, UTF-32, for strings containing characters that cannot be represented in UTF-16 (Chinese characters, Klingon, emoticons, ...) without resorting to the kludge of surrogate pairs (which are not really supported in .NET [1])? It probably would not accomplish much compression in this case, but the implementation could be more correct.
[1] De facto, .NET strings are really UCS-2, not UTF-16.
Why use ASCII instead of Latin 1
ASCII and UTF8 are compatible; Latin-1 and UTF8 are not, so conversions between the two are required.
UTF8 is just as compact for the non-accented characters and is currently the general interchange format, so using Latin-1 would add conversion load and allocation to and from UTF8, whereas ASCII requires no conversion.
Also, using Latin-1 in places where only ASCII is allowed (HTTP headers etc.) adds a validation step to and from. There are fast ASCII algorithms that don't apply to Latin-1, e.g. case-insensitive compare (ignore a single bit), upper and lower casing (flip a bit), etc.
For example, if a string is known to be ASCII (1 byte per char) it requires no validation steps and its bytes can be used directly; conversion to UTF8 is instant, as its bytes are already also valid UTF8.
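As a concrete illustration of those bit tricks (just a sketch, valid only for the ASCII letter range, not library code):

```csharp
static class AsciiTricks
{
    // For 'A'..'Z' / 'a'..'z' only bit 0x20 differs, so case operations are single-bit operations.
    public static byte ToUpper(byte b) =>
        (byte)(b >= (byte)'a' && b <= (byte)'z' ? b & ~0x20 : b);

    public static byte ToLower(byte b) =>
        (byte)(b >= (byte)'A' && b <= (byte)'Z' ? b | 0x20 : b);

    // Case-insensitive equality for two ASCII letters: ignore bit 0x20.
    public static bool LetterEqualsIgnoreCase(byte a, byte b) =>
        ((a ^ b) & ~0x20) == 0;
}
```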
The reason Mono has disabled fixed on strings is purely to help us locate the code paths that use fixed in as many places as possible and upgrade the code.
The idea is to perform a copy of the data on demand if necessary when a "fixed" that has not been upgraded is found, preserving compatibility at the expense of a string copy.
I once also tried to make the case that .NET was UCS-2 and not UTF-16 in front of the whole ECMA committee, only to find out that I was wrong :-)
@benaadams, I live in Europe and we use Latin-1. Consider the cultural impact of using Latin-1 from my European perspective. For the US, ASCII might seem like a slam dunk, but if you think about this with European eyes it's not.
Could a compiler switch be a good middle ground? Use ASCII as the default, but open up for Latin-1...
@HansOlavS I don't think Latin-1 would be great; I'm not thinking of it from a US perspective (I live in Europe also), more from the perspective of high-throughput interchange formats.
Latin-1 (ISO-8859-1) is also a superset of ASCII but requires conversion to and from String, UTF8 and Windows native. Only 6% of webpages use ISO-8859-1. It also doesn't actually cover all European languages; ISO-8859 has 16 variants for the fuller coverage.
Relatedly, you need to be using ISO-8859-7 (Latin/Greek), ISO-8859-15 (Latin-9) or ISO-8859-16 (Fantasy mix) to include the Euro symbol € as it's not included in Latin-1.
The Latin formats don't have the quick processing properties of ASCII, e.g. case-insensitive compare, uppercasing, lowercasing, and they require character-translation lookups; so at best they become a slightly more compressed format but with translation penalties.
It might be worth having a string8byte opaque intermediate format that supports string operations such as indexOf, split, etc., and then requires translation to String, UTF8 and ASCII (for subset validation).
However, I'm not sure that provides a great advantage over using Span<byte> or byte[] and the Encoder namespace?
Disclaimer: this is way out of my league... ;)
I get that ASCII aligns neatly with a lot of low-level protocols (like HTTP), but my concern is the memory footprint, and reducing it, when loading up my .NET app that contains a lot of Latin-1-compatible strings. It would be super-nice to have those stored as 8-bit chars in memory instead of the 16 bits .NET now uses.
If you're familiar with Delphi (Anders Hejlsberg's predecessor to C#), it had 'string' (which was an ANSI string, depending on your computer's localization settings or whatever) and 'widestring', which was Unicode (not sure of the actual encoding).
Actually, my very first reaction to C# and .NET (reading a C# book back in 2002) was that all strings were "unicode" and 16 bits. I thought that was a strange choice regarding memory consumption.
And regarding special cases like €, I don't care as that is an edge-case and I'm happy to have it stored in UTF16.
ASCII has other benefits: when marshaling out to UTF8 it requires no processing, it is a direct copy.
Anything that uses the top bit would prevent this optimization from working. That was one of the reasons for Mono's work to use ASCII.
I get the ASCII->UTF8 compatibility and marshaling and p/invoke and such. I guess I'm still a bit uneasy with .NET's decision to have all strings in memory as 16-bit UTF-16, and I miss the old days in Delphi where I could have an ANSI (Latin-1) string and a widestring and make a conscious decision on where to use what. But maybe this concern is totally beside the point in this thread and I misunderstood.
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.
The opened issue talks about compressing strings, and wasting half of the byte values just to be compatible with ASCII does not make sense to me. To check whether a Latin-1 character is valid ASCII, couldn't you simply check whether its byte is < 127?
Apart from Java, Python has done a similar thing to the one I was thinking of: http://legacy.python.org/dev/peps/pep-0393/
Characters that are not representable in UCS-2 would use UCS-4 (4 bytes).
If we were interested in being compatible with UTF-8 we could use UTF-8 directly, but it is not the most compact representation: many characters such as 'è' take 2 bytes, and the more complex ones could take up to 4 bytes!
I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?
Why not use a separate type?
@HansOlavS It would be possible to use Latin-1 as the compact encoding. Scanning would be just as fast because the second Unicode block (U+0080–U+00FF) corresponds to the upper range of Latin-1, as @fanol rightly pointed out. We could also do some clever hacks with the range U+0080–U+009F, which contains rarely used control characters; for example, encoding € as U+0091 (“Private Use 1”).
Some ASCII-only optimisations would no longer work. Marshalling would require an extra scan per string. And since Latin-1 has incomplete coverage for many languages, it might not make much of a difference; we should measure real-world apps.
I don’t like that this optimisation privileges some languages over others, but frankly it’s because the character encodings are already biased.
@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it's not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.
@zabulus
Why not use a separate type?
Sure you can use a separate type, that's what the UTF-8 string library is trying to achieve.
However this proposal was to see if something more wide-ranging and more deeply embedded in the CLR would be possible. With the side-effect that you don't have to make any code-changes to use it.
@zabulus All the existing APIs are using String, so if we change the representation of String then existing code will benefit without modification, as long as it’s not using unsafe/fixed. If we add a new type, only new code will be able to take advantage of the optimisation.
I see one more concern about such strings besides unsafe/fixed: marshaling such strings with P/Invoke, especially in the case of using the new strings with the W(ide) versions of Win32 APIs.
As Latin-1 is a subset of UTF-16, the marshalling code, if it had to convert to a C/C++ wstring, would simply expand each byte by adding a zero high byte; one could probably make this faster using SSE intrinsics. If I'm not wrong, the algorithm described there for the ASCII part of a UTF-8 string would work to convert a Latin-1 string to UTF-16 too: https://woboq.com/blog/utf-8-processing-using-simd.html
It is obvious that in some cases this implementation will have a performance cost; it depends on whether it is more important to use less memory or to be faster...
By Latin-1 I meant the "proper" implementation done by Microsoft: https://en.wikipedia.org/wiki/Windows-1252.
I was pretty sure that characters between 128 and 255 were the same between Latin-1 and UTF-16. Am I wrong?
Characters 128–159 differ between the Windows-1252 variant and UTF-16.
To check whether a Latin-1 character is valid ASCII, couldn't you simply check whether its byte is < 127?
Then additionally in the fast-path algorithms you need to do a validation check per character, which makes them less fast-path... (you can't use a straight int, long or vector bitwise OR).
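To make the point concrete, here's a rough sketch (illustrative only, not framework code) of the kind of straight bitwise-OR fast path that works for ASCII validation but not for a per-character Latin-1 range check:

```csharp
using System;
using System.Runtime.InteropServices;

static class AsciiCheck
{
    public static bool IsAscii(ReadOnlySpan<byte> data)
    {
        ReadOnlySpan<ulong> words = MemoryMarshal.Cast<byte, ulong>(data);
        ulong bits = 0;
        foreach (ulong w in words)
            bits |= w;                              // accumulate, no branch per char

        // Handle the tail that didn't fill a whole ulong.
        for (int i = words.Length * sizeof(ulong); i < data.Length; i++)
            bits |= data[i];

        // Any top bit set in any byte => non-ASCII.
        return (bits & 0x8080_8080_8080_8080UL) == 0;
    }
}
```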
Sorry, I guess I didn't actually ask any questions when I proposed this. I guess I'm wondering: has changing the internal representation of String been considered before?
@fanol Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw). UTF-8 would be slightly more involved.
@mattwarren Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and the get_OffsetToStringData API used internally by fixed is less than ideal.
Around the time we had a discussion on mono-devel-list about my proposal, some MS people (@vancem) indicated that they've thought about doing this in the past. The objections were mainly that it's a large, invasive change (we don't want to break stability & backward compatibility) and the get_OffsetToStringData API used internally by fixed is less than ideal.
@evincarofautumn that's interesting to know, thanks
Right, marshalling to UTF-16 would still be cheap. You could do it with ~2 instructions for each 8 bytes of input (movq + punpcklbw).
Though not for the Windows-1252 variant due to the conflict in characters 128 - 159
OK, the Microsoft version of Latin-1, while more clever in its use of the only 256 character codes, is not suitable because it is not a subset of UTF-16, whereas the real Latin-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-1) is, so it is the latter that should be used. Too bad for '€' and the other characters in the 128–159 block: strings containing them will not be compressible.
To answer @mattwarren's question on whether changing the internal representation of a string has been considered before, the short answer is YES. In fact it has been a pet desire of mine for probably over a decade now. What was clear then and has held true for quite some time is that:
Typical apps have 20% of their GC heap as strings. Most of the 16-bit characters have 0 in their upper byte. Thus you can save 10% of typical heaps by encoding in various ways that eliminate these pointless upper bytes.
Now the first issue you have is that while this saves GC heap size, it is not guaranteed to give a SPEED improvement. Let's just assume that the new representation is equally good at everything, but half the size. For things that fit in cache with the bigger size, you probably will get very little speed improvement. We know from work comparing 32 vs 64 bit applications that the GC heap expands by 1.6X when you go to 64 bit (pointers are bigger), and that results in a 10-15% performance loss. Thus we can compute that changing the representation of strings is likely to save 10% of the GC heap and thus is likely to save about 1.5 to 2.5% throughput. This is significant, but not dramatic. Keep in mind this assumes that all operations are equal.
Now it turns out that all operations are NOT equal, and thankfully the advantage is toward the NEW layout (if done well) most of the time. We know that the most common operations on strings include
Comparison, Copy/Substring, IndexOf, Hashing.
The good news is that all of these operations are SCANS, and thus if the encodings are SHORTER they are FASTER (fewer iterations). This bodes well.
What are the disadvantages? As you might suspect, compatibility is the big issue. There are two basic issues: the C# 'fixed' operator on strings, and interop paths that depend on the UTF-16 layout.
So the real thorny issue is this issue of 'fixed'. I believe the semantics of it are solvable (first, going forward you introduce a real operator, and for compat you simply recognize certain patterns that are emitted by our existing compilers and simply fail on the others).
However this only solves the correctness issue. The perf issue is still there. Most likely you solve this by introducing a string iterator, and ask people to move to it, and/or use JIT optimization to allow some patterns to simply work well (e.g. foreach over the string using integer indexes).
Because this is not perfect, we are likely to allow some compat flag where you can ask to keep things the old way (UTF16) for those who have performance issues because of it.
This probably requires a bit of an audit of where 'fixed' is being used and determining just how bad the proposed solution above is, so we can determine what the guidance will be for replacing 'fixed' in the important cases at least.
But you can see the broad outline above: changing strings will have a 10% improvement for most GC heap sizes and is likely to yield a 1.5-2.5% throughput improvement, and maybe more (because the code was heavy in string operations that benefit from the new format), but there will definitely be people who are impacted (those using fixed, and those with hot interop paths passing strings). The main challenge is dealing with fixed, but there are also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like StringBuilder, and all the uses of 'fixed' in corefx).
Thus it IS doable, but it is at least moderately expensive (man-months), and the payoff is non-trivial but not huge.
Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations (will we make a string iterator class? exactly what compiler magic will we have? ...). Also having microbenchmarks for important cases of 'fixed' (before and after optimization), as well as a microbenchmark for interop, would be useful.
Finally, I note that all of this just lets us CHOOSE what representation we want. We actually have a LOT of flexibility on exactly what the representation is. My recommendation is that we allow that to be flexible (it does not have to be UTF8 or ASCII with a type code). I am actually partial to a fixed Huffman compression applied at runtime (which allows it to work well on Chinese). However, let's set that aside for right now. We can start with one of the suggestions above and change to something else later with basically no impact.
I would love, love, love it if string were synonymous with IString<T> where T : byte or char or uint or similar, and not String; that would allow custom implementations of IString such as StringUtf8, StringAscii, StringUtf16, etc. Of course, use of p/invoke would require the signature to use the correct implementation of IString.
I for one would love to see UTF-8 everywhere.
The C# Fixed operator on strings
Um, developers should not assume that characters are the size of a char anyway, as UTF-16 supports characters made up of multiple char values. In fact, I'd say changing to the suggested model would help developers avoid a whole class of bugs due to very poor assumptions.
As for unsafe operations, the underlying array should remain accessible regardless of the impl, otherwise we will be throwing an immense amount of performance optimization out with the bath water.
Given that, I think the next steps would be to do a good inventory of where 'fixed' is used and design the mitigations
Using wider types (long* on x64 and int* on x86) and Vector<ushort> over strings; these would also perform faster if the string were a more compact type.
Um, developers should not assume that characters are the size of a char anyway, as UTF-16 supports characters made up of multiple char values.
Normally a validate-to-ASCII step is performed, with a fallback to using the Encoding namespace (which accepts char*) when the data is not ASCII. Equally, there is pressure to keep ASCII as byte[] for as long as possible for performance, but then there is no interop with any function that accepts strings, so the type must be widened at some point and converted to string.
An example of widening with validation is the unrolled AsciiUtilities.TryGetAsciiString in Kestrel's internals.
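A simplified, non-unrolled sketch in the spirit of that helper (not the actual Kestrel code; the API shape here is assumed):

```csharp
using System;

static class AsciiDecode
{
    // Widen bytes to chars while validating; returns false so the caller can
    // fall back to the Encoding classes for non-ASCII input.
    public static bool TryGetAsciiString(ReadOnlySpan<byte> input, Span<char> output)
    {
        if (output.Length < input.Length)
            return false;

        byte seen = 0;
        for (int i = 0; i < input.Length; i++)
        {
            byte b = input[i];
            seen |= b;              // track whether any top bit was set
            output[i] = (char)b;    // widen
        }
        return (seen & 0x80) == 0;  // false => not ASCII, use an Encoding instead
    }
}
```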
@vancem thanks for taking the time to answer my query so comprehensively! There are a lot of trade-offs I hadn't even considered, particularly around the impact on fixed code.
It's nice to know that someone has been thinking about this and that it may, someday, make its way into the CLR.
but there is also frankly at least a few man-months of simply dealing with the places in the runtime where we took a dependency on the layout of string (in the runtime, interop, and things like stringbuilder, and all the uses of 'fixed' in corefx).
Yeah I can imagine; for fun I modified a version of the CoreCLR source so that String had the following layout:

```csharp
public sealed class String
{
    private int m_stringLength;
    private byte m_type;
    private char m_firstChar;
    ....
```

I struggled to get a "hello world" app to run; I got stuck fixing StringBuilder. The main problem is that when String is broken, nothing works; you don't even get meaningful error messages because they rely on String and/or StringBuilder working!
A systemic change like this has to be done very incrementally to retain any ability to debug the inevitable failures. It’s a difficult and disruptive feature to implement. But, if I might be so bold, it’s our responsibility as systems programmers and programming language implementors to do exactly this kind of hard work for the sake of others. I can’t name an industry in which an improvement of 10% on a highly valued metric would be considered marginal.
A systemic change like this has to be done very incrementally
Absolutely agree. A very good first step would be to design and implement an alternative expansion of the fixed statement for strings that is compatible with this scheme.
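For illustration, a hedged sketch of what one such expansion could look like (purely illustrative; Use is a placeholder, and this is the copy-on-pin strategy the Mono experiment and the comments above describe):

```csharp
static unsafe class FixedExpansionSketch
{
    // What `fixed (char* p = s) { Use(p, s.Length); }` might expand to
    // when the string is stored in a compact (8-bit) form.
    public static void Example(string s)
    {
        char[] temp = s.ToCharArray();   // materialise UTF-16 only when pinning is requested
        fixed (char* p = temp)
        {
            Use(p, s.Length);            // existing pointer-based code keeps working
        }
        // Strings are immutable, so no copy-back is needed.
    }

    static void Use(char* p, int length) { /* consume the pinned buffer */ }
}
```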
Yes, the first steps are to fix the string abstraction so that it is 'complete' (that is, there is only a modest amount of code that depends on its layout). That includes dealing with a scheme for handling 'fixed' on strings and fixing up runtime/class library code to use the new abstractions. Note that this work does NOT break anything and can be done incrementally. After that abstraction is in place, we are in a much better position to actually change the internal format.
A slight wrinkle with fixed expansion is: if an 8-bit string were changed to a 16-bit string during the time it was fixed (by writing to the high bytes), it would need reallocating as a 16-bit string when leaving the fixed statement.
My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that. The original data is never modified. You are not allowed to modify string data in using the pointer a 'fixed' statement gives you (thankfully).
You are not allowed to modify string data in using the pointer a 'fixed' statement gives you (thankfully).
You can currently; it's a pointer to the data with no copy, else fixed would be very slow (doesn't mean you should though).
Obviously trouble lies that way if the string has ever been used, as its flags such as isAscii (which are lazily evaluated) would go out of sync; or if it has been passed to anything, or is an embedded/interned string, when it will all go to hell...
While, as you say, preventing the mutation of strings via 'fixed' is not enforced, it is illegal (large parts of the system assume that strings are read-only, including the security system). If we found anyone doing this (within or outside Microsoft), we would tell them to fix their code.
There is also a limit to what we will do for compatibility's sake. Fixing legitimate uses of relatively rare but useful feature is one thing, making broken code work is another.
We should not spend effort trying to make the mutation of string case work. The problem is bad enough without it, and we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).
we will always have the 'compat hammer' of allowing users with evil code to use the old string layout (globally).
Seems fair :smile: And if you are doing it for speed you will likely change your code to work with the new way. Probably would want a string with a byte* ctor for stackalloc.
Will there be explicit conversions between types in the new string? string.AsAscii()? string.AsUtf8()? string.AsUtf16()?
Is there a way to know what type the string is? string.IsAscii()? string.IsUtf8()? string.IsUtf16()?
I don't think char can be one or two bytes depending on where it came from. Will there be a string.ToByteArray()?
@paulomorgado I think the idea here is to change the implementation details of string, not the public interface. I don't see why you would need any of those methods (except maybe for debugging, to ensure some specific string is compact).
And char is always two bytes, so for example, the indexer would be implemented something like this pseudocode:
```csharp
public char this[int index]
{
    get
    {
        if (type == LATIN1)
            return (char)bytePointer[index];
        else
            return charPointer[index];
    }
}
```
Converting string to a byte array can already be done using Encoding.ASCII.GetBytes(). I imagine that method would be optimized to just create a copy of the underlying byte array, when applicable.
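For example (usage stays as it is today; under the proposal this call could, in principle, become a straight copy of the backing bytes when the string is already stored as ASCII):

```csharp
byte[] bytes = System.Text.Encoding.ASCII.GetBytes("HTTP/1.1 200 OK");
```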
@svick, I understood that and I think that's a good idea because it, as usual, will benefit already built applications.
However, if the change is made, future code might rely on knowing the difference.
Web APIs will almost exclusively use UTF-8 (and ASCII for some parts). On the other hand, Windows APIs will use UTF-16. What about mixed applications like Windows applications that use web APIs?
Let's take a feed reader Windows application as an example. All communications with the feed providers will use UTF-8, but all presentation will use UTF-16. Instead of relying on the runtime/framework to do the conversion, the Windows application developer might decide to handle all strings as UTF-16 (because that's what Windows uses) while the developer of the web API might choose to use UTF-8 strings (because that's what the web uses).
Is this such an odd or unusual scenario?
At least initially we would completely hide the internal details of the representation from the client of string (thus it will not know how the string is encoded). This proposal, at least initially, would NOT have multiple representations but would uniformly use UTF8 internally. Now maybe we would take that next step (of having multiple possible representations) but first things first. First we make it so that we don't HAVE to be UTF16.
No, the thread proposal was not to use UTF-8, which would take 3 bytes for a lot of characters that in UTF-16 are represented with 2, but to use ASCII or Latin-1 (1 byte) when possible and UTF-16 (2 bytes) for anything else.
If not, what do we achieve by using UTF-8?
@vancem You mention:
My expectation of how 'fixed' works is that it unconditionally copies the string data into an array of char and points it at that.
I thought the whole point of the fixed statement was better performance, e.g. eliding range checks in hot paths. Copying the entire string to a new char array each time would seem to go against that.
I know I'm a little bit late to the party, but here are my thoughts as someone who has written code for String:
- The string(char[]) constructor would get a lot slower.
- Every operation would need an extra check on m_type. This may also prevent String methods from being inlined, or if they are it will introduce more jitted code bloat.
I generally agree, though, that the memory savings we can gain from UTF-8 strings are desirable. But perhaps we should take a look in the other direction: a dedicated Utf8String type, so you could write
var utf8 = u8"Hello, world!";
to get the memory benefits.
Apps that benefit significantly from using UTF-8 strings in string processing code can do so; the rest of the time they can stick to UTF-16.
@jamesqo
+1. And one more point: corefxlab has a low-memory solution for dynamic string construction, System.Text.Formatting. Note that this one covers not only memory consumption for the final strings but allocations and overall performance of operations over strings too (which is much more important for a generic web site/service app).
I don't think there's a reason to keep both.
@jamesqo I'm not sure using Utf8String would help much.
Regarding the number of conversions:
- If the input is char[], then you have to convert whether you're using a compact string or Utf8String.
- If the input is string, then you have to convert if you're using Utf8String, but not if you're using a compact string.
As for the amount of code, a lot of that additional code would still be required even with Utf8String, just maybe placed somewhere else.
Background
I came across this idea after seeing that Java has implemented it, see JEP 254 for the full details, plus this InfoQ article for some extra info.
I'm aware of the CoreFX Labs effort to implement a UTF-8 string library, but this proposal doesn't require using a different library, it changes the implementation of all strings in the CoreCLR (a bit ambitious I know!!)
Note: this optimisation works because many types of strings in web-based applications can actually be encoded as ASCII or ISO-8859-1 (Latin-1), for instance URLs, HTTP headers, etc. So even with internationalised applications there are still some possible savings.
Implementation
One way of implementing this is to change the internal string layout to the following, with an extra m_type field (sketched below). Assuming that there are no padding implications for adding a field like this?
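A sketch of such a layout (mirroring the experimental layout shown earlier in the thread; field names are illustrative):

```csharp
public sealed class String
{
    private int  m_stringLength;
    private byte m_type;        // e.g. LATIN1/ASCII vs UTF16
    private char m_firstChar;   // first element of the inlined character data
    // ...
}
```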
Another approach is to have an abstract String base class and then use either a UTF16 or Latin1 implementation via virtual dispatch, but this may have performance implications.
Then the (pseudo) code for the equals method becomes:
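A hedged sketch, in the same pseudocode style as the indexer example above (CompareBytes, CompareChars and CompareWidening are hypothetical helpers):

```csharp
private static bool EqualsHelper(String a, String b)
{
    if (a.m_stringLength != b.m_stringLength)
        return false;

    if (a.m_type == LATIN1 && b.m_type == LATIN1)
        return CompareBytes(a, b);      // 1 byte per character
    if (a.m_type == UTF16 && b.m_type == UTF16)
        return CompareChars(a, b);      // 2 bytes per character

    return CompareWidening(a, b);       // mixed: widen the compact operand while comparing
}
```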
Obviously I know this is not a small undertaking; it would require changes to the String and StringBuilder classes, plus several others. Also, the JIT or intrinsics would have to be involved to ensure that all the methods that require an extra if (type == LATIN1) check are as fast as possible (length, indexOf, etc).
Advantages
Disadvantages
For strings that can't be encoded as ASCII or ISO-8859-1 (Latin-1) this will have an overhead for no gain.
Research
This feature is not worth doing if it doesn't actually save space for common types of .NET applications. So that this can be measured, I wrote HeapStringAnalyser, which makes use of CLR MD (see Analysing .NET Memory Dumps with CLR MD for more info). @nickcraver from Stack Overflow was kind enough to run my tool against one of their memory dumps and it gave the following output:
In this case the String data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.
Update: there's a really thorough write-up by @evincarofautumn, who's actually implemented this suggestion against the Mono code-base.
data could be compressed from ~3,358 MB down to ~1,789 MB, a pretty impressive saving! But this test would have to be repeated on a far wider range of apps to see if the memory savings are repeated.Update: there's a really thorough write-up by @evincarofautumn who's actually implemented this suggestion against the Mono code-base