PEP 393 - Flexible String Representation

Might be a pain, but instead of diverging from the CPython implementation throughout the codebase to add special handling just because our strings are UTF-16, it seems like it would be preferable to introduce a new type PythonString (or something) which behaves a bit more like Python strings. This class would basically be a wrapper for System.String, but when created with a wide unicode character (e.g. \U00010000) the internal representation would switch from a string to a uint array or something of the sort.

class PythonString {
    string utf16;
    uint[] utf32;
}

We could have implicit casts to/from string for compat (obviously not cheap if the underlying string is utf-32 but this could be cached in the instance). APIs where it matters should be updated to use this new type.
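
To make the idea concrete, here is a minimal Python model of such a wrapper. It is only a sketch under assumptions: the name PythonStringSketch and the method to_utf16 are hypothetical, plain lists of integers stand in for System.String and the uint array, and the representation is chosen eagerly at construction time.

```python
class PythonStringSketch:
    """Toy model only: lists of ints stand in for System.String / uint[]."""

    def __init__(self, code_points):
        code_points = list(code_points)
        if any(cp > 0xFFFF for cp in code_points):
            # A wide character is present: keep one entry per code point.
            self.utf16 = None  # filled lazily by to_utf16()
            self.utf32 = code_points
        else:
            # Every code point fits in a single UTF-16 code unit.
            self.utf16 = code_points
            self.utf32 = None

    def __len__(self):
        # Length as CPython reports it: number of code points.
        return len(self.utf32 if self.utf32 is not None else self.utf16)

    def to_utf16(self):
        """Model of the cast to string: convert once, then reuse the cache."""
        if self.utf16 is None:
            units = []
            for cp in self.utf32:
                if cp > 0xFFFF:
                    cp -= 0x10000
                    units.append(0xD800 + (cp >> 10))    # high surrogate
                    units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
                else:
                    units.append(cp)
            self.utf16 = units
        return self.utf16


wide = PythonStringSketch([ord("a"), ord("b"), ord("c"), 0x10000])
narrow = PythonStringSketch([ord("a"), ord("b"), ord("c")])
```

In this model len(wide) is 4 even though its UTF-16 form has 5 code units, and a second call to wide.to_utf16() returns the cached list rather than converting again.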

Originally posted by @slozier in https://github.com/IronLanguages/ironpython3/pull/680#issuecomment-554691502

The idea of a UTF-32-aware string type seems interesting, but it comes with its own challenges. If it means introducing a separate type PythonString next to the standard str, it will not address the compatibility issues in the Python stdlib, 3rd-party libraries, or existing Python code. If it means making the current str type handle non-BMP characters transparently by internally switching to a UTF-32 representation (perhaps the way int handles large integers), it can put considerable strain on performance during .NET interoperability, as every string returned from a non-Python .NET call will have to be scanned for surrogate pairs. It is not clear to me which option you were suggesting.
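
The scan mentioned here is a linear pass over the code units. A rough Python illustration follows, assuming a Python str whose elements each stand for one UTF-16 code unit; the function name first_surrogate_index is hypothetical.

```python
def first_surrogate_index(text):
    """Return the index of the first surrogate code unit in text, or -1.

    Assumption: each element of text models one UTF-16 code unit of a
    .NET string, so surrogates appear as code points U+D800..U+DFFF.
    """
    for index, unit in enumerate(text):
        if 0xD800 <= ord(unit) <= 0xDFFF:
            return index
    return -1
```

A string without surrogates, which is the common case, has to be walked to the end before the scan can answer, so the cost grows with the length of every string crossing the boundary.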

Originally posted by @BCSharp in https://github.com/IronLanguages/ironpython3/pull/680#issuecomment-554702661

Transparently switching between String/PythonString would be an option (assuming I can figure out how to do it; I still need to do that for int). But what I was suggesting was more like using a C# PythonString type as the backing type for the Python str type instead of System.String. This would make any string generated in Python code behave like CPython strings, but it would complicate interop with .NET a bit. Here is the sort of behavior I had in mind:

var s = "abc\ud800\udc00";
var a = (PythonString)s; // no surrogate scanning just wraps the string
Debug.Assert(s == a);
var b = PythonString.FromUtf16(s); // scans for surrogate pairs and converts
Debug.Assert(s != b);
var c = new PythonString(s); // not sure which behavior.
var d = (string)a; // backing field is already a string so just unwrap
Debug.Assert(ReferenceEquals(s, d));
var e = (string)b; // backing field is utf32 so we have to convert
Debug.Assert(s == e);

s = "abc\U00010000"
type(s) == str # backing .NET type is PythonString
assert len(s) == 4
# ... everything works like CPython on the Python side of things

import clr
import System
t = System.String(s) # uses the implicit cast so it gets converted
assert type(t) != str # unless we seamlessly switch backing store in which case it would be equal
assert t == "abc\ud800\udc00"
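
For comparison, the CPython behavior that the Python side is meant to match can be checked directly on CPython 3.4 or later. The helper utf16_code_units below is a hypothetical name introduced for this sketch; it uses the standard 'surrogatepass' error handler so that strings containing surrogate code points can be encoded too.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


astral = "abc\U00010000"   # one non-BMP code point
pair = "abc\ud800\udc00"   # two surrogate code points

# CPython counts code points, so the two strings differ in length and value,
# yet both map to the same sequence of UTF-16 code units.
lengths = (len(astral), len(pair))
same_units = utf16_code_units(astral) == utf16_code_units(pair)
```

This is why the cast to System.String in the example above can compare equal to "abc\ud800\udc00" on the .NET side while the two Python strings stay distinct.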

I don't know, I'm just throwing around ideas at this point...

Originally posted by @slozier in https://github.com/IronLanguages/ironpython3/pull/680#issuecomment-554748475

The option of transparently switching between String/PythonString is very appealing, especially since once it is working for int, the same mechanism can be used for string. But I do not know enough about IronPython internals to come up with anything concrete at this time. For the option of using PythonString as backing for str, I have the following thoughts:

I'd add a flag field to PythonString, say, bool _utf16validated (not sure about the name). It is set to true after the utf16 string has been scanned for surrogates and none are found. In that case, the utf16 field can safely be used as a Python string. If utf16 contains any surrogates, then the utf32 array is used. Further, I would keep all conversion logic as a private implementation detail of PythonString. If the code outside had to track when a string has to be converted to/from UTF-16, it would be easy to introduce bugs, especially since the interface surface between IronPython and .NET is so large. Here is the behaviour of PythonString I have in mind:

var s = "abc\ud800\udc00";
var ps = new uint[] { (uint)'a', (uint)'b', (uint)'c', 0x10000u };
var a = (PythonString)s; // no surrogate scanning, just sets utf16 to s;
                         // _utf16validated is false
Debug.Assert(s == a); // utf16 field accessed
PythonString b = s; // same as above but with implicit cast,
                    // since the wrapping does not involve any processing
var c = new PythonString(s); // same as above
// PythonString.FromUtf16(s) does not exist,
// scanning for surrogate pairs and conversion happens on demand
var d = new PythonString(ps); // no scanning, sets utf32 to ps, utf16 is null
Debug.Assert(d[0] == (uint)'a'); // utf32 array not null so it is accessed
Debug.Assert(d[3] == 0x10000u); // utf32 array not null so it is accessed,
                                // retrieving a non-BMP character
Debug.Assert(a[0] == (uint)'a'); // utf32 array is null and _utf16validated is false,
                                 // so utf16 is scanned;
                                 // since it contains surrogates, utf32 version is constructed
                                 // with 'surrogatepass' error handler, then accessed
Debug.Assert(a[3] == 0x10000u); // utf32 array not null anymore so it is accessed,
                                // retrieving a non-BMP character
var e = (string)a; // backing field is already a string so just unwrap
Debug.Assert(ReferenceEquals(s, e));
var f = (string)d; // backing field is utf32 so we have to convert, 
                   // result cached in utf16
Debug.Assert(s == f);
var g = new PythonString(new uint[] { (uint)'a', (uint)'b', (uint)'c', 0xd800u, 0xdc00u });
                        // above: surrogate pair in utf32, utf16 null
Debug.Assert(g[3] == 0xD800u); // utf32 not null hence indexed
Debug.Assert(g != d); // both have utf32 arrays which are compared
Debug.Assert(g != b); // b.utf32 is null, so scanning and conversion to UTF-32 happens,
                      // then utf32 compared
var h = new PythonString("abcd"); // stored in utf16, no scanning
Debug.Assert(h[3] == 'd'); // first access triggers scan, no surrogates detected,
                           // _utf16validated set to true, utf32 NOT created,
                           // utf16 indexed and cast to uint
var i = new PythonString("efgh"); // stored in utf16, no scanning
Debug.Assert(h != i); // first access to i triggers scan, no surrogates detected,
                      // _utf16validated set to true, utf32 NOT created,
                      // utf16 from both sides compared

Python:

# ... everything from your example except last line,  plus:
assert t == s  # t promoted to str for comparison
assert t != "abc\ud800\udc00"
assert t == System.String("abc\ud800\udc00")  # t maintains its type

Meaning of _utf16validated:

utf32	utf16	_utf16validated	behaviour
not null	null	don't care	operations performed on utf32, if utf16 needed for interop, conversion triggered
null	not null	false	if utf16 requested for interop, accessed, otherwise scanned and converted if necessary
null	not null	true	scanning already took place and no surrogates found, operations done on utf16 only
not null	not null	false	if utf16 requested for interop, accessed, otherwise utf32 used
not null	not null	true	scanning already took place and no surrogates found, operations done on utf16 or utf32, whichever preferred for the situation (are equivalent)

Results of scanning/conversion are always cached, so that subsequent operations are fast.

When a string literal is processed in Python, it is stored in utf16 if no surrogates are present; otherwise it is stored in utf32.
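
The state table and the caching rule can be modelled in a few lines of Python. This is a sketch under assumptions, not IronPython code: LazyPythonString, to_utf16 and the two helper functions are hypothetical names, lists of integers stand in for System.String and uint[], and pairing of surrogates follows the 'surrogatepass' idea (valid pairs are joined, lone surrogates are kept).

```python
def decode_utf16_surrogatepass(units):
    """Join valid surrogate pairs into code points; keep lone surrogates."""
    out, i = [], 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(unit)
            i += 1
    return out


def encode_utf16(code_points):
    """Split code points above U+FFFF into surrogate pairs."""
    units = []
    for cp in code_points:
        if cp > 0xFFFF:
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units


class LazyPythonString:
    def __init__(self, utf16=None, utf32=None):
        self.utf16 = list(utf16) if utf16 is not None else None
        self.utf32 = list(utf32) if utf32 is not None else None
        self.utf16_validated = False

    def _code_points(self):
        """Pick the sequence used for Python-side operations, scanning at most once."""
        if self.utf32 is not None:
            return self.utf32
        if not self.utf16_validated:
            if any(0xD800 <= unit <= 0xDFFF for unit in self.utf16):
                self.utf32 = decode_utf16_surrogatepass(self.utf16)  # cached
                return self.utf32
            self.utf16_validated = True  # no surrogates: utf16 is safe to use
        return self.utf16

    def __getitem__(self, index):
        return self._code_points()[index]

    def __len__(self):
        return len(self._code_points())

    def __eq__(self, other):
        if not isinstance(other, LazyPythonString):
            return NotImplemented
        return self._code_points() == other._code_points()

    def to_utf16(self):
        """Interop path: return utf16, converting from utf32 once if needed."""
        if self.utf16 is None:
            self.utf16 = encode_utf16(self.utf32)  # cached
        return self.utf16


a = LazyPythonString(utf16=[0x61, 0x62, 0x63, 0xD800, 0xDC00])
d = LazyPythonString(utf32=[0x61, 0x62, 0x63, 0x10000])
g = LazyPythonString(utf32=[0x61, 0x62, 0x63, 0xD800, 0xDC00])
h = LazyPythonString(utf16=[0x61, 0x62, 0x63, 0x64])
```

The variable names mirror the C# example above: indexing a triggers the scan and builds utf32, indexing h only sets the validated flag, and g stays different from d because its surrogates were supplied as separate code points.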

Originally posted by @BCSharp in https://github.com/IronLanguages/ironpython3/pull/680#issuecomment-554853999

IronLanguages / ironpython3

PEP 393 - Flexible String Representation #252