Use UTF-8 internally for strings.

I'm sure this has been discussed elsewhere and I don't know when, or if, we'll have time to implement it, but I think it's worth adding here.

Currently Python strs (PyUnicodeObject in C) are implemented as arrays of 1, 2 or 4 byte unicode code points, plus a header I propose that we implement them as a utf8 encoded array of bytes, plus a header.

The advantages are numerous:

Non-ASCII strings will generally be more compact, without making ASCII strings any bigger.
Strings can be joined easily by simple concatenation of the data
The internet is utf8, so the vast majority of encoding and decoding operations should be fast
There need only be one implmentation of each str method and each PyUnicode_ C function, saving considerable code size and simplifying the code
Algorithms for fast operations on utf8 are well known, so many operations will fast, despite the variable length encoding.
The C struct is clearer, as we don't need an awkward union of uint8_t, uint16_t and uint32_t for character data.

However there are two problems:

Indexing into strings is supposed to be O(1).
Some of the C API exposes the internal encoding

Keeping indexing O(1)

To maintain this properly we will have to lazily create an offset table for larger, non-ASCII strings. This is won't be a problem in practice because:

Creating the index is no more expensive than creating the current UCS1/2/4 strings.
We only allocate indexes if we need them, which should be relatively rare.

The C API

We will have to deprecate, and then remove, the C API that exposes the implementation details. We should probably deprecate for 3.14, so that we can implement utf8 strings in 3.16, allowing a proper deprecation period.

Implementation

We will probably want to embed the string data directly into the object, so the struct will look something like this:

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;   
    uintptr_t ascii: 1;   
    uintptr_t valid_utf8: 1;   
    uintptr_t length: (WORD_SIZE-4);    /* Number of code points in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;

The valid_utf8 bit helps fast encoding. It is false if the string contains half surrogate pairs or any other code point not allowed in legal utf-8.

Indexing operations

setitem(self, index) would be implemented something like this:

def getitem(self, index):
     if self.ascii:
         return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)

The index table would be composed of len(s)/64 entries, each entry being:

struct index_entry {
     uintptr_t base_offset;
     uint8_t additional_offset;
};

With the offset being computed as base_offset[index/64] + additional_offset[index%64].

We will also need the length in bytes, with adds another uintptr_t entry, which I forgot.

typedef struct {
    PyObject_HEAD
    uintptr_t interned: 2;   
    uintptr_t ascii: 1;   
    uintptr_t valid_utf8: 1;   
    uintptr_t length: (WORD_SIZE-4);    /* Number of code points in the string */
    uintptr_t byte_count;      /* Number of bytes in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PyUnicodeObject;

On 64 bit machines that's 48 bytes of header, compared to the current 40 bytes, unfortunately.

Given that for most strings len(s) is much less than the maximum value, we can reduce this to 32 bytes for ASCII strings and 40 bytes for most other strings:

typedef struct {
    PyObject_HEAD
    uint32_t is_small: 1;   
    uint32_t is_ascii: 1;   
    uint32_t interned: 2;   
    uint32_t ascii: 1;   
    uint32_t valid_utf8: 1;   
    uint32_t small_length: 24;    /* Number of code points in the string */
    uint32_t small_count;      /* Number of bytes in the string */
    Py_hash_t hash;             /* Hash value; -1 if not set */
} PyUnicodeObjectBase;

typedef struct {
    PyUnicodeObjectBase base;
    uint8_t data[1];
} PySmallAsciiObject;

typedef struct {
    PyUnicodeObjectBase base;
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint8_t data[1];
} PySmallUnicodeObject;

typedef struct {
    PyUnicodeObjectBase base;
    PyUnicodeIndex *index;    /* NULL unless needed */
    uint64_t large_length;    /* Number of code points in the string */
    uint64_t large_count;      /* Number of bytes in the string */
    uint8_t data[1];
} PyLargeUnicodeObject;

typedef union { PySmallAsciiObject ascii; PySmallUnicodeObject small; PyLargeUnicodeObject large } PyUnicodeObject;

This makes the code more complex, but still considerably simpler than what we have now.

It does seem that the world at large has come around to UTF-8, with fixed-size encodings feeling more like legacy at this point. See UTF-8 Everywhere.

The relevant PEP 393 provides background for the current state of Python strings.

The effects on the C API and the runtime/memory semantic changes seem big enough this should also have a PEP, IMHO.

PyPy has been using this approach since a number of years and we are indeed very happy with it (we even have zero-copy utf-8 decoding of bytes, at the cost of an extra indirection from our equivalent PyUnicode structure to the actual string data).

One of the most annoying parts in this approach is actually str.find for non-ascii strings. The bytes string find algorithm can be used just fine one the underlying bytes. However, the resulting byte-index needs to be converted to a codepoint-index, which is the other direction of what the index structure speeds up. What we do there is a conversion in O(log n) time using the index and a bit of local counting. A similar problem exists in our regex engine, but I don't know enough about that of CPython to know whether that would affect you.

I've been vaguely considering to rewrite the string find algorithm to also count the codepoints it skips over, but haven't gotten around to that yet.

(the JIT actually helps us here sometimes, because it knows that the functions byte_index_to_codepoint_index and codepoint_index_to_bytecode_index are inverses from each other. This helps if after a str.find call the result is then used to index into the string. but that is very hard to transfer to CPython, I think.)

Indexing operations

setitem(self, index) would be implemented something like this:
def getitem(self, index):
     if self.ascii:
         return self.data[index]
    if self.index is NULL:
        self.index = make_index(self)
    offset = offset_from_index(self.index, index)
    return read_one_char(self.data, offset)
The index table would be composed of len(s)/64 entries, each entry being:
struct index_entry {
     uintptr_t base_offset;
     uint8_t additional_offset;
};
With the offset being computed as base_offset[index/64] + additional_offset[index%64].

Can you elaborate on this more? This scheme seems clever, but after staring at it a while I still have no idea how it is supposed to work.

I'm assuming that you meant index_entries[index/64].base_offset + index_entries[index%64].additional_offset. But then "len(s)/64" entries seems wrong... the rest of the scheme seems to imply that there are at least max(len(s)/64, min(len(s), 64)) entries. Or maybe the struct is wrong and there is one array of len(s)/64 base offsets and another array of min(len(s), 64) additional offsets.

Either way, it seems like there are offsets that are impossible to encode... for example, a string with 64 one-byte characters followed by a bunch of two-byte characters, I think?

faster-cpython / ideas