Add PyUnicodeWriter API

vstinner commented 1 month ago

I propose adding a new public PyUnicodeWriter API:

typedef struct PyUnicodeWriter PyUnicodeWriter; // members are private

PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(Py_ssize_t length);

// Destroy the writer
PyAPI_FUNC(void) PyUnicodeWriter_Discard(PyUnicodeWriter *writer);

// Create the final string and destroy the writer
PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);

// Write a single character
PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
    PyUnicodeWriter *writer,
    Py_UCS4 ch);

// Decode a string from UTF-8
PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
    PyUnicodeWriter *writer,
    const char *str,  // decoded from UTF-8
    Py_ssize_t len);  // use strlen() if len < 0

// Similar to PyUnicode_FromFormat(format, ...)
PyAPI_FUNC(int) PyUnicodeWriter_Format(
    PyUnicodeWriter *writer,
    const char *format,
    ...);

// Write str(obj)
PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
    PyUnicodeWriter *writer,
    PyObject *obj);

// Write repr(obj)
PyAPI_FUNC(int) PyUnicodeWriter_WriteRepr(
    PyUnicodeWriter *writer,
    PyObject *obj);

// Write str[start:end]
PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
    PyUnicodeWriter *writer,
    PyObject *str,
    Py_ssize_t start,
    Py_ssize_t end);

This API is based on the existing private _PyUnicodeWriter API which exists for 12 years in Python. The private API is used by 5 projects on PyPI top 7,500 projects:

Cython (3.0.9)
asyncpg (0.29.0)
catboost (1.2.3)
frozendict (2.4.0)
immutables (0.20)

Example of usage:

static PyObject *
union_repr(PyObject *self)
{
    unionobject *alias = (unionobject *)self;
    Py_ssize_t len = PyTuple_GET_SIZE(alias->args);

    PyUnicodeWriter *writer = PyUnicodeWriter_Create();
    if (writer == NULL) {
        return NULL;
    }

    for (Py_ssize_t i = 0; i < len; i++) {
        if (i > 0 && PyUnicodeWriter_WriteUTF8(writer, " | ", 3) < 0) {
            goto error;
        }
        PyObject *p = PyTuple_GET_ITEM(alias->args, i);
        if (PyUnicodeWriter_WriteRepr(writer, p) < 0) {
            goto error;
        }
    }
    return PyUnicodeWriter_Finish(writer);

error:
    PyUnicodeWriter_Discard(writer);
    return NULL;
}

vstinner commented 1 month ago

The API is open for bikeshedding :-) See also the issue for different proposed names. I'm not sure about PyUnicodeWriter_Finish() name.

@encukou proposed PyUnicodeWriter_WriteUCS4Char() for PyUnicodeWriter_WriteChar(). I prefer the shorter name, there is no plan to add other WriteChar() functions, since Py_UCS4 covers the whole Unicode Character Set.

vstinner commented 1 month ago

Oh, an important note, this new API leaves the existing private API unchanged. Once a decision will be taken on this public API, I plan to open a new issue to discuss removing the private API. I prefer to discuss it separately.

vstinner commented 1 month ago

cc @gvanrossum @zooba

zooba commented 1 month ago

I like this proposal in general, there are a few simplifications to the API we should probably make, but overall I think this is a good one for us to expose.

I think all functions should return a status code (i.e. no voids)
agreed with calling it WriteChar() and always taking UCS4 (just like the C library functions that take int)
Finish() is fine to me - something like GetString() would suggest you still have to destroy it afterwards
how important is SetOverallocate? What happens if we just always overallocate? What happens if you don't set it? Is this really just setting the capacity? Or it is a flag that affects overallocation behaviour?
I would personally like a WriteUCS2(wchar_t *s, ...) as well, but would survive calling Format("%w", s) (or whatever the code is)
I like this for a Limited API, but we'd need a varargs alternative - how would that look, and should we design it now?

davidhewitt commented 1 month ago

I saw this and have a passing thought it's interesting to me because it might be a way to build Unicode objects using Rust formatting machinery while avoiding an allocation on the Rust side.

But I don't know if the performance of all the FFI calls would be worse overhead than the allocation (probably depends on use case).

So while Rust/PyO3 might be a consumer of this API, my intuition is that the most efficient thing we could ever do would be to allocate a utf8 buffer on the Rust side and transfer ownership of that into a Unicode objects via a pointer plus a function pointer for the destructor (but I know that would require a lot of work inside the interpreter to support such a thing).

All to say this caught my eye and I'm not sure yet how much it might be used by Rust 😁

vstinner commented 1 month ago

I think all functions should return a status code (i.e. no voids)

PyUnicodeWriter_SetOverallocate() only sets an internal flag, why should it be able to report an error?

how important is SetOverallocate? What happens if we just always overallocate? What happens if you don't set it? Is this really just setting the capacity? Or it is a flag that affects overallocation behaviour?

Controlling overallocation is a key for performance: you should disable overallocation before the last write.

Example (pseudo-code without error checking):

writer = PyUnicodeWriter_Create();
PyUnicodeWriter_WriteChar(writer, "x");
PyUnicodeWriter_WriteUTF8(writer, <very long string>);
return PyUnicodeWriter_Finish(writer);

The WriteUTF8() call is inefficient since the very long string will be overallocated, and then truncated by Finish(). It's better to disable overallocation:

writer = PyUnicodeWriter_Create();
PyUnicodeWriter_WriteChar(writer, "x");
PyUnicodeWriter_SetOverallocate(writer, 0);
PyUnicodeWriter_WriteUTF8(writer, <very long string>);
return PyUnicodeWriter_Finish(writer);

I would personally like a WriteUCS2(wchar_t *s, ...) as well, but would survive calling Format("%w", s) (or whatever the code is)

We can extend the API later, but I propose a minimalist API to make it more likely to be accepted :-)

I like this for a Limited API, but we'd need a varargs alternative - how would that look, and should we design it now?

PyUnicodeWriter_Format() is mostly a convenient API, it's possible to avoid it with more code.

zooba commented 1 month ago

We can extend the API later, but I propose a minimalist API to make it more likely to be accepted :-)

There's a very fine line between "minimalist" and "incomplete". Given we have wide-char APIs for Unicode already, this gap feels incomplete, and the overallocation flag API feels non-minimalist (to me, at least).

What do you think about having _Create() take an estimated size? Allocate the internal buffer to that size, and then no reallocations needed until the end (unless it's exceeded). Usually a realloc to make the buffer smaller is going to be cheap, right? It's only when you grow the buffer that it may need to be copied.

For really performance intensive situations I'd do my own write into a preallocated array (reused for each loop iteration, because if there's no loop then it's not "intensive"), and then just use PyUnicode_FromString at the end. But we're aiming for a combination of convenience and performance here, not too extreme in either direction. (The scenario I'm thinking of is when I want to convert an array of native strings into a single char-separated Python string. I think most recently I allocated an array myself for this, but I've definitely made a list and called join before. This proposed API would be better.)

PyUnicodeWriter_Format() is mostly a convenient API, it's possible to avoid it with more code.

Yes, but that other code likely involves other varargs APIs, or worse, APIs that rely on C's macros for varargs handling ;-) Perhaps there's a nice API we could have here to make something like _WriteStr() that passes through a NULL or consumes a reference so that _Write???(PyLong_FromLong(value)) only needs one error check? Otherwise code length is going to multiply out very quickly (but perhaps the non-C scenarios I'm thinking of here need to use all that code anyway and nothing would be saved?)

vstinner commented 4 weeks ago

For really performance intensive situations I'd do my own write into a preallocated array (reused for each loop iteration, because if there's no loop then it's not "intensive"), and then just use PyUnicode_FromString at the end. But we're aiming for a combination of convenience and performance here, not too extreme in either direction. (The scenario I'm thinking of is when I want to convert an array of native strings into a single char-separated Python string. I think most recently I allocated an array myself for this, but I've definitely made a list and called join before. This proposed API would be better.)

Performance still remains a mystery to me :-) Recently, I tried to micro-optimize PyUnicode_FromFormat("%R", str_obj) (for Python str object) using a private _PyUnicodeWriter internally. It was 10 to 30% slower. I didn't make sense to me, but well, I trust benchmark numbers :-)

In some cases, creating a temporary string and then write the string into _PyUnicodeWriter is faster than writing directly into _PyUnicodeWriter buffer. These operations are around 100 ns, it's very fast, and any abstraction layer (_PyUnicodeWriter) can make the code slower.

Proposed public PyUnicodeWriter API is not only about performance. It's designed for the limited C API which has no API to write characters to build a Python str object. It only has functions to decode strings from another format. See also https://github.com/python/cpython/issues/119609 : there is no API in the limited API to create a string from a UCS-4 buffer for example (@serhiy-storchaka suggests using PyUnicode_DecodeUTF32()).

vstinner commented 4 weeks ago

I would personally like a WriteUCS2(wchar_t *s, ...) as well, but would survive calling Format("%w", s) (or whatever the code is)

UCS-2 is different than Windows "UTF-16" wchar_t strings. UCS-2 is for strings limited to the BMP range: [U+0000; U+ffff]. Whereas Windows "UTF-16" wchar_t strings use surrogate pairs.

For Windows "UTF-16" wchar_t strings, you can reuse PyUnicode_FromWideChar(). That's what I mean by: "you can reuse the existing API" when PyUnicodeWriter_Format() is not available. Internally, PyUnicodeWriter_Format() is likely doing the same: create a temporary string and then write it into the writer.

vstinner commented 3 weeks ago

cc @encukou @erlend-aasland

erlend-aasland commented 3 weeks ago

I like Steve's suggestion about _Create taking an estimated size. The _SetOverallocate API feels unergonomic.

vstinner commented 3 weeks ago

I like Steve's suggestion about _Create taking an estimated size. The _SetOverallocate API feels unergonomic.

Ok, I updated the API:

I added an initial length parameter to PyUnicodeWriter_Create().
I removed the PyUnicodeWriter_SetOverallocate() function.

zooba commented 3 weeks ago

UCS-2 is different than Windows "UTF-16" wchar_t strings. UCS-2 is for strings limited to the BMP range: [U+0000; U+ffff]. Whereas Windows "UTF-16" wchar_t strings use surrogate pairs.

Trust me, I'm well aware of this 😉 I used UCS-2 specifically because it means we can ignore surrogate pairs and pass them through (which is what Windows does, despite claiming "UTF-16"), but if you prefer to properly decode then that's fine by me.

I just know that the vast majority of my cases will involve wchar_t * strings, so selfishly, I'd like to not have to write the conversion myself or have to use _Format(...) in all cases. But it's a selfish ask, so if you don't like it, don't do it and I'll live.

serhiy-storchaka commented 3 weeks ago

Here is how memory management works in many performant codecs:

We estimate the upper size of the output data and allocate the str or bytes object of the specified size.
Normally, bytes/characters are translated without checking for buffer overflow.
If a UnicodeError occurs, the error handler is called, which can return a replacement string. The replacement string can need resizing the output buffer. The new size is calculated taking into account the current positions in the input string and the output buffer. If previously translated characters consumed less than expected, the new size can be less than the old size, so no resize is needed.
Finally the output buffer is resized to the actual size.

Therefore, we need the ability not only to specify the initial size of the buffer, but also to change it during the writing process. It is more convenient to specify a minimum reserved size in addition to already written characters than an absolute buffer size.

vstinner commented 3 weeks ago

I'm not against adding an API for wchar_t* later or adding an API to preallocate a buffer while writing. But I would prefer to not try to solve all use cases right now. I prefer to start with a minimum API, and later see how it can be extended to fit more use cases.

serhiy-storchaka commented 3 weeks ago

Most if not all decoders can be rewritten with using PyUnicodeWriter. Old C API can create a PyUnicodeWriter, decode into it and return the resulting string. New C API can decode into the specified PyUnicodeWriter. But they need a way to resize the buffer in the middle of decoding.

encukou commented 3 weeks ago

I like this API. It's a good base to build on.

If we add PyUnicode_AsNativeFormat, we'll probably also want a PyUnicodeWriter_WriteNativeFormat, which will make PyUnicodeWriter_WriteUTF8 obsolete. (It would also handle wchar_t -- an alias to either the UCS-2 or UCS-4 format.) But, it's OK to add PyUnicodeWriter_WriteUTF8 now and either remove it before beta or keep it forever, as a simple alias.

To get rid of function call overhead in formatting (e.g. for Rust), we might want something like:

PyUnicodeWriter_WriteParts(PyUnicodeWriter *writer, struct PyUnicodeWriter_part *part, Py_ssize_t nparts);
struct PyUnicodeWriter_part {
    int32_t type;
    union {
        struct {const char *str; Py_ssize_t len;};
        struct {PyObject *obj; Py_ssize_t start; Py_ssize_t end;};
        Py_UCS4 ch;
    };
};

That could also be more efficient. For example, when writing (pseuducode) {type=LATIN1, str="a"*1000, len=1000}, {type=PYUNICODE, obj=pyunicode("á")}, {type=PYUNICODE, obj=pyunicode("😸")}, for the first string we could optimize away:

double conversion (UCS1->UCS2 and UCS2->UCS4)
scanning for the max character (from the last one, we know we need UCS4)

But, we can't replace the proposed API with this, since with a WriteParts you can't reuse a buffer for different parts. So, it can be a future addition.

zooba commented 3 weeks ago

we might want something like...

At risk of going further off topic, don't we have something like this for processing f-strings? (I love the concept, btw, but it's an addition, not a criticism of the current proposal.)

With the change to preallocate once rather than the overallocation flag, I'm happy with this API.

vstinner commented 3 weeks ago

I'm not convinced that PyUnicodeWriter_WriteParts() is needed, its benefits should be measured by benchmarks. But its design is interesting.

zooba commented 3 weeks ago

I'm not convinced that PyUnicodeWriter_WriteParts() is needed, its benefits should be measured by benchmarks.

It's not needed for performance over PyUnicodeWriter_Format, it's needed for compatibility for languages that can't use the C varargs calling convention.

vstinner commented 3 weeks ago

I modified the API multiple times to address different reviews. IMO it's now ready for a vote.

Are you fine with the proposed API?

[x] @encukou
[x] @erlend-aasland
[x] @gvanrossum
[x] @vstinner
[x] @zooba

vstinner commented 3 weeks ago

Great, the API is adopted by the C API Working Group, thanks :-)

vstinner commented 3 weeks ago

@encukou:

If we add https://github.com/python/cpython/issues/119609, we'll probably also want a PyUnicodeWriter_WriteNativeFormat, which will make PyUnicodeWriter_WriteUTF8 obsolete.

The advantage of a PyUnicodeWriter API is that you can pass it to a function which will write into it. We don't have to put all methods into PyUnicodeWriter. There are already multiple functions accepting a private _PyUnicodeWriter: https://github.com/python/cpython/issues/119182#issuecomment-2120453747. Internally, the private _PyUnicodeWriter can be used, it has more features ;-)

malemburg commented 2 weeks ago

Thanks.

I've added my comments to the PR: https://github.com/python/cpython/pull/119184/files

malemburg commented 2 weeks ago

May I request to reopen this, so that the error handling situation can be improved prior to making this official ?

gvanrossum commented 2 weeks ago

Sure.

serhiy-storchaka commented 2 weeks ago

There is no errors parameter in PyUnicode_FromString() and other function that takes the UTF-8 encoded C string. There is special function PyUnicode_DecodeUTF8() for this.

But in this case UTF8 is already included in the function name. It is difficult to add different function with UTF8 in the name which has the errors parameter without causing confusion. So it makes more sense to add the errors parameter from beginning.

In future we may add functions for writing strings decoded from UTF16, UTF32, the specified encoding, and from wchar_t, so it is better to have some idea about future names.

vstinner commented 2 weeks ago

May I request to reopen this, so that the error handling situation can be improved prior to making this official ?

@malemburg asked two things about error handling on the PR.

What is the behavior on error?
Can we add errors parameter to PyUnicodeWriter_WriteUTF8()?

For errors, I just modified the implementation to make all operations atomic. Either the whole string is written, or nothing is written. It's possible to recover from an error and continue using the writer.

For the UTF-8 error handler, I would prefer to leave the WriteUTF8() API simple since it's the most obvious choice to write a C string. It sound annoying to add NULL argument to all calls. If you want to use an error handler different than "strict"", just use PyUnicode_DecodeUTF8() + PyUnicodeWriter_WriteStr(). If later, this use case becomes common enough, we can consider adding a helper function for that. In terms of performance, I don't expect a major difference, it's more about making the API more convenient.

serhiy-storchaka commented 2 weeks ago

The writer is not usable after error (except PyUnicodeWriter_Discard()). Writing even a single character may require resizing the Unicode object. If resizing fails, we lose all previously written data.

vstinner commented 2 weeks ago

The writer is not usable after error (except PyUnicodeWriter_Discard()). Writing even a single character may require resizing the Unicode object. If resizing fails, we lose all previously written data.

The implementation doesn't work like that. WriteChar() allocates 1 character in the buffer. If the allocation fails, it returns an error. You can still use the writer (ex: after freeing memory), it stays in a consistent state. If you see anything wrong, please comment the PR directly.

serhiy-storchaka commented 2 weeks ago

It seems that my idea of how it works were wrong. My apologies for the noise.

malemburg commented 2 weeks ago

There is no errors parameter in PyUnicode_FromString() and other function that takes the UTF-8 encoded C string.

When this function was introduced in Python 2.6, it took a Latin-1 encoded char string as input. The Latin-1 codec cannot fail, so no errors parameter was needed. In Python 3, the meaning was changed to take a UTF-8 encoded string, which can* fail, so I would say that not introducing an errors parameter was an oversight when changing the API in Python 3. It would have been better to drop the API altogether rather than silently change the semantics, and point people to PyUnicode_DecodeUTF8() instead (which does have the errors parameter).

For errors, I just modified the implementation to make all operations atomic. Either the whole string is written, or nothing is written. It's possible to recover from an error and continue using the writer.

Thanks, this is much better.

For the UTF-8 error handler, I would prefer to leave the WriteUTF8() API simple since it's the most obvious choice to write a C string.

The internal writer API already provides all the logic necessary to handle errors correctly and with all possibilities the PyUnicode API supports, so not exposing this would actually make things more complex for the user and less efficient, since the whole point of the writer API is to avoid having to create temporary objects. If I have to resort back to creating temporary object, I might as well create a list of them and do a final join, skipping the writer API altogether.

As for UTF-8 being the being most obvious choice, this is probably true in many areas today, but it's also telling that the internal writer API does not provide an API to write UTF-8 encoded char* strings and instead provides APIs for ASCII and Latin-1, which the public API does not expose. The Latin-1 API cannot fail with encoding errors. The ASCII API can fail, but the current code does not provide error handling and would have to be rewritten to do so. Exposing those can probably wait a bit.

vstinner commented 2 weeks ago

@malemburg:

The internal writer API already provides all the logic necessary to handle errors correctly and with all possibilities the PyUnicode API supports, so not exposing this would actually make things more complex for the user and less efficient, since the whole point of the writer API is to avoid having to create temporary objects.

Proposed API is the bare minimum viable API. I plan to extend it later. But I would like that the first iteration has a clean, convenient and simple API.

Multiple extensions have already been requested in previous comments, such as accepting wchar_t*, accepting UCS2 and UCS4 strings, adding an errors parameter to decide how to decode UTF-8, etc. I would prefer to discuss them separately. My point is that the initial API doesn't have to cover all use cases, as soon as there is a simple way to implement it on top of the initial API.

vstinner commented 2 weeks ago

I plan to merge the approved PR next days unless the C API Working Group considers that we have to reopen the discussion about a errors parameter.

zooba commented 2 weeks ago

I would prefer to discuss them separately. My point is that the initial API doesn't have to cover all use cases, as soon as there is a simple way to implement it on top of the initial API.

It looks like the reason for reopening now is that the current, minimal API would prevent us from adding an errors parameter in the obvious place later, and so it's worth holding it up now.

It seems reasonable to me to either rename _WriteUTF8 to _WriteString if it's going to look like the existing _FromString function, or adding an errors parameter now.

I also don't have a particular problem with requiring valid UTF-8 and returning an error if it isn't, provided it doesn't slow down the successful case too badly (i.e. don't decode the string twice). _WriteUTF8WithErrors or _DecodeUTF8 can be added later.

gvanrossum commented 2 weeks ago

I haven't followed this latest discussion, but let's not rush anything.

vstinner commented 2 weeks ago

Ok, I have an offer :-)

Remove size parameter of PyUnicodeWriter_WriteUTF8() (!). It will be fine, read below.
Add PyUnicodeWriter_DecodeUTF8() function.

PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
    PyUnicodeWriter *writer,
    const char *str);

int PyUnicodeWriter_DecodeUTF8(
    PyUnicodeWriter *writer,
    const char *string,
    Py_ssize_t length,
    const char *errors,
    Py_ssize_t *consumed);

You can see PyUnicodeWriter_DecodeUTF8() as the feature full API: it adds 3 parameters, length, errors and consumed.

In the current implementation, all calls to PyUnicodeWriter_WriteUTF8() pass -1 as the size. IMO this parameter is useless for the common use case. Use PyUnicodeWriter_DecodeUTF8(writer, str, size, NULL, NULL) if you want to pass the size.

You can see an "unicode writer" as a file: FILE* or int fd in C. We don't have to add methods for every possible string type or "write operation" type directly in the PyUnicodeWriter API. We can also consider add "methods" (functions) to the PyUnicode API family which would accept a writer parameter.

For example, why only accepting UTF-8? What about UTF-16, Latin1, ASCII, Big5, ShiftJIS, and other encodings? If we consider more codecs, should we add a function which accept an encoding name? Ok, but it should also take an errors parameter to specific to support error handlers other than "strict". Also, what if I want a stateful decoder, I also need a consumed parameter to be able to continue decoding, keep the state between two calls.

Supporting the full decoder API is not trivial. Python has powerful APIs for that!

@zooba:

It seems reasonable to me to either rename _WriteUTF8 to _WriteString

The function was called PyUnicodeWriter_WriteASCIIString(). It was renamed to PyUnicodeWriter_WriteString() to accept UTF-8. The name was too close to PyUnicodeWriter_WriteStr() (which writes str(obj)), so I rename it to PyUnicodeWriter_WriteUTF8().

PyUnicodeWriter_WriteASCIIString() => PyUnicodeWriter_WriteString() => PyUnicodeWriter_WriteUTF8().

If we decide to add convenient methods for decoders to PyUnicodeWriter, I would prefer to reuse PyUnicode "Decode" term. For example:

int PyUnicodeWriter_DecodeUTF8(
    PyUnicodeWriter *writer,
    const char *string,
    Py_ssize_t length,
    const char *errors);  // <= add errors

int PyUnicodeWriter_DecodeUTF8Stateful(
    PyUnicodeWriter *writer,
    const char *string,
    Py_ssize_t length,
    const char *errors,
    Py_ssize_t *consumed);   // <= add consumes

It would be easier to navigate between such API and the existing APIs (similar names).

@zooba:

or adding an errors parameter now.

If we change PyUnicodeWriter_WriteUTF8(), I would prefer to remove the size parameter, instead of adding a parameter. And use a different function to pass a length (ex: PyUnicodeWriter_DecodeUTF8).

Current API:

    if (PyUnicodeWriter_WriteUTF8(writer, "Hello ", -1) < 0)
        goto error;
    }
    const char *name = "World";
    if (PyUnicodeWriter_WriteUTF8(writer, name, -1) < 0)
        goto error;
    }

Add errors:

    if (PyUnicodeWriter_WriteUTF8(writer, "Hello ", -1, NULL) < 0)
        goto error;
    }
    const char *name = "World";
    if (PyUnicodeWriter_WriteUTF8(writer, name, -1, NULL) < 0)
        goto error;
    }

Without size:

    if (PyUnicodeWriter_WriteUTF8(writer, "Hello ") < 0)
        goto error;
    }
    const char *name = "World";
    if (PyUnicodeWriter_WriteUTF8(writer, name) < 0)
        goto error;
    }

malemburg commented 2 weeks ago

In the current implementation, all calls to PyUnicodeWriter_WriteUTF8() pass -1 as the size.

You are probably referring to the PR and this only uses it in tests AFAICT, so no surprise there 😃

Overall, I think we should at least try to have consistent APIs, which similar signatures when dealing with more or less the same things. It may be convenient in some cases to drop parameters, but adding a NULL parameter for the sake of consistency isn't such a big burden.

> int PyUnicodeWriter_DecodeUTF8(
>     PyUnicodeWriter *writer,
>     const char *string,
>     Py_ssize_t length,
>     const char *errors,
>     Py_ssize_t *consumed);

If you want to add consumed, please call this PyUnicodeWriter_DecodeUTF8Stateful() for consistency.

FWIW: I don't think we need to add too many helpers for writing data to a Unicode string. We should add ones which occur often in real code, though.

The main use case is taking data from other external C sources and building Python Unicode strings from them iteratively - where the final size is not known upfront.

Common use cases are:

char* UTF-8 strings (typically used for files and network protocols)
char* ASCII strings (but those can be handled using the UTF-8 helpers as well)
wchar_t* strings (these come in 2-byte and 4-byte variants, with the 2-byte variant usually being UTF-16; avoids decoding/encoding when dealing with other C libraries; very common on Windows, less so on Unix)
Py_UCS4* strings (mostly for raw data interchange with other code that doesn't want to bother with different wchar_t sizes)

encukou commented 2 weeks ago

Remove size parameter of PyUnicodeWriter_WriteUTF8()

Please don't. CPython's use is biased toward short ASCII-only names (identifiers, encoding names, etc. -- stuff with a limimted character set). For general data we should allow embedded NULs, and only allow len=-1 as a convenience for “names”.

Add PyUnicodeWriter_DecodeUTF8() function

As you said, that can be UnicodeWriter_Decode with encoding as an argument (possibly with NULL meaning UTF-8). It's a much more powerful API.

I agree that an errors argument is better left for a future powerful API like this.

serhiy-storchaka commented 2 weeks ago

char* strings used for files usually are not UTF-8 encoded, but FS-encoded.

malemburg commented 2 weeks ago

char* strings used for files usually are not UTF-8 encoded, but FS-encoded.

Sorry, I meant "file data", not filenames.

Even for filenames, things are typically UTF-8 encoded nowadays. UTF-8 is becoming more and more a standard for anything text related.

erlend-aasland commented 2 weeks ago

I agree with Petr; let's keep the size param.

Steve:

I also don't have a particular problem with requiring valid UTF-8 and returning an error if it isn't [...]

I'm also fine with the existing API that requires strict format.

For Marc-André's use-case, perhaps it is better solved as a docs issue for now; we can point to PyUnicode_DecodeUTF8 from the PyUnicodeWriter_WriteUTF8 docs.

Petr:

I agree that an errors argument is better left for a future powerful API like this.

+1

vstinner commented 2 weeks ago

For Marc-André's use-case, perhaps it is better solved as a docs issue for now; we can point to PyUnicode_DecodeUTF8 from the PyUnicodeWriter_WriteUTF8 docs.

Sure, I added it to the documentation.

vstinner commented 2 weeks ago

@zooba @gvanrossum: Are you ok to keep PyUnicodeWriter_WriteUTF8() as it is and postpone addition to the API to a following step?

PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
    PyUnicodeWriter *writer,
    const char *str,  // decoded from UTF-8
    Py_ssize_t len);  // use strlen() if len < 0

malemburg commented 2 weeks ago

As a general note: It's fine to not add all bells and whistles to the API right from the start, but please do take the most common use cases into account.

If an API does not provide ways to handle errors using the error callback logic we have in the codecs sub-system, users of the API will always have to revert to creating temporary Unicode objects to get around this limitation.

This makes using the API difficult wherever you have to deal with data which isn't 100% clean (literal strings as used in the tests don't count as examples 😃).

Now, the whole point of the writer API is to make building Unicode strings easy, so failing on making it so, will only result in people writing their own APIs using the private writer APIs or resorting to build-a-list-then-join approach. Both are not ideal and not what we want to achieve.

Please also note that the code needed to handle errors properly is already there. It's just not exposed:

int
PyUnicodeWriter_WriteUTF8(PyUnicodeWriter *writer,
                          const char *str,
                          Py_ssize_t size)
{
    if (size < 0) {
        size = strlen(str);
    }

    _PyUnicodeWriter *_writer = (_PyUnicodeWriter*)writer;
    Py_ssize_t old_pos = _writer->pos;
    int res = unicode_decode_utf8_writer(_writer, str, size,
                                         _Py_ERROR_STRICT, NULL, NULL);
    if (res < 0) {
        _writer->pos = old_pos;
    }
    return res;
}

_Py_ERROR_STRICT, NULL, NULL is hiding the error handling logic, so adding full support for error handling is really, really easy.

It may actually be better to focus more on exposing PyUnicodeWriter_DecodeUTF8Stateful() to start with and then add e.g. PyUnicodeWriter_WriteUTF8() as a simplified wrapper afterwards.

I have a hard time understanding why we have to fight over each new C API (rich APIs are better and have always been), but if you absolutely want to trim down the API, you could consider removing PyUnicodeWriter_WriteRepr() instead, since this will only have very limited use in real life -- most uses of repr(obj) are inside format strings and the writer API already supports this.

vstinner commented 1 week ago

@zooba @malemburg: I created a draft PR to discuss adding PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful(), once the "base" PyUnicodeWriter API will be added: https://github.com/python/cpython/pull/120639

PyAPI_FUNC(int) PyUnicodeWriter_WriteWideChar(
    PyUnicodeWriter *writer,
    wchar_t *str,
    Py_ssize_t size);

PyAPI_FUNC(int) PyUnicodeWriter_DecodeUTF8Stateful(
    PyUnicodeWriter *writer,
    const char *string,         /* UTF-8 encoded string */
    Py_ssize_t length,          /* size of string */
    const char *errors,         /* error handling */
    Py_ssize_t *consumed);      /* bytes consumed */

zooba commented 1 week ago

Are you ok to keep PyUnicodeWriter_WriteUTF8() as it is and postpone addition to the API to a following step?

Sure.

I created a draft PR to discuss adding PyUnicodeWriter_WriteWideChar() and PyUnicodeWriter_DecodeUTF8Stateful(), once the "base" PyUnicodeWriter API will be added

If they're consistent with either the existing PyUnicodeWriter functions or their equivalent PyUnicodeObject functions, I doubt I'll have any concerns. If you feel a need to diverge from the precedents, then please @ me and point out the changes.

vstinner commented 1 week ago

Ok, let's start with this API as a first iteration!

malemburg commented 1 week ago

@vstinner Fair enough, as long as both PRs land in 3.14.

capi-workgroup / decisions

Add PyUnicodeWriter API #27