capi-workgroup / decisions

Discussion and voting on specific issues

Add PyUnicode_Export() and PyUnicode_Import() to the limited C API #33

Open. vstinner opened this issue 4 days ago

vstinner commented 4 days ago

I propose adding 3 functions to the limited C API to export/import Python str objects:

#define PyUnicode_FORMAT_ASCII 0x01  // Py_UCS1* (ASCII string)
#define PyUnicode_FORMAT_UCS1 0x02   // Py_UCS1*
#define PyUnicode_FORMAT_UCS2 0x04   // Py_UCS2*
#define PyUnicode_FORMAT_UCS4 0x08   // Py_UCS4*
#define PyUnicode_FORMAT_UTF8 0x10   // char*

// Get the content of a string in the requested format:
// - Set '*data', '*nbytes' and '*format', and return 0 on success.
// - Set an exception and return -1 on error.
//
// The export must be released by PyUnicode_ReleaseExport().
PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    uint32_t requested_formats,
    const void **data,
    Py_ssize_t *nbytes,
    uint32_t *format);

// Release an export created by PyUnicode_Export().
PyAPI_FUNC(void) PyUnicode_ReleaseExport(
    PyObject *unicode,
    const void* data,
    uint32_t format);

// Create a string object from a string in the format 'format'.
// - Return a reference to a new string object on success.
// - Set an exception and return NULL on error.
PyAPI_FUNC(PyObject*) PyUnicode_Import(
    const void *data,
    Py_ssize_t nbytes,
    uint32_t format);

PyUnicode_Export() selects the most suitable character format for the export among requested_formats. In most cases, there is no memory copy. If a memory copy is needed, PyUnicode_ReleaseExport() is responsible for releasing it.
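
For example, a caller that can consume any fixed-width format might use it like this (a minimal sketch against the proposed declarations above; error handling kept short):

#include <Python.h>
#include <stdio.h>

// Minimal usage sketch of the proposed API: export in whichever of
// UCS1/UCS2/UCS4 avoids a copy, consume the data, then always release it.
static int
dump_export(PyObject *unicode)
{
    const void *data;
    Py_ssize_t nbytes;
    uint32_t format;

    if (PyUnicode_Export(unicode,
                         PyUnicode_FORMAT_UCS1
                         | PyUnicode_FORMAT_UCS2
                         | PyUnicode_FORMAT_UCS4,
                         &data, &nbytes, &format) < 0) {
        return -1;
    }

    // ... consume 'data' and 'nbytes' according to 'format' ...
    printf("exported %zd bytes, format=0x%x\n", nbytes, (unsigned int)format);

    // The export must always be released, whether or not a copy was made.
    PyUnicode_ReleaseExport(unicode, data, format);
    return 0;
}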

I don't know if we should add a convenient constant for "all native formats", something like:

#define PyUnicode_FORMAT_NATIVE \
    (PyUnicode_FORMAT_ASCII \
     | PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4)

UTF-8 is not a native format in CPython, but it's the native format of PyPy.

PyUnicode_Import() is similar to PyUnicode_FromKindAndData(), but PyUnicode_FromKindAndData() is excluded from the limited C API.
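
For the import side, a minimal sketch (assuming the declarations above; note that nbytes is a size in bytes, not in code points):

#include <Python.h>

// Sketch: build a str from a UCS2 buffer with the proposed PyUnicode_Import().
static PyObject *
str_from_ucs2(const Py_UCS2 *buf, Py_ssize_t len)
{
    // The format argument tells PyUnicode_Import() how to interpret the bytes.
    return PyUnicode_Import(buf, len * (Py_ssize_t)sizeof(Py_UCS2),
                            PyUnicode_FORMAT_UCS2);
}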

encukou commented 4 days ago

I would prefer to use Py_buffer rather than 3 separate pieces of info that need to be kept around and passed to PyUnicode_ReleaseExport, but I'm OK with this too.

If a memory copy is needed, PyUnicode_ReleaseExport() is responsible for releasing it.

I'll note that the user must always call PyUnicode_ReleaseExport; there's no public indication about whether you have a copy or not. The intent is that if string internal layout changes in the future, this API will still work as before -- it'll just be slower and use more memory.

I don't know if we should add a convenient constant for "all native formats"

I think we should not, unless there's a use case I don't see.

I'm OK with a library using knowledge of CPython internals to “guess” that UCSn will be preferred to UTF-8 if all of UCS1-UCS4 are given. The price of guessing wrong is “only” a performance decrease. And to support PyPy's native format, you should add UTF-8 to the formats you support rather than look at PyUnicode_FORMAT_NATIVE.
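
In other words, a portable caller would just list every format it can consume (a sketch):

// Sketch: request every format the caller can handle; the implementation
// can then typically pick whichever one avoids a copy
// (UCS1/UCS2/UCS4 on CPython, UTF-8 on PyPy).
uint32_t requested = PyUnicode_FORMAT_UCS1
                   | PyUnicode_FORMAT_UCS2
                   | PyUnicode_FORMAT_UCS4
                   | PyUnicode_FORMAT_UTF8;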

vstinner commented 2 days ago

I'll note that the user must always call PyUnicode_ReleaseExport

Right, that's what the documentation says. It must always be called. I was just describing the implementation.

I think we should not, unless there's a use case I don't see.

The idea of PyUnicode_NATIVE_FORMATS would be a mask of the formats which are natively supported by CPython or PyPy. CPython would define it as:

#define PyUnicode_NATIVE_FORMATS \
    (PyUnicode_FORMAT_ASCII \
     | PyUnicode_FORMAT_UCS1 \
     | PyUnicode_FORMAT_UCS2 \
     | PyUnicode_FORMAT_UCS4)

PyPy would define it as:

#define PyUnicode_NATIVE_FORMATS PyUnicode_FORMAT_UTF8

But well, we can add it later if needed.

malemburg commented 2 days ago

I think it would be better to use a slightly different signature which separates error reporting from returning data. The buffer API (and many other Python C APIs) use the same approach.

// Get the content of a string in one of the requested formats:
// - Set '*data', '*nbytes' and '*format' on success, with the data available in '*data'.
// - Return 0 on success; set an exception and return -1 on error.
// - The function tries to avoid copying, so passing in multiple formats is advised.
//
// The export must be released using PyUnicode_ReleaseExport() in all cases.
//
PyAPI_FUNC(int) PyUnicode_Export(
    PyObject *unicode,
    uint32_t requested_formats,
    const void **data,
    Py_ssize_t *nbytes,
    uint32_t *format);

The release function may also need an int return value to report errors, e.g. if it finds that the data buffer does not match the Unicode object's buffer even though the format is correct.

I'm also not sure how PyUnicode_ReleaseExport() will deal with exports which do require copying data, e.g. if the Unicode object has a different kind than requested. In the PR, the function is basically a no-op in most cases, which looks wrong.

Update: Fixed a bug in the sig and clarified the comment wording a bit more.

vstinner commented 2 days ago

I'm also not sure how PyUnicode_ReleaseExport() will deal with exports which do require copying data, e.g. if the Unicode object has a different kind than requested. In the PR, the function is basically a no-op in most cases, which looks wrong.

If the requested format doesn't match the Python str kind, a copy is needed. In this case, PyUnicode_ReleaseExport() releases the memory. Example:
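
(A minimal sketch of the copy case, assuming the proposed API; not the literal example from the PR.)

#include <Python.h>

// "abc" is stored as ASCII/UCS1 internally, so requesting only UCS4 forces
// PyUnicode_Export() to allocate a fresh UCS4 copy; PyUnicode_ReleaseExport()
// then has to free that copy instead of being a no-op.
static int
copy_case(void)
{
    PyObject *str = PyUnicode_FromString("abc");
    if (str == NULL) {
        return -1;
    }

    const void *data;
    Py_ssize_t nbytes;
    uint32_t format;
    if (PyUnicode_Export(str, PyUnicode_FORMAT_UCS4,
                         &data, &nbytes, &format) < 0) {
        Py_DECREF(str);
        return -1;
    }
    // Here format == PyUnicode_FORMAT_UCS4 and 'data' is not the str's
    // internal buffer.

    PyUnicode_ReleaseExport(str, data, format);   // frees the copy
    Py_DECREF(str);
    return 0;
}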

mdboom commented 1 day ago

I would prefer to use Py_buffer rather than 3 separate pieces of info that need to be kept around and passed to PyUnicode_ReleaseExport, but I'm OK with this too.

+1 to this from me, but if it were a memoryview, it could just be DECREF'd. That would also allow, in the non-copy case, keeping the string alive for as long as all views on it are alive.

What are the intended semantics when PyUnicode_Export is followed by freeing the string, and then using the returned buffer? It seems like that would work if the types didn't match (so the data was copied), but not if the export was non-copying. I think using a memoryview would sidestep that problem by giving us the ability to properly reference count the buffers.
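
Something like the following flow, where PyUnicode_ExportView() is a name invented here purely to illustrate the memoryview-style alternative (no such function exists or has been proposed verbatim):

#include <Python.h>

// Hypothetical sketch only: 'PyUnicode_ExportView' is an invented stand-in
// for "PyUnicode_Export() returns an object that owns the exported buffer".
static int
use_view(PyObject *str)
{
    uint32_t format;
    PyObject *view = PyUnicode_ExportView(str, PyUnicode_FORMAT_UCS4, &format);
    if (view == NULL) {
        return -1;
    }

    Py_buffer buf;
    if (PyObject_GetBuffer(view, &buf, PyBUF_SIMPLE) < 0) {
        Py_DECREF(view);
        return -1;
    }
    // ... use buf.buf / buf.len as UCS4 data ...

    PyBuffer_Release(&buf);
    Py_DECREF(view);   // dropping the last reference releases the export
    return 0;
}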

vstinner commented 1 day ago

How do you express the 5 different formats in the Py_buffer format string? There are no ASCII, UCS1, UCS2 or UTF-8 formats in the table at https://docs.python.org/dev/library/array.html, only "w" for UCS4.

vstinner commented 1 day ago

@malemburg:

I think it would be better to use a slightly different signature which separates error reporting from returning data.

Ok, I made this change. I previously hesitated to do it because the API looked "unnatural" and difficult to use, but I now prefer returning -1 on error and 0 on success.

mdboom commented 18 hours ago

How do you express the 5 different formats in the Py_buffer format string? There are no ASCII, UCS1, UCS2 or UTF-8 formats in the table at https://docs.python.org/dev/library/array.html, only "w" for UCS4.

Use b, b, h, respectively? The fact that it's text is erased from the void * pointer returned in the original proposal here anyway, so I don't think using ints of the right type here is any worse.

mdboom commented 15 hours ago

To clarify, I think my main concern is that, as proposed, this doesn't return a PyObject * of some sort, so the memory management has unexpected pitfalls and doesn't really match much of the rest of the C API. In particular, think of code that always requests UCS4: it might always work as long as the string contains only ASCII, and thus is a copy, but then the user puts an emoji in the string and suddenly it's not a copy and the assumptions in the user's code break. I think semantic changes like this, based on the values of objects, can be particularly difficult to test and confirm correctness of (for the C API's users). By returning a PyObject * and using reference counting semantics, at least it will match much of the rest of the C API.
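
A sketch of that failure mode, assuming the proposed API (the bug is in the caller, but the API makes it easy to write):

#include <Python.h>

// Sketch of the pitfall: whether 'data' outlives the string silently depends
// on whether the export had to copy. This "works" for ASCII-only strings
// (UCS4 is then a copy) and breaks once the string is stored as UCS4 natively.
static const void *
buggy_ucs4_export(PyObject *str)   // assume this function owns 'str'
{
    const void *data;
    Py_ssize_t nbytes;
    uint32_t format;
    if (PyUnicode_Export(str, PyUnicode_FORMAT_UCS4,
                         &data, &nbytes, &format) < 0) {
        Py_DECREF(str);
        return NULL;
    }
    Py_DECREF(str);   // BUG: in the zero-copy case 'data' points into the
                      // now-freed string; ReleaseExport() is also never called.
    return data;
}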

I feel less strongly about whether the PyObject * returned is a memoryview or some new PyObject * subclass just for this purpose. Reusing memoryview reduces the amount of new API, but as @vstinner pointed out, maybe (at least for self-descriptive reasons) it would need to grow new data types to be more explicit about the fact that it contains string data.