Add PyBytes_Join() function

vstinner commented 4 months ago

API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)

Similar to sep.join(iterable) in Python.

sep must be Python bytes object.

iterable must be an iterable object yielding objects that implement the buffer protocol.

On success, return a new bytes object. On error, set an exception and return NULL.

UPDATE: Don't accept sep=NULL.

It's different than PyUnicode_Join(NULL, iterable) which treats NULL separator as a whitespace (' '). This PyUnicode_Join() behavior is not documented. The PyUnicode_Join() documentation only says:

Join a sequence of strings using the given separator and return the resulting Unicode string.

encukou commented 4 months ago

Looks good to me! I agree that b' ' doesn't make much sense as a default for bytes.

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it, and PyBytes_Join docs should mention that the default is different.

vstinner commented 4 months ago

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it

I agree. I don't think that it would be a good idea to change the behavior because the function exists since forever in Python (ex: it exists in Python 2.7 with NULL treated as a whitespace).

malemburg commented 4 months ago

FYI: This originates from the default for string.join() (the module function) in Python 2. The default separator was a blank. It's been the default for PyUnicode_Join() ever since the API was added to Python.

Today, I would not allow for this corner case anymore, though. Passing in NULL as first argument is bound to mask potential errors in code,

picnixz commented 4 months ago

We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it

Isn't it possible to change its behaviour using the proper deprecation process? (or is it impossible because it's part of the stable ABI that we cannot touch it like this?)

encukou commented 4 months ago

It's possible, yes. But it would mean that everyone who uses it this way needs to update their code. It would be quite cruel of us to do it without a very good reason.

picnixz commented 4 months ago

Woulnd't be the following be legitimate reasons: 1) it's not documented 2) it's something introduced for Python 2, 3) it would be inconsistent with PyBytes_Join?

encukou commented 4 months ago

None of those are reasons to change it. A reason to change it would be, as Mark said, that passing NULL can mask potential errors in code. IMO, that on its own is a rather weak reason to break any working code.

It doesn't really matter if it's documented or not. If the docs, are missing, people read the code. Or they definitely did it in the past. (Also see PEP-387: “Note that if something is not documented at all, it is not automatically considered private.”)
If it's been around since Python 2, there are decades of code that use it and would need to change. This is an argument against changing it.
If we want consistency, we should look at the newly added function: PyBytes_Join should use ASCII space by default, or not allow NULL at all. But, we seem to agree that defaulting to b'' is worth the inconsistency. Bytes and text are different beasts; the default separator can be different too.

vstinner commented 4 months ago

All usages of the current private _PyBytes_Join() in the Python code base are with an empty bytes string, so IMO it's worth it and convenient to accept NULL and treat NULL as an empty bytes string. It avoids having to "create" the empty bytes string singleton, handle errors, etc.

malemburg commented 4 months ago

To avoid the same error masking issue as with PyUnicode_Join() I'd suggest to not use NULL as default parameter, but instead a use separate macro PY_BYTES_EMPTY or perhaps even an interned and immortal singleton Py_BYTES_EMPTY (haven't checked whether we already have something like this).

malemburg commented 4 months ago

Had a look... we already have something like this in form of bytes_get_empty() in bytesobject.c. It's just not exposed in the public API.

picnixz commented 4 months ago

You can just do Py_GetConstant(Py_CONSTANT_EMPTY_BYTES) as well. This is how we publicly expose singletons.

malemburg commented 4 months ago

Thanks for mentioning this. I wasn't aware of that new API: https://docs.python.org/3.14/c-api/object.html#c.Py_GetConstant

Unfortunately, this returns a strong reference, so you'd still have the ref count manage the object instead of just doing PyBytes_Join(Py_EMPTY_BYTES, iterator)

There is Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES), which could be used instead, but that seems very verbose for such a simple and common parameter value.

serhiy-storchaka commented 3 months ago

We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().

encukou commented 3 months ago

Or have the sep=NULL branch check PyErr_Occurred(), and raise a stern SystemError. (I'm assuming the main error to worry about is using a return value of a Py* function without error checking.)

PyUnicode_Join could do that as well.

malemburg commented 3 months ago

Both solutions sound like a good alternative approach. Petr's version would even solve the potential issue with PyUnicode_Join().

My concern is mostly about passing in NULL as the first parameter, since you normally would pass in the object you want to work on as this parameter. A forgotten NULL check could then easily result in the join function doing it's job and leaving a dangling error around which would then show up at some later point in the execution of the program - which is really hard to debug. I've run into such issues too often to not pay close attention to this anymore.

While this can be an issue with other parameters as well, the first one is special, since working on NULLs is rather uncommon 😄

vstinner commented 3 months ago

We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().

I like this approach.

vstinner commented 3 months ago

I propose to:

Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string
Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set
Don't accept NULL in PyBytes_Join()

picnixz commented 3 months ago

I like those suggestions. When you say "don't accept NULL in PyBytes_Join", do you mean a simple assert? For PyUnicode_Join, do you mean calling PyErr_BadInternalCall() or having something more explicit (i.e., a better message)?

malemburg commented 3 months ago

I propose to:

Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string

Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set

Don't accept NULL in PyBytes_Join()

Sounds good.

vstinner commented 3 months ago

I mean PyErr_BadInternalCall() yes, raise SystemError.

picnixz commented 3 months ago

To summarize:

a. Make a PR where you call PyErr_BadInternalCall when calling PyUnicode_Join with a NULL. b. Handle Py_None as being equivalent to "". In particular, we don't have a special casing for a whitespace " " anymore.

Should this change be backported to 3.12 and 3.13 as well without notice? Or should it only be a 3.14 change?
a. Update your PR for PyBytes_Join to call PyErr_BadInternalCall if NULL is passed as a separator. b. Update your PR to accept Py_None as being equivalent to b"".

serhiy-storchaka commented 3 months ago

I think that in this case we may add a SystemError with more specific error message (similar to these that are raised when C implemented function returns non-NULL with an error set). It can also be chained with the original exception. But this is an implementation detail.

I would prefer to add special references for empty str and bytes to the public C API, but using Py_None is the second best option. Definitely better than using NULL with different semantic.

After adding _PyLong_Zero and _PyLong_One I thought about adding corresponding global constants for empty string, bytes object, tuple, etc, but did not have enough use cases for them. Since then, evolution has gone in a different direction, _PyLong_Zero and _PyLong_One were replaced with _PyLong_GetZero() and _PyLong_GetOne() which return a borrowed reference. I think this made them less ready for the public C API.

encukou commented 3 months ago

It's all personal opinions now. As for me, I don't like using one Python object to stand in for another.

Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)

I'd prefer any of:

Using b'' itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES). If you use it once, it's nice and descriptive; if you need it many times you can #define a short name.
Using NULL to mean no separator (with the PyErr_Occurred() check, it's not that dangerous)
A separate one-arg function, like PyBytes_ConcatIterable

picnixz commented 3 months ago

I would prefer to add special references for empty str and bytes to the public C API,

If we can, I would also prefer it. Returning an empty string or using an empty string might be common enough (for instance search for PyUnicode_FromString("")) and it would reflect the usage in the code (like, you'll read the code and translate it in your head as "".join(...) for instance and not None.join(...)).

We seem to have #define emptystring (PyObject *)&_Py_SINGLETON(bytes_empty) for code objects, (bytes_empty is in _Py_static_objects but there does not seem to have a PyUnicodeObject containing the empty string). So we could do the same for an empty string maybe? (or just expose the macro itself in a clearer way).

erlend-aasland commented 3 months ago

I'm fine with either Py_None or Petr's first alternative:

Using b'' itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES). If you use it once, it's nice and descriptive; if you need it many times you can #define a short name.

vstinner commented 3 months ago

Accepting NULL is causing too much trouble:

Maybe passing NULL was not the intent, but the result of a failing function call. I don't want to add PyErr_Occurred() in this case, worst errors don't even set an exception.
PyUnicode_Join(NULL, iterable) uses a space rather than an empty string.

I prefer to abandon the NULL idea at this point.

vstinner commented 3 months ago

@encukou:

Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)

I'm not aware of any existing C API doing that, so maybe Py_None is a bad idea here, especially because getting an empty bytes string became cheap and easy (Py_GetConstantBorrowed) in Python 3.13.

vstinner commented 3 months ago

Ok, let's vote on the simple API: sep must always be a Python bytes object (it cannot be NULL, it cannot be Py_None).

API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)

Similar to sep.join(iterable) in Python.
sep must be Python bytes object.
iterable must be an iterable object yielding objects that implement the buffer protocol.
On success, return a new bytes object. On error, set an exception and return NULL.

Vote:

[x] @encukou
[x] @erlend-aasland
[x] @mdboom
[x] @serhiy-storchaka
[x] @vstinner
[x] @zooba

malemburg commented 3 months ago

If we go with the above proposal, please add a macro to return a borrowed reference to the empty bytes constant (= Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)) and similarly for the empty Unicode constant.

vstinner commented 3 months ago

Ping @mdboom and @zooba for the vote ;-)

zooba commented 3 months ago

_PyLong_Zero and _PyLong_One were replaced with _PyLong_GetZero() and _PyLong_GetOne() which return a borrowed reference. I think this made them less ready for the public C API.

Less ready for a C API, true, but more ready for a generic native API (that can support languages other than C), as well as more ready for a thread-aware API. It was a worthwhile change.

please add a macro to return a borrowed reference to the empty bytes constant

Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES and Py_EMPTY_STR that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.

picnixz commented 3 months ago

Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES and Py_EMPTY_STR that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.

Could we do it for all known constants to be consistent? there are multiple places where empty str/bytes are being returned and some files use local helpers for that. I think we can have a PR only for this (namely implement a correspondence between constants and macros and remove those local helpers).

zooba commented 3 months ago

Provided there are no name conflicts, sure. Macros are cheap, and I believe all of these constants are already immortal/true-constant, which means there's no likely future where refcounting will actually matter.

We do want to deprecate functions that return borrowed references, as they make refcounting very complicated. But these constants are effectively tagged pointers now rather than live objects (the refcount is still writable, but properly-built extensions will leave it alone, and they are interpreter- and thread-agnostic), so whether strong or borrowed isn't a big deal.

vstinner commented 3 months ago

@mdboom: You didn't vote yet in https://github.com/capi-workgroup/decisions/issues/36#issuecomment-2263232549 - what's your call on this API?

vstinner commented 3 months ago

The C API Working Group adopted PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable) API where sep must be a bytes object (cannot be NULL). I close the issue and I will update the PR https://github.com/python/cpython/pull/121646.

capi-workgroup / decisions

Add PyBytes_Join() function #36