Closed vstinner closed 3 months ago
Looks good to me! I agree that b' '
doesn't make much sense as a default for bytes.
We can't change that behaviour of PyUnicode_Join
with NULL
, but IMO we should document it, and PyBytes_Join
docs should mention that the default is different.
We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it
I agree. I don't think that it would be a good idea to change the behavior because the function exists since forever in Python (ex: it exists in Python 2.7 with NULL treated as a whitespace).
FYI: This originates from the default for string.join() (the module function) in Python 2. The default separator was a blank. It's been the default for PyUnicode_Join() ever since the API was added to Python.
Today, I would not allow for this corner case anymore, though. Passing in NULL as first argument is bound to mask potential errors in code,
We can't change that behaviour of PyUnicode_Join with NULL, but IMO we should document it
Isn't it possible to change its behaviour using the proper deprecation process? (or is it impossible because it's part of the stable ABI that we cannot touch it like this?)
It's possible, yes. But it would mean that everyone who uses it this way needs to update their code. It would be quite cruel of us to do it without a very good reason.
Woulnd't be the following be legitimate reasons: 1) it's not documented 2) it's something introduced for Python 2, 3) it would be inconsistent with PyBytes_Join
?
None of those are reasons to change it. A reason to change it would be, as Mark said, that passing NULL can mask potential errors in code. IMO, that on its own is a rather weak reason to break any working code.
PyBytes_Join
should use ASCII space by default, or not allow NULL
at all.
But, we seem to agree that defaulting to b''
is worth the inconsistency. Bytes and text are different beasts; the default separator can be different too.All usages of the current private _PyBytes_Join()
in the Python code base are with an empty bytes string, so IMO it's worth it and convenient to accept NULL and treat NULL as an empty bytes string. It avoids having to "create" the empty bytes string singleton, handle errors, etc.
To avoid the same error masking issue as with PyUnicode_Join() I'd suggest to not use NULL as default parameter, but instead a use separate macro PY_BYTES_EMPTY or perhaps even an interned and immortal singleton Py_BYTES_EMPTY (haven't checked whether we already have something like this).
Had a look... we already have something like this in form of bytes_get_empty()
in bytesobject.c. It's just not exposed in the public API.
You can just do Py_GetConstant(Py_CONSTANT_EMPTY_BYTES)
as well. This is how we publicly expose singletons.
Thanks for mentioning this. I wasn't aware of that new API: https://docs.python.org/3.14/c-api/object.html#c.Py_GetConstant
Unfortunately, this returns a strong reference, so you'd still have the ref count manage the object instead of just doing PyBytes_Join(Py_EMPTY_BYTES, iterator)
There is Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)
, which could be used instead, but that seems very verbose for such a simple and common parameter value.
We could also use Py_None
as a special value for an empty separator in PyUnicode_Join()
and PyBytes_Join()
.
Or have the sep=NULL
branch check PyErr_Occurred()
, and raise a stern SystemError
.
(I'm assuming the main error to worry about is using a return value of a Py*
function without error checking.)
PyUnicode_Join
could do that as well.
Both solutions sound like a good alternative approach. Petr's version would even solve the potential issue with PyUnicode_Join()
.
My concern is mostly about passing in NULL as the first parameter, since you normally would pass in the object you want to work on as this parameter. A forgotten NULL check could then easily result in the join function doing it's job and leaving a dangling error around which would then show up at some later point in the execution of the program - which is really hard to debug. I've run into such issues too often to not pay close attention to this anymore.
While this can be an issue with other parameters as well, the first one is special, since working on NULLs is rather uncommon 😄
We could also use Py_None as a special value for an empty separator in PyUnicode_Join() and PyBytes_Join().
I like this approach.
I propose to:
I like those suggestions. When you say "don't accept NULL in PyBytes_Join
", do you mean a simple assert
? For PyUnicode_Join
, do you mean calling PyErr_BadInternalCall()
or having something more explicit (i.e., a better message)?
I propose to:
- Accept Py_None in PyBytes_Join() and PyUnicode_Join(): treated as an empty string
- Raise SystemError in PyUnicode_Join() if called with NULL separator with an exception set
- Don't accept NULL in PyBytes_Join()
Sounds good.
I mean PyErr_BadInternalCall() yes, raise SystemError.
To summarize:
a. Make a PR where you call PyErr_BadInternalCall
when calling PyUnicode_Join
with a NULL.
b. Handle Py_None
as being equivalent to ""
. In particular, we don't have a special casing for a whitespace " "
anymore.
Should this change be backported to 3.12 and 3.13 as well without notice? Or should it only be a 3.14 change?
a. Update your PR for PyBytes_Join
to call PyErr_BadInternalCall
if NULL is passed as a separator.
b. Update your PR to accept Py_None
as being equivalent to b""
.
I think that in this case we may add a SystemError with more specific error message (similar to these that are raised when C implemented function returns non-NULL with an error set). It can also be chained with the original exception. But this is an implementation detail.
I would prefer to add special references for empty str
and bytes
to the public C API, but using Py_None
is the second best option. Definitely better than using NULL with different semantic.
After adding _PyLong_Zero
and _PyLong_One
I thought about adding corresponding global constants for empty string, bytes object, tuple, etc, but did not have enough use cases for them. Since then, evolution has gone in a different direction, _PyLong_Zero
and _PyLong_One
were replaced with _PyLong_GetZero()
and _PyLong_GetOne()
which return a borrowed reference. I think this made them less ready for the public C API.
It's all personal opinions now. As for me, I don't like using one Python object to stand in for another.
Do we even have a precedent for C API taking Py_None
to mean “default”? (edit: other than implementing/mirroring Python functions that take None)
I'd prefer any of:
b''
itself -- that is, Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)
. If you use it once, it's nice and descriptive; if you need it many times you can #define
a short name.NULL
to mean no separator (with the PyErr_Occurred()
check, it's not that dangerous)PyBytes_ConcatIterable
I would prefer to add special references for empty str and bytes to the public C API,
If we can, I would also prefer it. Returning an empty string or using an empty string might be common enough (for instance search for PyUnicode_FromString("")
) and it would reflect the usage in the code (like, you'll read the code and translate it in your head as "".join(...)
for instance and not None.join(...)
).
We seem to have #define emptystring (PyObject *)&_Py_SINGLETON(bytes_empty)
for code objects, (bytes_empty
is in _Py_static_objects
but there does not seem to have a PyUnicodeObject
containing the empty string). So we could do the same for an empty string maybe? (or just expose the macro itself in a clearer way).
I'm fine with either Py_None
or Petr's first alternative:
Using
b''
itself -- that is,Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)
. If you use it once, it's nice and descriptive; if you need it many times you can#define
a short name.
Accepting NULL is causing too much trouble:
I prefer to abandon the NULL idea at this point.
@encukou:
Do we even have a precedent for C API taking Py_None to mean “default”? (edit: other than implementing/mirroring Python functions that take None)
I'm not aware of any existing C API doing that, so maybe Py_None is a bad idea here, especially because getting an empty bytes string became cheap and easy (Py_GetConstantBorrowed) in Python 3.13.
Ok, let's vote on the simple API: sep must always be a Python bytes object (it cannot be NULL, it cannot be Py_None).
API: PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)
sep.join(iterable)
in Python.bytes
object.bytes
object. On error, set an exception and return NULL
.Vote:
If we go with the above proposal, please add a macro to return a borrowed reference to the empty bytes constant (= Py_GetConstantBorrowed(Py_CONSTANT_EMPTY_BYTES)
) and similarly for the empty Unicode constant.
Ping @mdboom and @zooba for the vote ;-)
_PyLong_Zero
and_PyLong_One
were replaced with_PyLong_GetZero()
and_PyLong_GetOne()
which return a borrowed reference. I think this made them less ready for the public C API.
Less ready for a C API, true, but more ready for a generic native API (that can support languages other than C), as well as more ready for a thread-aware API. It was a worthwhile change.
please add a macro to return a borrowed reference to the empty bytes constant
Yeah, I think this is worth adding ourselves. Py_EMPTY_BYTES
and Py_EMPTY_STR
that call the GetConstantBorrowed
function aren't really adding any more risk or burden to the API.
Yeah, I think this is worth adding ourselves.
Py_EMPTY_BYTES
andPy_EMPTY_STR
that call the GetConstantBorrowed function aren't really adding any more risk or burden to the API.
Could we do it for all known constants to be consistent? there are multiple places where empty str/bytes are being returned and some files use local helpers for that. I think we can have a PR only for this (namely implement a correspondence between constants and macros and remove those local helpers).
Provided there are no name conflicts, sure. Macros are cheap, and I believe all of these constants are already immortal/true-constant, which means there's no likely future where refcounting will actually matter.
We do want to deprecate functions that return borrowed references, as they make refcounting very complicated. But these constants are effectively tagged pointers now rather than live objects (the refcount is still writable, but properly-built extensions will leave it alone, and they are interpreter- and thread-agnostic), so whether strong or borrowed isn't a big deal.
@mdboom: You didn't vote yet in https://github.com/capi-workgroup/decisions/issues/36#issuecomment-2263232549 - what's your call on this API?
The C API Working Group adopted PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)
API where sep must be a bytes object (cannot be NULL). I close the issue and I will update the PR https://github.com/python/cpython/pull/121646.
API:
PyObject* PyBytes_Join(PyObject *sep, PyObject *iterable)
Similar to
sep.join(iterable)
in Python.sep must be Python
bytes
object.iterable must be an iterable object yielding objects that implement the buffer protocol.
On success, return a new
bytes
object. On error, set an exception and returnNULL
.UPDATE: Don't accept sep=NULL.
It's different than
PyUnicode_Join(NULL, iterable)
which treats NULL separator as a whitespace (' '
). ThisPyUnicode_Join()
behavior is not documented. ThePyUnicode_Join()
documentation only says: