This is necessary to implement a hashable object that compares equal to a `bytes` (or `memoryview`) with the same contents.
Going a bit further: if we encourage allowing such comparisons, like `PyArrow.buffer(b'xxx') == b'xxx' == MyCustomBuffer(b'xxx')`, then we should encourage that `PyArrow.buffer(b'xxx') == MyCustomBuffer(b'xxx')`. (And of course the hashes need to match too.)
This means that comparison methods of buffer classes should, when given an `other` argument of an unknown type, ask `other` for a buffer and compare its contents to what `self` would export. And if they do that, they should use `_Py_HashBytes` for hashing.
I'd name the public function `Py_HashBuffer`, even though it doesn't take the entire `Py_buffer` struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.
What's the behavior for negative length? Currently, it seems like a negative length is cast to an unsigned length and so has funny behavior. I suggest changing the length type to the unsigned `size_t`.
I would be fine with a `Py_hash_t Py_HashBytes(const void *data, size_t size)` API. FYI it returns 0 if `size` is equal to zero, but I would prefer not to put that in the API description.
Other examples of `size_t` parameters:

- `PyAPI_FUNC(PyObject*) PyLong_FromNativeBytes(const void* buffer, size_t n_bytes, int endianness);`: recent function
- `static inline size_t _PyObject_SIZE(PyTypeObject *type)`: I converted it to a static inline function recently. The implementation is way easier if it returns an unsigned size.
- `void* PyObject_Malloc(size_t size)`
- `PyAPI_FUNC(wchar_t *) Py_DecodeLocale(const char *arg, size_t *size);`
- `PyAPI_FUNC(int) PyOS_snprintf(char *str, size_t size, const char *format, ...)`
In the C library, `size_t` is common. Examples:

- `void *memcpy(void *dest, const void *src, size_t n);`
- `ssize_t write(int fd, const void *buf, size_t count);`: return type is interesting here :-)
- `int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize);`
> Currently, it seems like a negative length is cast to an unsigned length and so has funny behavior. I suggest changing the length type to the unsigned `size_t`.
All object lengths in Python are signed, including the `Py_buffer::len` member.
I don't really care either way, but insisting on `size_t` seems a bit gratuitous to me :-)
> I don't really care either way, but insisting on `size_t` seems a bit gratuitous to me :-)
If you want to keep `Py_ssize_t`, I would ask for the behavior for negative length to be defined, and different from "random crash" if possible. Negative length could for example be clamped to 0, perhaps coupled with a debug-mode assertion?
I'm fine with `Py_HashBytes(data, -5)` returning 0 (rather than crashing). We can document that the length must be positive :-) Sure, an assertion is welcome.
> I'd name the public function `Py_HashBuffer`, even though it doesn't take the entire `Py_buffer` struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.
If the parameter type is not `Py_buffer`, IMO the `Py_HashBuffer()` name is misleading and I prefer the proposed `Py_HashBytes()` name.
It's not a `PyBytes` object either. The reasoning behind `Py_HashBuffer` is that you should use the same data that you expose with `bf_getbuffer`, otherwise you get surprising behaviour.
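For the hypothetical `MyBufferObject` sketched above, "the same data that you expose with `bf_getbuffer`" would look roughly like this (again an illustrative assumption, not code from the issue):

```c
static int
mybuffer_getbuffer(PyObject *self, Py_buffer *view, int flags)
{
    MyBufferObject *buf = (MyBufferObject *)self;
    /* Expose the same data/size that tp_richcompare and tp_hash use,
       read-only since the object is meant to be immutable. */
    return PyBuffer_FillInfo(view, self, (void *)buf->data, buf->size,
                             1 /* readonly */, flags);
}
```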
@pitrou, I see you've added a thumbs-up to my comment. IMO, the docs will be fairly important in this case; I'd really like to frame it as “implementing equality & hashes for (immutable) buffer objects” rather than “here's a function you can use”. Do you want to draft it, or should I try my hand?
Negatives aren't the only case of invalid sizes. IMO it's OK if `Py_HashBuffer(data, -1)` has undefined behaviour, like e.g. `Py_HashBuffer(data, SIZE_MAX/4)` would (on common systems). A debug assertion is fine.
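To make one of the options discussed above concrete (clamp to 0 plus a debug-mode assertion), the guard could look roughly like this. This is a sketch of the idea only, not the actual CPython implementation; it delegates to the internal helper described in this issue:

```c
#include <Python.h>
#include <assert.h>

Py_hash_t
Py_HashBuffer(const void *ptr, Py_ssize_t len)
{
    assert(len >= 0);                /* debug-mode assertion on invalid sizes */
    if (len <= 0) {
        return 0;                    /* clamp: same result as hashing b"" */
    }
    return _Py_HashBytes(ptr, len);  /* existing internal hashing helper */
}
```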
Perhaps something like:
.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. This hash value is guaranteed to be
   equal to the hash value of a :class:`bytes` object with the same
   contents.

   This function is meant to ease implementation of hashing for
   immutable objects providing the :ref:`buffer protocol <bufferobjects>`.
I meant something like:
.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. The hash is guaranteed to match that of
   :class:`bytes`, :class:`memoryview`, and other built-in objects
   that implement the :ref:`buffer protocol <bufferobjects>`.

   Use this function to implement hashing for immutable
   objects whose `tp_richcompare` function compares
   to another object's buffer.
But the details can be hashed out in review.
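As a usage illustration of the proposed documentation, reusing the hypothetical `MyBufferObject` fields from the sketch earlier in the thread (an assumption, not code from the issue), `tp_hash` would be:

```c
static Py_hash_t
mybuffer_hash(PyObject *self)
{
    MyBufferObject *buf = (MyBufferObject *)self;
    /* Hash exactly the bytes exposed via bf_getbuffer and compared in
       tp_richcompare, so that hash(MyBuffer(b"xxx")) == hash(b"xxx"). */
    return Py_HashBuffer(buf->data, buf->size);
}
```

Keeping `tp_hash`, `tp_richcompare`, and `bf_getbuffer` on the same data is what preserves the invariant that objects comparing equal also hash equal.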
Well, I think this is overdue for a vote.
Ping @iritkatriel who didn't vote.
Sorry.
@pitrou: You can go ahead with the `Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)` API (and the proposed documentation); it's approved by the C API working group. I'm closing the issue.
I created https://github.com/python/cpython/issues/122854 to implement the function.
CPython has an internal API `Py_hash_t _Py_HashBytes(const void*, Py_ssize_t)` that implements hashing of a buffer of bytes, consistently with the hash output of the `bytes` object. It was added (by me) in https://github.com/python/cpython/commit/ce4a9da70535b4bb9048147b141f01004af2133d

It is currently used internally for hashing `bytes` objects (of course), but also `str` objects, `memoryview` objects, some `datetime` objects, and a couple other duties.

Third-party libraries may want to define buffer-like objects and ensure that they are hashable in a way that's compatible with built-in `bytes` objects. Currently this would mean relying on the aforementioned internal API. An example I'm familiar with is the `Buffer` object in PyArrow.

I simply propose making the API public and renaming it to `Py_HashBytes`, such that third-party libraries have access to the same facility.