Add `Py_hash_t Py_HashBuffer(const void*, Py_ssize_t)`

pitrou commented 9 months ago

CPython has an internal API Py_hash_t _Py_HashBytes(const void*, Py_ssize_t) that implements hashing of a buffer of bytes, consistently with the hash output of the bytes object. It was added (by me) in https://github.com/python/cpython/commit/ce4a9da70535b4bb9048147b141f01004af2133d

It is currently used internally for hashing bytes objects (of course), but also str objects, memoryview objects, some datetime objects, and a couple other duties:

>>> hash(b"xxx")
-7933225254679225263
>>> hash(memoryview(b"xxx"))
-7933225254679225263
>>> bio = io.BytesIO()
>>> bio.write(b"xxx")
3
>>> bio.getbuffer()
<memory at 0x7f288ed29a80>
>>> hash(bio.getbuffer().toreadonly())
-7933225254679225263

Third-party libraries may want to define buffer-like objects and ensure that they are hashable in a way that's compatible with built-in bytes objects. Currently this would mean relying on the aforementioned internal API. An example I'm familiar with is the Buffer object in PyArrow.

>>> import pyarrow as pa
>>> buf = pa.py_buffer(b"xxx")
>>> buf
<pyarrow.Buffer address=0x7f28e05bdd00 size=3 is_cpu=True is_mutable=False>
>>> buf == b"xxx"
True
>>> hash(buf)
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'pyarrow.lib.Buffer'

I simply propose making the API public and renaming it to Py_HashBytes, such that third-party libraries have access to the same facility.

encukou commented 9 months ago

This is necessary to implement a hashable object that compares equal to a bytes (or memoryview) with the same contents. Going a bit further: if we encourage allowing such comparisons, like PyArrow.buffer(b'xxx') == b'xxx' == MyCustomBuffer(b'xxx'), then we should encourage that PyArrow.buffer(b'xxx') == MyCustomBuffer(b'xxx'). (And of course the hashes need to match too.) This means that comparison methods of buffer classes should, when given an other argument of an unknown type, ask other for a buffer and compare its contents to what self would export. And if they do that, they should use _Py_HashBytes for hashing.

I'd name the public function Py_HashBuffer, even though it doesn't take the entire Py_Buffer struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.

vstinner commented 9 months ago

What's the behavior for negative length? Currently, it seems like negative length is cast to unsigned length and so have funny behavior. I suggest to change the length type to unsigned size_t.

I would be fine with Py_hash_t Py_HashBytes(const void *data, size_t size) API. FYI it returns 0 is size is equal to zero, but I would prefer to not put it in the API description.

Other examples of size_t parameters:

PyAPI_FUNC(PyObject*) PyLong_FromNativeBytes(const void* buffer, size_t n_bytes, int endianness);: recent function
static inline size_t _PyObject_SIZE(PyTypeObject *type): i converted it to a static inline function recently. The implementation is way easier if it returns an unsigned size.
void* PyObject_Malloc(size_t size)
PyAPI_FUNC(wchar_t *) Py_DecodeLocale(const char *arg, size_t *size);
PyAPI_FUNC(int) PyOS_snprintf(char *str, size_t size, const char *format, ...)

In the C library, size_t is common. Examples:

void *memcpy(void *dest, const void *src, size_t n);
ssize_t write(int fd, const void *buf, size_t count);: return type is interesting here :-)
int pthread_attr_setstacksize(pthread_attr_t *attr, size_t stacksize);

pitrou commented 9 months ago

Currently, it seems like negative length is cast to unsigned length and so have funny behavior. I suggest to change the length type to unsigned size_t.

All objects lengths in Python are signed, including the Py_buffer::len member.

I don't really care either way, but insisting on size_t seems a bit gratuitous to me :-)

vstinner commented 9 months ago

I don't really care either way, but insisting on size_t seems a bit gratuitous to me :-)

If you want to keep Py_ssize_t, I would ask the behavior for negative length to be defined, and different than "random crash" if possible.

pitrou commented 9 months ago

Negative length could for example be clamped to 0, and perhaps coupled with a debug-mode assertion?

vstinner commented 9 months ago

I'm fine with Py_HashBytes(data, -5) returning 0 (rather than crashing). We can document that the length must be positive :-) Sure an assertion is welcomed.

vstinner commented 9 months ago

I'd name the public function Py_HashBuffer, even though it doesn't take the entire Py_Buffer struct, to emphasize that it's meant to hash data you'd export using the buffer protocol.

If the parameter type is not Py_buffer, IMO Py_HashBuffer() name is misleading and I prefer proposed Py_HashBytes() name.

encukou commented 9 months ago

It's not a PyBytes object either. The reasoning behind Py_HashBuffer is that you should use the same data that you expose with bf_getbuffer, otherwise you get surprising behaviour.

@pitrou, I see you've added a thumbs-up to my comment. IMO, the docs will be fairly important in this case; I'd really like to frame it as “implementing equality & hashes for (immutable) buffer objects” rather than “here's a function you can use”. Do you want to draft it, or should I try my hand?

Negatives aren't the only case of invalid sizes. IMO it's OK if Py_HashBuffer(data, -1) has undefined behaviour, like e.g. Py_HashBuffer(data, SIZE_MAX/4) would (on common systems). A debug assertion is fine.

pitrou commented 8 months ago

Perhaps something like:

.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. This hash value is guaranteed to be
   equal to the hash value of a :class:`bytes` object with the same
   contents.

   This function is meant to ease implementation of hashing for
   immutable objects providing the :ref:`buffer protocol <bufferobjects>`.

encukou commented 8 months ago

I meant something like:

.. c:function:: Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len)

   Compute and return the hash value of a buffer of *len* bytes
   starting at address *ptr*. The hash is guaranteed to match that of
   :class:`bytes`, :class:`memoryview`, and other built-in objects
   that implement the :ref:`buffer protocol <bufferobjects>`.

   Use this function to implement hashing for immutable
   objects whose `tp_richcompare` function compares
   to another object's buffer.

But the details can be hashed out in review.

encukou commented 8 months ago

Well, I think this is overdue for a vote:

[x] @encukou: +1
[x] @gvanrossum
[x] @iritkatriel
[x] @vstinner: +1
[x] @zooba

vstinner commented 8 months ago

Ping @iritkatriel who didn't vote.

iritkatriel commented 8 months ago

Sorry.

vstinner commented 8 months ago

@pitrou: You can go ahead with Py_hash_t Py_HashBuffer(const void* ptr, Py_ssize_t len) API (and proposed documentation), it's approved by the C API working group. I close the issue.

vstinner commented 3 months ago

I created https://github.com/python/cpython/issues/122854 to implement the function.

capi-workgroup / decisions

Add `Py_hash_t Py_HashBuffer(const void*, Py_ssize_t)` #13