Add Py_HashDouble() function

vstinner commented 6 months ago

PR: https://github.com/python/cpython/pull/112449
API: int Py_HashDouble(double value, Py_hash_t *result)
- Set *result to the hash value and return 1 on success.
- Set *result to 0 and return 0 if the hash value cannot be calculated. For example, if value is not-a-number (NaN).
- result must not be NULL
Stable ABI: compatible
Not added to the limited C API yet

UPDATE: I updated the API documation to address @zooba's suggestion (make the API more generic, be less specific about NaN case).

Voters:

[x] Guido Van Rossum @gvanrossum
[x] Irit Katriel @iritkatriel
[x] Steve Dower @zooba
[x] Petr Viktorin @encukou

Since I proposed this API and I'm part of the C API Working Group, I will not vote on this decision.

zooba commented 6 months ago

I'm unclear why this needs to be a public API rather than going through a PyObject. What can you do with a primitive hash value for a primitive value?

Assuming that it does need a dedicated, public API, I prefer the return values as described in this post (and not the other variants that were tried), though I would prefer if they were described as "returns 1 if result is valid; otherwise 0" and says elsewhere that NaN cannot be hashed. It should also clearly say that it does not set a Python exception (and if it did, then I think the result should be negative).

vstinner commented 6 months ago

I'm unclear why this needs to be a public API rather than going through a PyObject

The rationale is explained in the issue. numpy currently uses the _Py_HashDouble() function: https://github.com/numpy/numpy/issues/25035

float_arrtype_hash() calls Npy_HashDouble() with (double)PyArrayScalar_VAL(obj, @name@) value:

https://github.com/numpy/numpy/blob/742920aef552b27806094f8ed6462d3cf1744785/numpy/_core/src/multiarray/scalartypes.c.src#L3629-L3633

I don't see how existing PyObject_Hash() API can be used for that.

Assuming that it does need a dedicated, public API, I prefer the return values as described in this post (and not the other variants that were tried), though I would prefer if they were described as "returns 1 if result is valid; otherwise 0" and says elsewhere that NaN cannot be hashed.

Isn't the proposed API?

Technically, the function returns 0 for not-a-number and it can be used as the hash value for some use cases. Hash conflict can lead to inefficient hash table but it's usually correct. If you want to mimick Python hash(float), you should call Py_HashPointer(obj) if the C double is not-a-number.

It should also clearly say that it does not set a Python exception (and if it did, then I think the result should be negative).

The function cannot fail: it cannot return 0 and it cannot set an exception. It can fix the doc to make it more explicit if needed.

zooba commented 6 months ago

numpy currently uses the _Py_HashDouble() function

So the rationale is: "libraries that implement their own numeric type that needs to compare equal to the built-in float type may use this to implement their tp_hash function".

I don't see how existing PyObject_Hash() API can be used for that.

The object_arrtype_hash function a few lines below your link does it. If you need an identical hash to an existing Python type, then you convert the value to that type and then call PyObject_Hash.

Isn't the proposed API?

Yes, but the way you describe it isn't a good way to document it. The return value should reflect the validity of the result or post conditions, rather than the values of the arguments. That way we retain freedom to change the reasons it may fail without having to change the documentation (and follow a deprecation process).

It's the difference between:

"returns 0 when you pass NaN" (bad - too specific)
"returns 0 when it couldn't calculate a hash value" (good)

And in the latter case, we can also document that we can't calculate hash values for NaN. But that's disconnected from the definition of the result, so if it one day turns out we can't calculate a hash for some other value then we can start returning an error without having to create a new function. That may be unlikely in this particular case, but the general principle applies elsewhere, and the docs here should be consistent with the other places where it is more likely that over-specifying the function could get us into trouble.

Again, it's just grammar. But in a specification, which is how our documentation is used, the grammar matters.

I can fix the doc to make it more explicit if needed.

Yes. Fixing the doc is what I was asking for. Don't assume that users will read our source code.

vstinner commented 6 months ago

"returns 0 when it couldn't calculate a hash value" (good)

In Python <= 3.9, hash(float("nan")) was just 0. It's not as if Python "cannot calculate" a hash value. Or that the hash value 0 will put your computer on fire. When proposed API return 0, it really means "not-a-number": as a hint that you might want to use a different hash value if you want.

In Python <= 3.9, all NaN float had a same hash value, even if they were different Python objects:

$ python3.9
Python 3.9.18 (main, Aug 28 2023, 00:00:00) 
>>> import math
>>> x=math.inf-math.inf
>>> y=math.inf-math.inf
>>> x is y
False
>>> hash(x), hash(y)
(0, 0)

So if these Python float objects are used as dict keys for example, hash collisions were more likely. The Python dict type is fine with hash collision, since NaN is not equal to NaN:

>>> x == y
False
>>> d={x: 1, y: 2}
>>> len(d)
2
>>> d[x], d[y]
(1, 2)
>>> d
{nan: 1, nan: 2}

Python 3.10 modified hash(float) by using object.__hash__(obj) for NaN floats to reduce hash collisions.

But maybe for some other usages, you might want to use something else.

There are 3 categories of float values:

finite: return (more or less) hash(int(value)) (in an optimized way) -- it can be equal to 0. IEEE 754 sub-normals values fall into this category as well.
infinite: return sys.hash_info.inf or -sys.hash_info.inf.
NaN: return 0.

Note: maybe we should have this discussion on https://github.com/python/cpython/pull/112449 PR or https://github.com/python/cpython/issues/111545 issue.

vstinner commented 6 months ago

If proposed API is confusing, an alternative is Py_hash_t Py_HashDouble(double value, int *is_nan). It might be less efficient if we accept NULL for is_nan. But such API may be less confusing.

Or it can be just Py_hash_t Py_HashDouble(double value) and the caller would have to call isnan(value) explicitly. It's less efficient, since Py_HashDouble() already calls isnan(value).

zooba commented 6 months ago

The proposed API isn't confusing, I just disagree with how you describe it.

We don't say "if <parameter> contains a specific value, the return value is 0". We say "if the result cannot be calculated, the return value is 0. One reason the result may not be calculated is if <parameter> contains a specific value."

Do you see the difference between those two statements?

vstinner commented 6 months ago

The proposed API isn't confusing, I just disagree with how you describe it.

I thought that the vote was restricted on the API.

If you have concerns on the implementation (doc, tests, whatever), I just suggest to move the discussion on the PR instead. I'm sure that the doc can be enhanced :-)

zooba commented 6 months ago

I have a concern with your API design.

That concern is that you designed: "when the input is NaN, returns 0".

I can tell this is your design because of the documentation.

I think your design should be: "when it fails to generate a hash, returns 0".

I'll be able to tell that you've fixed the design when the documentation is changed.

This is all about API design. The documentation you've written is how I can tell what you've designed.

Once it's released, we can't just change the documentation, because it will change the definition of the API in a subtly incompatible way. It doesn't affect the current implementation, but it could in the future, so I want to address it now so that we don't run into issues in the future.

Please don't treat this process as just a checkbox before you do whatever you want. We're trying to collaborate to produce a consistent, reliable, forward-compatible API that doesn't get us into trouble later on. If you think I'm just arbitrarily impeding you, feel free to make a complaint to the SC and get me removed from the WG - we have a process for this now.

vstinner commented 6 months ago

Well, I was thinking aloud since I'm not used to the C API Working Group. In short, I proposed to update the PR to address your concern, and then come back to the decision. I didn't propose to make a decision and change the API afterwards.

I can tell this is your design because of the documentation.

It's kind to refer to this API as "my design", but it is not mine 😁. It was a collective work on issue https://github.com/python/cpython/issues/111545. I only tried to lead the work to have a full change (implementation, doc, tests) addressing all concerns. I prefer to rely on other IEEE 754 experts as Mark Dickinshon who is the latest author who authored a major redesign of all hash() functions for "numbers (int, float, complex, Decimal) to have consistent hash values for all stdlib number types.

My concern is more that the PR was already approved twice, and now we are discussing further changes. It's a new situation to me, I don't know if the C API Working Group should take a decision, and then the PR is updated. Or the opposite: the WG suggests change, the PR is updated and maybe I wait to new approval, and then the WG makes a decision.

That being said, I applied your documentation suggestion in the description of this issue: https://github.com/capi-workgroup/decisions/issues/2#issue-2026844923 I'm fine with it, it's reasonable to make the API more generic, and less specific about "NaN" (only mention it as "one possible case" of "cannot calculate the hash value").

First, I expected "the API" to be int Py_HashDouble(double value, Py_hash_t *result) and nothing else. But I agree that the description of the parameters, output, error cases, etc. are part of the API. Well, that's also why I added it to this issue :-) So the API is the function definition and its documentation.

zooba commented 6 months ago

With the clarified return value semantics, I'm happy with this design.

vstinner commented 6 months ago

@gvanrossum, @iritkatriel, @encukou: so, what do you think of this API? Tell me if you need more details.

iritkatriel commented 6 months ago

LGTM. I left some comments on documentation on the PR.

gvanrossum commented 6 months ago

I am not reviewing the PR (no time) but I am okay with the name and spec of the new API, given that there is a real need (e.g. Numpy).

encukou commented 6 months ago

LGTM too; the new documentation is much better. Thank you!

vstinner commented 6 months ago

We agreed on int Py_HashDouble(double value, Py_hash_t *result), but Serhiy Storchaka and Mark Dickinson prefer the following API: https://github.com/python/cpython/pull/112449#issuecomment-1852435549

        Py_hash_t
        hash_double(PyObject *obj, double value)
        {
            if (Py_IS_NAN(value)) {
                return Py_HashPointer(obj);
            }
            else {
                return Py_HashDouble(value);
            }
        }

How should I handle this situation? Enforce C API Working Group decision? Reopen the discussion here? I respect Serhiy and Mark's opinion on IEEE 754 float and hashing since they are expert on this topic.

encukou commented 6 months ago

Try to come to an agreement. If it doesn't happen, the working group should decide. I'd be happy to be convinced by IEEE experts. But in this case I don't know what users would pass in as PyObject *obj, given that API is, AFAIK, for cases where you don't have a PyDouble object.

zooba commented 6 months ago

Yeah, if you've already got a PyObject * for the value, then just use PyObject_Hash on it.

This API is supposedly for situations where you have a float/double that is not yet a PyObject, and you don't want to construct one until you know it's necessary (and you've implemented a native collection type that can do hash and equality comparisons without needing a real PyObject - in other words, this is already a very specialised API that most people will never use).

vstinner commented 6 months ago

The Py_hash_t Py_HashDouble(double value) API is being discussed at: https://github.com/python/cpython/pull/112449#issuecomment-1852435549

vstinner commented 6 months ago

@gvanrossum @iritkatriel @zooba @encukou: I'm sorry, I would like to invite you to reconsider this decision! Mark Dickinson and Serhiy Storchaka would prefer Py_hash_t Py_HashDouble(double value) API:

Mark is the author of the unified hash implementation of all numbers in Python (int, float, complex, Decimal). He has a good background on mathematics and IEEE 754 floating point numbers. Serhiy also seems to have a good background in mathematics and floating points. They understand concepts such as rounding, NaN, ULP and other things better than me :-)

I propose to change the API to: Py_hash_t Py_HashDouble(double value)

If value is positive infinity, return sys.hash_info.inf.
If value is negative infinity, return -sys.hash_info.inf.
If value is not-a-number (NaN), return sys.hash_info.nan (0).
Otherwise, return the hash value of the finite value number.

This function cannot fail.

Second vote:

[x] Guido Van Rossum @gvanrossum
[x] Irit Katriel @iritkatriel
[x] Steve Dower @zooba
[ ] Petr Viktorin @encukou

See PR #113115 which implements this API.

3 weeks ago, I ran a micro-benchmark on Py_HashDouble():

API int Py_HashDouble(double v, Py_hash_t *result): 13.2 ns +- 0.0 ns
API Py_hash_t Py_HashDouble(double v): 12.3 ns +- 0.0 ns

So this API is a little bit faster: 13.2 ns => 12.3 ns (-0.9 ns).

gvanrossum commented 6 months ago

Okay on the simpler API.

encukou commented 6 months ago

-1 for that simpler API. IMO, it's too easy to misuse if you're not an expert floats. See my comment on the issue.

gvanrossum commented 6 months ago

Anything involving floats is easy to misuse if you're not an expert in floats.

On Mon, Dec 18, 2023 at 6:23 AM Petr Viktorin @.***> wrote:

-1 for that simpler API. IMO, it's too easy to misuse if you're not an expert floats. See my comment on the issue https://github.com/python/cpython/pull/112449#issuecomment-1860630376.

— Reply to this email directly, view it on GitHub https://github.com/capi-workgroup/decisions/issues/2#issuecomment-1860633470 or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWCWMX3OZM3H2PBLGSRYUDYKBGUVBFKMF2HI4TJMJ2XIZLTSOBKK5TBNR2WLJDUOJ2WLJDOMFWWLO3UNBZGKYLEL5YGC4TUNFRWS4DBNZ2F6YLDORUXM2LUPGBKK5TBNR2WLJDUOJ2WLJDOMFWWLLTXMF2GG2C7MFRXI2LWNF2HTAVFOZQWY5LFUVUXG43VMWSG4YLNMWVXI2DSMVQWIX3UPFYGLLDTOVRGUZLDORPXI6LQMWWES43TOVSUG33NNVSW45FGORXXA2LDOOJIFJDUPFYGLKTSMVYG643JORXXE6NFOZQWY5LFVE3TENZWHEZDENRQQKSHI6LQMWSWS43TOVS2K5TBNR2WLKRSGAZDMOBUGQ4TEM5HORZGSZ3HMVZKMY3SMVQXIZI . You are receiving this email because you commented on the thread.

Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub .

-- --Guido van Rossum (python.org/~guido)

gvanrossum commented 6 months ago

Random thought: Maybe hash() of a NaN should hash based on the bits of the NaN rather than on the id() of the object? Then Py_HashDouble(double) could do the right thing for NaNs. Or is __eq__ for NaN also tied to the object id()?

zooba commented 6 months ago

I think someone had a scenario where they wanted to keep a lot of "distinct" NaNs as keys in a dict without facing all the hash collisions.

As a scenario, it seems a bit of a stretch, but it shouldn't really matter for this case. All that id() is used for is randomisation, which means we could simply say that Py_HashDouble(NaN) returns some fixed/specific value (as currently proposed) and the caller will need to randomise it if they care about hash collisions. __eq__ for NaN should always return false, even for itself[^1] and so the actual hash value is irrelevant, provided it is unlikely to be the same for two "distinct" NaNs.

[^1]: Though I wouldn't be surprised if we did an is check first for performance, to save even looking up __eq__.

gvanrossum commented 6 months ago

So I think the desire here is to have an API that works given just the double, even if that double contains a NaN. The code contains this comment on the current strategy for NaN (by Raymond):

   NaNs hash with a pointer hash.  Having distinct hash values prevents
   catastrophic pileups from distinct NaN instances which used to always
   have the same hash value but would compare unequal.

For good measure, here's the PR: https://github.com/python/cpython/pull/25493, and the issue: https://github.com/python/cpython/issues/87641 (bpo: https://bugs.python.org/issue43475). The issue discussion goes on forever but only seems to consider the choice between a constant hash value for all NaNs and a a hash based on hash(id(x)) (it also considers fixing NaN equality, but not seriously). Basing the hash on all bits of the value doesn't seem to be considered.

Anyway, I like the idea of returning some fixed value if the input is a NaN, and letting the caller sort out using the object id hash, if they care. (I imagine that this function is used in numpy to hash float/double values that don't have a corresponding object yet, e.g. when calculating the hash of an array? In that case requiring an object to be passed in isn't really helpful.)

ngoldbaum commented 6 months ago

I imagine that this function is used in numpy to hash float/double values that don't have a corresponding object yet, e.g. when calculating the hash of an array? In that case requiring an object to be passed in isn't really helpful.

Sorry to chime in from the peanut gallery. I'm a numpy developer and happened to come across this thread this evening.

Numpy has a concept of scalar types that are distinct from a zero-dimensional array. One big difference from arrays, which are mutable, is that scalars are immutable and hashable, and wrap a single C data value. Numpy currently uses _Py_HashDouble to implement tp_hash for the float and complex scalar types.

gvanrossum commented 6 months ago

Could you link to the code? I’m curious what object is passed in (especially in the complex case).

ngoldbaum commented 6 months ago

For the complex case the real and imaginary hashes are combined:

https://github.com/numpy/numpy/blob/021787a600cee9350234185f71a6349db83c699f/numpy/_core/src/multiarray/scalartypes.c.src#L3638

This is in a templated source file using NumPy's weird custom templating system but you can get the gist if you squint.

Search elsewhere in this file for Npy_HashDouble, which is a thin wrapper around _Py_HashDouble.

gvanrossum commented 6 months ago

Okay, so that exists just to do the same thing as CPython. So if we made a better API, we’d just force numpy to change the code there (or in Npy_HashDouble, whereas if we kept the same API but made it public, they’d have to do almost nothing (just a conditional rename). So let’s do the latter.

--Guido (mobile)

On Wed, Dec 20, 2023 at 19:19 Nathan Goldbaum @.***> wrote:

For the complex case the real and imaginary hashes are combined:

https://github.com/numpy/numpy/blob/021787a600cee9350234185f71a6349db83c699f/numpy/_core/src/multiarray/scalartypes.c.src#L3638

This is in a templated source file using NumPy's weird custom templating system but you can get the gist if you squint.

Search elsewhere in this file for Npy_HashDouble, which is a thin wrapper around _Py_HashDouble.

— Reply to this email directly, view it on GitHub https://github.com/capi-workgroup/decisions/issues/2#issuecomment-1865419756 or unsubscribe https://github.com/notifications/unsubscribe-auth/AAWCWMRLX5LYUOV6BBKGKKLYKOTD7BFKMF2HI4TJMJ2XIZLTSOBKK5TBNR2WLJDUOJ2WLJDOMFWWLO3UNBZGKYLEL5YGC4TUNFRWS4DBNZ2F6YLDORUXM2LUPGBKK5TBNR2WLJDUOJ2WLJDOMFWWLLTXMF2GG2C7MFRXI2LWNF2HTAVFOZQWY5LFUVUXG43VMWSG4YLNMWVXI2DSMVQWIX3UPFYGLLDTOVRGUZLDORPXI6LQMWWES43TOVSUG33NNVSW45FGORXXA2LDOOJIFJDUPFYGLKTSMVYG643JORXXE6NFOZQWY5LFVE3TENZWHEZDENRQQKSHI6LQMWSWS43TOVS2K5TBNR2WLKRSGAZDMOBUGQ4TEM5HORZGSZ3HMVZKMY3SMVQXIZI . You are receiving this email because you commented on the thread.

Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub .

mdickinson commented 6 months ago

@gvanrossum

Basing the hash on all bits of the value doesn't seem to be considered.

It doesn't solve the problem in https://github.com/python/cpython/issues/87641: in the problematic situation we have a large number of NaN objects, all of which have the same NaN bit pattern (the bit pattern of np.nan) but different ids. So a hash based on the NaN bit pattern would still end up generating large numbers of hash table collisions.

There's a mass of compromises here, and no really good solution. I think the "right" solution is to regard all NaNs as equal for the purposes of dict/set containment (or perhaps as a refinement to regard NaNs with the same bit pattern as equal), so that a dict or set can only have a single NaN in it. But even leaving backwards compatibility concerns aside, I don't see how we special-case NaNs in the general-purpose dict/set/hash machinery without introducing new special methods and some sort of third "for use in containment checks" equality relation that's distinct from both == and is. (I suppose we could actually special-case the float type, but where does that leave things like Decimal?) Any general-purpose solution seems too heavyweight to solve a problem that's really only caused by one value in one type.

gvanrossum commented 6 months ago

So @mdickinson do you agree that we should just promote something with the same (object*, double) signature?

mdickinson commented 6 months ago

do you agree that we should just promote something with the same (object*, double) signature?

Sure, that seems fine to me. I'd also be perfectly happy with the latest version of @vstinner's proposal (as in this comment); for me, the dangers of returning 0 for NaNs aren't big enough to disqualify that proposal.

gvanrossum commented 6 months ago

It’s just more effort to adapt existing code that needs compatibility with Python numeric values. And that’s why this API exists.

encukou commented 5 months ago

That gets me thinking that _Py_HashDouble was... fine as it was. It exists to provide what CPython does, the way CPython does it; and if its signature or implementation ever changes, all of its callers should at least look at their calls to see if they're still valid.

Why not give it the modern name, PyUnstable_HashDouble, and let PyFloat_Type->tp_hash be the API for general use? (And if we do that, then per PEP 689 the underscored name should stay as an alias – so NumPy doesn't need any code change.)

The downside is that this doesn't help the API graduate into the stable ABI – but the conversations I'd like to have for that would look quite far off the mark here. (mostly: What should this do on implementations without IEEE floats?)

vstinner commented 5 months ago

let PyFloat_Type->tp_hash be the API for general use?

PyFloat_Type->tp_hash() takes a Python float object, whereas the proposed API takes a C double. The proposed API is to implement a compatible hash function to float-like types which don't inherit from Python float type.

vstinner commented 5 months ago

(mostly: What should this do on implementations without IEEE floats?)

Building Python requires IEEE 754 (with not-a-number) since Python 3.11: https://docs.python.org/dev/using/configure.html

vstinner commented 5 months ago

I propose to change the API to: Py_hash_t Py_HashDouble(double value)

Votes: 4 people are in favor of this API (me included), Petr is against it.

Then another (object*, double) API was discussed. Guido: I'm now confused if you are ok with Py_hash_t Py_HashDouble(double value) or not.

Also, PEP 731 doesn't mention specific guidelines for votes. Should I have to reach 100% majority (no "veto" vote)? Or is 4 "+1" and 1 "-1" means that the "group" accepts the proposed API?

gvanrossum commented 5 months ago

I think the most helpful thing for current users is to pass both a double and an object, so I am now in favor of that. I think I explained my reasoning above.

encukou commented 5 months ago

I guess I don't fully understand the (double, object) proposal, in the sense that I couldn't write good docs for it: what is that object? AFAIK it's only been explained in terms of the implementation: if the double is not hashable, Py_HashDouble will return a hash of the object. But what should you pass in? It seems it's meant to be some “container” that holds the float value (like a Python float, or a NumPy scalar/matrix). Is that right? That would mean float values can only be hashed if they're in a PyObject* “container”. Is that a reasonable limitation of Python's float hashing algorithm?

Also, PEP 731 doesn't mention specific guidelines for votes. Should I have to reach 100% majority (no "veto" vote)? Or is 4 "+1" and 1 "-1" means that the "group" accepts the proposed API?

Either way, the original proposal here would win. I think our first draft of a formal voting process isn't really working in this issue.

Building Python requires IEEE 754 (with not-a-number) since Python 3.11

Yup, but that's CPython. I would like stable ABI to be usable across implementations (particularly because far-future versions of CPython essentially are alternate implementations). But as I said, that's a whole other conversation.

vstinner commented 5 months ago

@gvanrossum:

I think the most helpful thing for current users is to pass both a double and an object, so I am now in favor of that.

Hum, it seems like we are going into circles since that was my first proposed API, Petr and Mark Dickinson didn't like it so I proposed other APIs.

So far, I proposed 3 APIs as 3 PR.

API A (double+obj)

Py_hash_t PyHash_Double(double value, PyObject *obj) => https://github.com/python/cpython/pull/112095

PyHash_Double(value, obj) returns _Py_HashPointer(obj) if value is not-a-number.
PyHash_Double(value, NULL) is valid and returns _PyHASH_NAN (0) if value is not-a-number

@mdickinson (comment) and @encukou (comment) didn't like the API. @encukou asked for:

int PyHash_Double(double value, Py_hash_t *result)
// -> 1 for non-NaN (*result is set to the hash)
// -> 0 for NaN (*result is set to 0)

@serhiy-storchaka liked (comment) this API.

Later, Guido proposed again such API and so is in favor of it. Mark wrote "Sure, that seems fine to me".

Updates

There were discussions about functions which "cannot fail": https://github.com/capi-workgroup/api-evolution/issues/43

There were also discussions on the function name: PyHash_Double() vs Py_HashDouble().

Benchmarks were also run on the different APIs.

I added Py_hash_t Py_HashPointer(const void *ptr) and I wrote tests on the existing PyHash_GetFuncDef() API.

API B (return int)

int Py_HashDouble(double value, Py_hash_t *result) => https://github.com/python/cpython/pull/112449

Return 0 if value is not-a-number (and set *result to 0).
Set *result to the hash value and return 1 otherwise.

@mdickinson and @encukou approved the PR. So I created this issue for the C API Working Group and shortly the Working Group approved this API as well.

@serhiy-storchaka suggested a different API: Py_hash_t hash_double(PyObject *obj, double value) (more or less API 1).

API C (double => Py_hash_t)

Some people were not fully satistified with API B, so I proposed a new one!

Py_hash_t Py_HashDouble(double value) => https://github.com/python/cpython/pull/113115

@serhiy-storchaka approved it. @mdickinson prefers this API.

@encukou dislikes this API: he sees it as being too error-prone.

Ranking

Let me try to summarize people preferences on these API by raking them.

APIs:

API A: Py_hash_t PyHash_Double(double value, PyObject *obj)
API B: int Py_HashDouble(double value, Py_hash_t *result)
API C: Py_hash_t Py_HashDouble(double value)

Ranking:

@mdickinson: API C > API B > API A
@serhiy-storchaka: API C > API A -- dislike API B
@encukou: API B -- dislike API A and API C
@vstinner: API C > API B > API A
@gvanrossum: API A > API B
@iritkatriel is fine with API B and API C (no preference expressed)
@zooba is fine with API B and API C (no preference expressed)

Attempt to list supporters per API (+2 for first preferred API, +1 for second preferred, no vote for 3rd preferred):

API A (2 persons, +3)
- +2 @gvanrossum
- +1 @serhiy-storchaka but he prefers API C
API B (6 persons: +7)
- +2 @encukou
- +1 @mdickinson
- +1 @vstinner
- +1 @gvanrossum
- +1 @iritkatriel
- +1 @zooba
API C (5 persons: +8)
- +2 @mdickinson
- +2 @serhiy-storchaka
- +2 @vstinner
- +1 @iritkatriel
- +1 @zooba

I'm not sure that I collected correctly preferences of each person involved in the discussions.

It seems like Irit and Steve only looked at API B and then API C proposed here, they didn't express their opinion about API A.

Variants

For API A, I proposed to allow passing NULL for the object, which makes the API a "superset" of API C.

Petr considers that API C is error-prone since developers don't read the documentation and so can be affected by high hash collision if there are many not-a-number stored in a hash table / dict. Others people don't see it as a blocker issue.

For API C, example of recipe for numpy:

Py_hash_t hash_double(double value, PyObject *obj)
{
    if (Py_IS_NAN(value)) {
        return Py_HashPointer(obj);
    }
    else {
        return Py_HashDouble(value);
    }
}

gvanrossum commented 5 months ago

It seems I hadn't actually explained my reason for changing my preference back to API A properly.

The evidence is in this comment from a numpy maintainer: https://github.com/capi-workgroup/decisions/issues/2#issuecomment-1865419756

This shows how the object is used (the object passed in may be a larger object whose hash is being computed, as in the example of the hash of a numpy complex number).

My point is that only API A makes it simple for numpy and others like it to switch from using an internal API to the new public API using a simply #define. All other proposals would require them to maintain two separate variants of their hashing code, one for 3.13 and one for earlier versions.

zooba commented 5 months ago

If API A is clearly said to be "for implementing tp_hash on a float-like type", that the object parameter should be the instance, and that the object parameter is allowed to be NULL (in which case the result when it was necessary is either 0 or an identifiable error code distinct from any other failure reason), I'm okay with that one.

I'm still only barely convinced that this is worth adding a public API, even an unstable one. It's still going to be totally valid to construct an actual float from the value and get its hash, so we have a slow+compatible option already. If the perf is important, copying the function in its entirety and keeping it updated in new versions should be worth the effort. (Though I admit this is mostly about setting a boundary - we do not yet have a problem of making public C API for every single micro-function just for someone who wants both speed and compatibility, but I worry that may become more of a problem.)

gvanrossum commented 5 months ago

If the perf is important, copying the function in its entirety and keeping it updated in new versions should be worth the effort.

This doesn't sound right to me. Perf is definitely important for numpy when doing this for an array (which they do; complex was just the simplest example). The numpy team shouldn't be required to watch the code in that particular file for changes. The API is for sure more stable than the implementation, so they can't get a compiler warning for this -- they'd have to implement a test that dynamically compares the value of the hashes they compute to the hashes computed by the CPython runtime for a wide variety of floats, and that seems just the wrong way about it when we can just export the proper API. Also note that we effectively already export the proper API (albeit as a private function) -- this problem is almost entirely of our own making since we decided we want to make it hard for 3rd parties to use private functions.

serhiy-storchaka commented 5 months ago

How to compute the hash of multicomponent value? You call Py_HashDouble() for every component and combine results. If several components are NaNs, API A will give you the same hash for them. Combining a hash with itself can give more predicable result if you use something simple (like add or xor). You also repeatedly calculate the hash of the pointer.

API B and API C allow to stop when you find a NaN and return Py_HashPointer(obj). With API B you check the result of the call, with API C you check for NaN before the call.

ngoldbaum commented 5 months ago

Perf is definitely important for numpy when doing this for an array (which they do; complex was just the simplest example).

For what it's worth, for numpy and this API specifically, we would only be calling this function if someone needs the hash of a float or complex scalar, e.g. if they make one a dict key. Arrays don't have hashes because they are mutable.

gvanrossum commented 5 months ago

Ah, together those are a compelling argument not to care so much about either perf or convenience for numpy. So then I'm okay with all three APIs.

encukou commented 5 months ago

a compelling argument not to care so much about either perf or convenience for numpy

Indeed. If either of those was important, my preferences would be very different :)

For a summary, rather than focus on the history of the proposals, I'd rather focus on the proposals themselves -- specifically, the remaining reasons people have to not prefer the API. It's hard to judge, as people's understanding and opinions evolve. Maybe it's something like:

API A (double, PyObject *obj) -> Py_hash_t
- assumes a container -- a loss in generality.
API B (double, Py_hash_t *result) -> int
- is slower.
API C (double) -> Py_hash_t
- hides a potentially dangerous edge case.

Anything missing?

If API B's slowness is an issue, I'll be happy to change my preferences (after trying a faster implementation).

mdickinson commented 5 months ago

@encukou

assumes a container -- a loss in generality

I think there may be a misunderstanding here. In the NumPy use-case, there's a scalar value (e.g., of type np.float32) and we want to compute a hash for it in a way that's consistent with equality (e.g., equality with Python floats). That scalar value is a Python object, containing a raw C float. For API A, NumPy would pass that float (cast to double) as the first argument, and the wrapper object as the second argument. I don't think there's any container here (except in the same sense that a Python float can be regarded as a container for a single C double).

mdickinson commented 5 months ago

Apologies; after rereading, the misunderstanding is mine.

except in the same sense that a Python float can be regarded as a container for a single C double

... and that's exactly the sense of container that you meant. Mea culpa.

vstinner commented 5 months ago

IMO the 3 discussed APIs are fine, there are no huge difference between them. Well, some people have opinions about one specific aspect of these APIs: report error, performance, etc.

Now, how can we take a decision to move on?

capi-workgroup / decisions