Add `PyUnicode_Equal()` function

vstinner commented 1 month ago

I propose to add a public PyUnicode_Equal(a, b) function to the limited C API 3.14 to replace the private _PyUnicode_EQ() function:

API: int PyUnicode_Equal(PyObject *a, PyObject *b)

Return 1 if a is equal to b.
Return 0 if a is not equal to b.
Set a TypeError exception and return -1 if a or b is not a Python str object.
The function always succeed if a and b are strings.

Python 3.13 moved the private _PyUnicode_EQ() function to internal C API. mypy and Pyodide are using it.

The existing PyUnicode_Compare() isn't enough and has an issue. PyUnicode_Compare() returns -1 for "less than" but also for the error case. The caller must call PyErr_Occurred() which is inefficient. It causes an ambiguous return value: https://github.com/capi-workgroup/problems/issues/1

PyUnicode_Equal() has no such ambiguous return value (-1 only means error). Moreover, PyUnicode_Equal() may be a little bit faster than PyUnicode_Compare(), but I'm not sure about that.

Vote to add this API:

[x] @encukou
[x] @erlend-aasland
[x] @mdboom
[x] @serhiy-storchaka
[x] @vstinner
[x] @zooba

zooba commented 1 month ago

API shape seems fine, I prefer PyUnicode_IsEqual as a name.

I find the performance argument fairly compelling, since ordering requires a lot of calculation (and realistically requires a number of options, including locale, to be correct). Having a fast equals method is likely to prevent people from relying on interning strings and comparing pointers.

However, if the motivation really is that PyUnicode_Compare has problems, then shouldn't we be creating a better compare function? One that can take an "equals only" flag to be quick, but can also gain more flags (e.g. case insensitive, treat numbers logically, locale-aware ordering, etc.).

Knowing whether the people using the private function were looking specifically for speed or just using the first function their IDE told them about would help us make an informed decision here. I don't like when the proposal doesn't actually solve the problem given as the motivation (but am happy to change the motivation in this case).

zooba commented 1 month ago

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative (perhaps paired with PyUnicode_Check if the caller really doesn't know what types they've got, but almost certainly in most cases where they have enough calls to justify a performance optimisation they're going to be responsible for having created the values, and will know that they're strings already).

ZeroIntensity commented 1 month ago

Not that my opinion matters much here, but I might as well share my thoughts: I think a faster comparison protocol is a great idea (and one I'd be happy to help work on), but probably better suited for api-revolution rather than this proposal. PyUnicode_Equal/PyUnicode_IsEqual is simply a convenient drop in for extensions that were relying on the private API, nothing more, nothing less.

Though, maybe it's going a bit far to add this to the stable ABI -- the main beneficiaries from this are people who weren't using the limited API anyway, so really we're just adding another function to do the same thing, but with less version compatibility than PyUnicode_Compare. Why not mark this as an unstable API? That way, if we do add a better comparison interface in the future, then we won't have to continue maintaining PyUnicode_Equal.

encukou commented 1 month ago

Both projects use it for handling keyword arguments. I expected either that or attribute names (e.g. implementing descriptor-like behaviour in tp_getattro). IMO, that's a better fit for the C API than, say, PyLong_IsNegative, where the uses are more application-logic-y.

I think locale-aware comparisons have no place in the PyUnicode_* namespace. At this level, strings are sequences Unicode codepoints. (FWIW, locale-aware equality involves a lot of computation too. If you want 'é' == 'é' to be True, you should not use a PyUnicode_* function.)

+1 for PyUnicode_Equal. The docs should mention that the function works for str subclasses, but does not honor custom __eq__/tp_richcompare. (We don't have a lot of this kind of info in the docs, but we should start adding it.)

vstinner commented 1 month ago

API shape seems fine, I prefer PyUnicode_IsEqual as a name.

As naming, PyUnicode_EqualToUTF8() function was added to Python 3.13. You might want to use a similar name. But I don't have a strong opinion, English is not my first language :-)

I find the performance argument fairly compelling

I ran a benchmark on strings of 10 characters:

PyUnicode_Equal: Mean +- std dev: 4.57 ns +- 0.11 ns
PyUnicode_Compare: Mean +- std dev: 5.87 ns +- 0.13 ns
PyUnicode_RichCompare: Mean +- std dev: 6.96 ns +- 0.22 ns

PyUnicode_Equal() is the fastest function: 1.3x faster than PyUnicode_Compare(). But we're talking about nanoseconds here: it's only 1.3 ns faster than PyUnicode_Compare().

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative

It returns PyObject* which is less convenient, requires calling Py_DECREF(), and is less efficient.

zooba commented 1 month ago

I ran a benchmark on strings of 10 characters:

On strings that are equal? Or strings that are not equal? The distribution of strings makes a big difference here. If they're all the same length and random, then all three functions will have to compare the first character. If they're different lengths, Equal will be faster, and if they share prefixes the characteristics will change again.

That said, performance in an equality function typically matters the most for when the strings are not equal, since that's going to be the more common tight loop (i.e. when you find the one that matches, you stop searching).

But if you want to say that performance isn't that compelling, then I'll say they can all switch to one of the other functions. This is an easier argument to make if you say that performance is the main reason they need it ;)

Also, we need to rule out PyObject_RichCompareBool as a suitable alternative

It returns PyObject* which is less convenient, requires calling Py_DECREF(), and is less efficient.

You're thinking of PyObject_RichCompare. The ...Bool version returns int.

serhiy-storchaka commented 1 month ago

_PyUnicode_EQ() (and non-exported unicode_eq()) are used in performance-critical code. They never raise an exception, because checking argument type in tight loop when we know that they are strings is a waste. Users of _PyUnicode_EQ() can still prefer to use _PyUnicode_EQ() over PyUnicode_Equal() if the latter adds such checks.

Victor, compare also _PyUnicode_EQ() with PyUnicode_Equal().

To add confusion, there is also much more used (but in less performance-critical code) _PyUnicode_Equal(). It has the same interface and semantic as _PyUnicode_EQ(), but different implementation. I wonder which function is faster for different data.

vstinner commented 1 month ago

Victor, compare also _PyUnicode_EQ() with PyUnicode_Equal().

Here are updated benchmarks: https://github.com/python/cpython/pull/124504#issuecomment-2376471769 (now with CPU isolation).

To add confusion, there is also much more used (but in less performance-critical code) _PyUnicode_Equal(). It has the same interface and semantic as _PyUnicode_EQ(), but different implementation. I wonder which function is faster for different data.

_PyUnicode_Equal() is 1.1x faster than _PyUnicode_EQ() (-1.0 ns) ;-)

serhiy-storchaka commented 1 month ago

How can PyUnicode_Equal() be faster than _PyUnicode_EQ() which it wraps?

vstinner commented 1 month ago

How can PyUnicode_Equal() be faster than _PyUnicode_EQ() which it wraps?

PyUnicode_Equal() is implemented with _PyUnicode_Equal().

We shoud modify _PyUnicode_Equal() and _PyUnicode_EQ() to always use the fastest implementation in both cases. Or even remove of these two functions which are redundant. _PyUnicode_EQ() can no longer be used outside CPython, the function is not exported. Both functions are only part of the internal C API.

serhiy-storchaka commented 1 month ago

Thank you Victor. I think that adding PyUnicode_Equal() for convenience and performance would be nice. There are PyUnicode_EqualToUTF8() and PyUnicode_EqualToUTF8AndSize() in the limited C API, so PyUnicode_Equal() would fit the same row.

But the key point of PyUnicode_EqualToUTF8(), PyUnicode_EqualToUTF8AndSize() and private functions _PyUnicode_Equal() and _PyUnicode_EQ() is that they only return 0/1 and never fail. So you can use them in expression and do not add a boilerplate to check for error. If new PyUnicode_Equal() start to check argument type, raise an exception and return -1, it will lose a large part of its advantage. Type check also adds the 0.4 ns overhead according to your tests.

Having different requirements and guaranties for PyUnicode_EqualToUTF8() and PyUnicode_Equal() would also be error provoking.

vstinner commented 1 month ago

If new PyUnicode_Equal() start to check argument type, raise an exception and return -1, it will lose a large part of its advantage. Type check also adds the 0.4 ns overhead according to your tests.

According to you, what should be the behavior if you compare string to an integer? Return 0? Raise an exception? Fail with a fatal error?

What if both arguments are not strings?

The important part of the API is:

The function always succeed if a and b are strings.

serhiy-storchaka commented 1 month ago

What is the behavior if you compare string to NULL? What is the behavior if you compare non-canonicalized strings (for example with kind=UCS4, but all code points < 0x10000)? This is an undefined behavior.

Very few functions check that the argument is not NULL. Functions usually do not check that strings or integers are in canonized form, they imply that this is true.

vstinner commented 1 month ago

@zooba @encukou: @serhiy-storchaka suggests to declare that comparing objects which are not strings is an undefined behavior for best performance. I suggest to check types and return a TypeError in this case. What's your call on this question?

zooba commented 1 month ago

What's your call on this question?

No undefined behaviour in the limited API. We must check types.

(I would lean towards no undefined behaviour in any public API, including unstable APIs, but I'm prepared to consider exceptional cases. Especially in "unstable". But definitely not in the limited API.)

vstinner commented 1 month ago

On strings that are equal? Or strings that are not equal? The distribution of strings makes a big difference here.

My benchmark was on equal strings of 10 characters.

New benchmark on inequal strings of 10 characters.

PyUnicode_Equal() is always the fastest public function.

zooba commented 1 month ago

PyUnicode_Equal() is always the fastest public function.

Not by enough to make me really excited about it. I guess my gut instinct about perf benefits didn't work this time - perhaps there are other (unavoidable?) overheads.

vstinner commented 1 month ago

Not by enough to make me really excited about it.

Sorry :-) Well, there are cases where it's more interesting.

Different string length

str1 = "1" * 10
str2 = "1"

PyUnicode_Equal: Mean +- std dev: 3.50 ns +- 0.01 ns
PyUnicode_Compare: Mean +- std dev: 10.5 ns +- 0.0 ns
PyUnicode_RichCompare: Mean +- std dev: 8.38 ns +- 0.02 ns

PyUnicode_Equal() is 3.0x faster than PyUnicode_Compare(). It becomes 7.2x faster for strings of 1000 and 1001 characters.

In the case, PyUnicode_Equal() is O(1) and PyUnicode_Compare() is O(n).

Different string kinds (UCS-1 and UCS-2)

str1 = "1" * 9 + "$"
str2 = "1" * 9 + "\u20ac"

PyUnicode_Equal: Mean +- std dev: 3.94 ns +- 0.01 ns
PyUnicode_Compare: Mean +- std dev: 14.0 ns +- 0.0 ns
PyUnicode_RichCompare: Mean +- std dev: 7.88 ns +- 0.01 ns

PyUnicode_Equal() is 3.6x faster than PyUnicode_Compare(). It becomes 171.6x faster for strings of 1000 characters.

In the case, PyUnicode_Equal() is O(1) and PyUnicode_Compare() is O(n).

zooba commented 1 month ago

Oh yeah, I forgot about that case (and definitely remembered it the first time around). Okay, consider me excited about the perf benefits again

serhiy-storchaka commented 1 month ago

Under hood, both PyUnicode_Equal and PyUnicode_RichCompare have the same O(1) complexity for strings with different length or kind. PyUnicode_RichCompare adds more O(1) overhead.

Now, from your comparison of _PyUnicode_Equal and PyUnicode_Equal, the latter is 0.4-0.9 ns slower. Comparing it with your results for strings with different length or kind, PyUnicode_Equal should be 10-30% slower than _PyUnicode_Equal. This is why I complains about type checks. If we maximize performance, this is too high cost.

As for undefined behavior, all functions in the C API have undefined behavior if pass pointers to a freed memory. Many (including all mentioned here functions) have undefined behavior if pass NULL pointers (PyUnicode_Check(NULL) usually crashes). We should not be afraid of undefined behavior if the function is used improperly. It is unavoidable.

vstinner commented 1 month ago

I made a small study on PyUnicode methods in the limited C API. On my small study, 1/3 of functions don't check types, whereas 2/3 check types. PyUnicode_GetLength() which is commonly used checks its argument type.

Don't check arguments type (7):

PyUnicode_Substring()
PyUnicode_AsUCS4()
PyUnicode_AsUCS4Copy()
PyUnicode_CompareWithASCIIString()
PyUnicode_EqualToUTF8()
PyUnicode_EqualToUTF8AndSize()
PyUnicode_IsIdentifier()

Check arguments type (14):

PyUnicode_GetLength()
PyUnicode_ReadChar()
PyUnicode_WriteChar()
PyUnicode_Resize()
PyUnicode_InternInPlace()
PyUnicode_AsWideChar()
PyUnicode_AsWideCharString()
PyUnicode_AsDecodedObject()
PyUnicode_Concat()
PyUnicode_Append()
PyUnicode_Split()
PyUnicode_RichCompare()
PyUnicode_Format()
PyUnicode_Contains()

Now, from your comparison of _PyUnicode_Equal and PyUnicode_Equal, the latter is 0.4-0.9 ns slower.

Compared to PyUnicode_Compare(), the difference with PyUnicode_Equal() is that it is O(1) if string lengths are different or if strings kind are different.

I'm not sure that 0.4-0.9 ns is a big deal for such API. It's already faster than PyUnicode_Compare() and PyUnicode_RichCompare() in all tested cases.

vstinner commented 1 month ago

@erlend-aasland @zooba: Would you mind to vote?

@erlend-aasland: What's your opinion on checking the arguments type? Do you prefer raising TypeError or have an undefined behavior for a little performance overhead?

zooba commented 1 month ago

Voted in favour (still begrudging the name, but I suspect I'm outvoted on adding Is), but I would like to see this addition suggested by Petr:

The docs should mention that the function works for str subclasses, but does not honor custom __eq__/tp_richcompare. (We don't have a lot of this kind of info in the docs, but we should start adding it.)

vstinner commented 1 month ago

I would like to see this addition suggested by Petr:

Oh right, I just added it to the PR to not forget.

encukou commented 1 month ago

[Serhiy] suggests to declare that comparing objects which are not strings is an undefined behavior for best performance. I suggest to check types and return a TypeError in this case. What's your call on this question?

New API should take PyObject*, type-check and raise TypeError if necessary; it should not check for freed pointers or NULL. It's not about avoiding undefined behaviour in general, but about what kind of safety users can expect. (In new API, passing any object as an PyObject* arg is safe; passing NULL or freed memory is not.)

We had this conversation around PyLong_Sign, and in epi-evolution before that, and I don't think anything changed since then. It wasn't my first choice, but that's what we decided; we should stick to it and not revisit it with each new API.

(If necessary, we can add PyUnicode_Equal_Unchecked, with a warning in the name.)

vstinner commented 1 month ago

(If necessary, we can add PyUnicode_Equal_Unchecked, with a warning in the name.)

Honestly, for less than a nanosecond, I don't think that it's worth it to have two functions.

serhiy-storchaka commented 1 month ago

Well, it was not a principled objection.

erlend-aasland commented 1 month ago

@erlend-aasland @zooba: Would you mind to vote?

✅

@erlend-aasland: What's your opinion on checking the arguments type? Do you prefer raising TypeError or have an undefined behavior for a little performance overhead?

I'm share Steve's view: I prefer safe APIs over UB.

vstinner commented 1 month ago

All members voted in favor of PyUnicode_Equal(): the API is approved. Thanks everybody. I close the issue.

capi-workgroup / decisions

Add `PyUnicode_Equal()` function #43

Different string length

Different string kinds (UCS-1 and UCS-2)