Closed: vstinner closed this issue 1 month ago.
API shape seems fine, I prefer PyUnicode_IsEqual as a name.
I find the performance argument fairly compelling, since ordering requires a lot of calculation (and realistically requires a number of options, including locale, to be correct). Having a fast equals method is likely to prevent people from relying on interning strings and comparing pointers.
However, if the motivation really is that PyUnicode_Compare has problems, then shouldn't we be creating a better compare function? One that can take an "equals only" flag to be quick, but can also gain more flags (e.g. case insensitive, treat numbers logically, locale-aware ordering, etc.).
Knowing whether the people using the private function were looking specifically for speed or just using the first function their IDE told them about would help us make an informed decision here. I don't like when the proposal doesn't actually solve the problem given as the motivation (but am happy to change the motivation in this case).
Also, we need to rule out PyObject_RichCompareBool as a suitable alternative (perhaps paired with PyUnicode_Check if the caller really doesn't know what types they've got, but almost certainly in most cases where they have enough calls to justify a performance optimisation they're going to be responsible for having created the values, and will know that they're strings already).
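For concreteness, here is a minimal sketch of the alternative being discussed, using only existing CPython API. The names_match helper and its TypeError policy are illustrative, not taken from the thread:

```c
#include <Python.h>

/* Compare two objects expected to be str, via the generic protocol.
 * PyObject_RichCompareBool() returns 1, 0, or -1 with an exception set. */
static int
names_match(PyObject *a, PyObject *b)
{
    /* Optional guard for callers that don't control the types. */
    if (!PyUnicode_Check(a) || !PyUnicode_Check(b)) {
        PyErr_SetString(PyExc_TypeError, "expected str");
        return -1;
    }
    return PyObject_RichCompareBool(a, b, Py_EQ);
}
```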
Not that my opinion matters much here, but I might as well share my thoughts: I think a faster comparison protocol is a great idea (and one I'd be happy to help work on), but probably better suited for api-evolution rather than this proposal. PyUnicode_Equal/PyUnicode_IsEqual is simply a convenient drop-in for extensions that were relying on the private API, nothing more, nothing less.
Though, maybe it's going a bit far to add this to the stable ABI -- the main beneficiaries from this are people who weren't using the limited API anyway, so really we're just adding another function to do the same thing, but with less version compatibility than PyUnicode_Compare. Why not mark this as an unstable API? That way, if we do add a better comparison interface in the future, then we won't have to continue maintaining PyUnicode_Equal.
Both projects use it for handling keyword arguments. I expected either that or attribute names (e.g. implementing descriptor-like behaviour in tp_getattro).
IMO, that's a better fit for the C API than, say, PyLong_IsNegative, where the uses are more application-logic-y.
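As an illustration of that use case (not code from mypy or Pyodide), matching keyword names in a vectorcall handler might look like this, assuming the proposed int PyUnicode_Equal(PyObject *, PyObject *) signature:

```c
#include <Python.h>

/* Return the index of `name` in a vectorcall kwnames tuple, or -1 if
 * absent. Entries of kwnames are guaranteed to be str, so the error
 * branch of PyUnicode_Equal() is unreachable here. */
static Py_ssize_t
find_keyword(PyObject *kwnames, PyObject *name)
{
    Py_ssize_t n = PyTuple_GET_SIZE(kwnames);
    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *kw = PyTuple_GET_ITEM(kwnames, i);  /* borrowed ref */
        if (PyUnicode_Equal(kw, name) == 1) {
            return i;
        }
    }
    return -1;
}
```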
I think locale-aware comparisons have no place in the PyUnicode_* namespace. At this level, strings are sequences of Unicode code points.
(FWIW, locale-aware equality involves a lot of computation too. If you want 'é' == 'é' to be True, you should not use a PyUnicode_* function.)
+1 for PyUnicode_Equal.
The docs should mention that the function works for str subclasses, but does not honor custom __eq__/tp_richcompare. (We don't have a lot of this kind of info in the docs, but we should start adding it.)
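A hedged sketch of what that documentation note means in practice, assuming the proposed API (demo code with error handling omitted for brevity):

```c
#include <Python.h>
#include <assert.h>

/* A str subclass overriding __eq__ is still compared by code points. */
static void
demo_subclass(void)
{
    PyObject *ns = PyDict_New();
    PyObject *r = PyRun_String(
        "class S(str):\n"
        "    def __eq__(self, other):\n"
        "        return False\n"
        "s = S('abc')\n",
        Py_file_input, ns, ns);
    Py_XDECREF(r);

    PyObject *s = PyDict_GetItemString(ns, "s");  /* borrowed ref */
    PyObject *t = PyUnicode_FromString("abc");

    /* Python level: s == 'abc' is False, because S.__eq__ wins.
       C level: PyUnicode_Equal() ignores __eq__/tp_richcompare. */
    assert(PyUnicode_Equal(s, t) == 1);

    Py_DECREF(t);
    Py_DECREF(ns);
}
```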
> API shape seems fine, I prefer PyUnicode_IsEqual as a name.
As for naming, the PyUnicode_EqualToUTF8() function was added to Python 3.13. You might want to use a similar name. But I don't have a strong opinion, English is not my first language :-)
> I find the performance argument fairly compelling
I ran a benchmark on strings of 10 characters:
- PyUnicode_Equal: Mean +- std dev: 4.57 ns +- 0.11 ns
- PyUnicode_Compare: Mean +- std dev: 5.87 ns +- 0.13 ns
- PyUnicode_RichCompare: Mean +- std dev: 6.96 ns +- 0.22 ns

PyUnicode_Equal() is the fastest function: 1.3x faster than PyUnicode_Compare(). But we're talking about nanoseconds here: it's only 1.3 ns faster than PyUnicode_Compare().
> Also, we need to rule out PyObject_RichCompareBool as a suitable alternative
It returns PyObject* which is less convenient, requires calling Py_DECREF(), and is less efficient.
> I ran a benchmark on strings of 10 characters:
On strings that are equal? Or strings that are not equal? The distribution of strings makes a big difference here. If they're all the same length and random, then all three functions will have to compare the first character. If they're different lengths, Equal will be faster, and if they share prefixes the characteristics will change again.
That said, performance in an equality function typically matters the most for when the strings are not equal, since that's going to be the more common tight loop (i.e. when you find the one that matches, you stop searching).
But if you want to say that performance isn't that compelling, then I'll say they can all switch to one of the other functions. This is an easier argument to make if you say that performance is the main reason they need it ;)
> Also, we need to rule out PyObject_RichCompareBool as a suitable alternative

> It returns PyObject* which is less convenient, requires calling Py_DECREF(), and is less efficient.
You're thinking of PyObject_RichCompare. The ...Bool version returns int.
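For readers following along, here is the difference between the two existing functions; both helpers below are sketches using only current CPython API:

```c
#include <Python.h>

/* PyObject_RichCompare() returns a new reference that must be
 * truth-tested and released. Returns 1/0, or -1 with an exception set. */
static int
eq_via_richcompare(PyObject *a, PyObject *b)
{
    PyObject *res = PyObject_RichCompare(a, b, Py_EQ);
    if (res == NULL) {
        return -1;
    }
    int eq = PyObject_IsTrue(res);
    Py_DECREF(res);
    return eq;
}

/* PyObject_RichCompareBool() folds all of that into a plain int:
 * 1, 0, or -1 with an exception set. No reference to manage. */
static int
eq_via_richcomparebool(PyObject *a, PyObject *b)
{
    return PyObject_RichCompareBool(a, b, Py_EQ);
}
```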
_PyUnicode_EQ() (and the non-exported unicode_eq()) are used in performance-critical code. They never raise an exception, because checking argument types in a tight loop, when we know that they are strings, is a waste. Users of _PyUnicode_EQ() can still prefer _PyUnicode_EQ() over PyUnicode_Equal() if the latter adds such checks.
Victor, compare also _PyUnicode_EQ() with PyUnicode_Equal().
To add confusion, there is also the much more widely used (but in less performance-critical code) _PyUnicode_Equal(). It has the same interface and semantics as _PyUnicode_EQ(), but a different implementation. I wonder which function is faster for different data.
> Victor, compare also _PyUnicode_EQ() with PyUnicode_Equal().
Here are updated benchmarks: https://github.com/python/cpython/pull/124504#issuecomment-2376471769 (now with CPU isolation).
> To add confusion, there is also the much more widely used (but in less performance-critical code) _PyUnicode_Equal(). It has the same interface and semantics as _PyUnicode_EQ(), but a different implementation. I wonder which function is faster for different data.
_PyUnicode_Equal() is 1.1x faster than _PyUnicode_EQ() (-1.0 ns) ;-)
How can PyUnicode_Equal() be faster than _PyUnicode_EQ(), which it wraps?
> How can PyUnicode_Equal() be faster than _PyUnicode_EQ(), which it wraps?
PyUnicode_Equal() is implemented with _PyUnicode_Equal().
We should modify _PyUnicode_Equal() and _PyUnicode_EQ() to always use the fastest implementation in both cases. Or even remove one of these two functions, since they are redundant. _PyUnicode_EQ() can no longer be used outside CPython: the function is not exported. Both functions are only part of the internal C API.
Thank you Victor. I think that adding PyUnicode_Equal() for convenience and performance would be nice. There are PyUnicode_EqualToUTF8() and PyUnicode_EqualToUTF8AndSize() in the limited C API, so PyUnicode_Equal() would fit in the same row.
But the key point of PyUnicode_EqualToUTF8(), PyUnicode_EqualToUTF8AndSize() and the private functions _PyUnicode_Equal() and _PyUnicode_EQ() is that they only return 0/1 and never fail. So you can use them in an expression and do not need boilerplate to check for errors. If the new PyUnicode_Equal() starts to check argument types, raise an exception and return -1, it will lose a large part of its advantage. The type check also adds 0.4 ns of overhead according to your tests.
Having different requirements and guarantees for PyUnicode_EqualToUTF8() and PyUnicode_Equal() would also be error-prone.
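A minimal sketch of the ergonomic difference being described, assuming the error-raising variant of the proposed PyUnicode_Equal(); the lookup helpers and the "timeout" literal are illustrative:

```c
#include <Python.h>

/* With a never-fails 0/1 API such as the existing limited-API
 * PyUnicode_EqualToUTF8(), the call composes directly into an expression. */
static PyObject *
lookup_style_a(PyObject *name)
{
    if (PyUnicode_EqualToUTF8(name, "timeout")) {
        return PyLong_FromLong(30);
    }
    Py_RETURN_NONE;
}

/* With a -1-on-error API, every call site needs an extra check. */
static PyObject *
lookup_style_b(PyObject *name, PyObject *key)
{
    int eq = PyUnicode_Equal(name, key);  /* 1, 0, or -1 + TypeError */
    if (eq < 0) {
        return NULL;                      /* propagate the TypeError */
    }
    if (eq) {
        return PyLong_FromLong(30);
    }
    Py_RETURN_NONE;
}
```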
> If the new PyUnicode_Equal() starts to check argument types, raise an exception and return -1, it will lose a large part of its advantage. The type check also adds 0.4 ns of overhead according to your tests.
According to you, what should be the behavior if you compare a string to an integer? Return 0? Raise an exception? Fail with a fatal error?
What if both arguments are not strings?
The important part of the API is: the function always succeeds if a and b are strings.
What is the behavior if you compare a string to NULL? What is the behavior if you compare non-canonicalized strings (for example with kind=UCS4, but all code points < 0x10000)? This is undefined behavior.
Very few functions check that the argument is not NULL. Functions usually do not check that strings or integers are in canonical form; they assume that this is true.
@zooba @encukou: @serhiy-storchaka suggests declaring that comparing objects which are not strings is undefined behavior, for best performance. I suggest checking types and raising TypeError in this case. What's your call on this question?
> What's your call on this question?
No undefined behaviour in the limited API. We must check types.
(I would lean towards no undefined behaviour in any public API, including unstable APIs, but I'm prepared to consider exceptional cases. Especially in "unstable". But definitely not in the limited API.)
> On strings that are equal? Or strings that are not equal? The distribution of strings makes a big difference here.
My benchmark was on equal strings of 10 characters. New benchmark on unequal strings of 10 characters:
PyUnicode_Equal() is always the fastest public function.
> PyUnicode_Equal() is always the fastest public function.
Not by enough to make me really excited about it. I guess my gut instinct about perf benefits didn't work this time - perhaps there are other (unavoidable?) overheads.
> Not by enough to make me really excited about it.
Sorry :-) Well, there are cases where it's more interesting.
```python
str1 = "1" * 10
str2 = "1"
```
PyUnicode_Equal() is 3.0x faster than PyUnicode_Compare(). It becomes 7.2x faster for strings of 1000 and 1001 characters. In this case, PyUnicode_Equal() is O(1) and PyUnicode_Compare() is O(n).
```python
str1 = "1" * 9 + "$"
str2 = "1" * 9 + "\u20ac"
```
PyUnicode_Equal() is 3.6x faster than PyUnicode_Compare(). It becomes 171.6x faster for strings of 1000 characters. In this case, PyUnicode_Equal() is O(1) and PyUnicode_Compare() is O(n).
Oh yeah, I forgot about that case (and definitely remembered it the first time around). Okay, consider me excited about the perf benefits again.
Under the hood, both PyUnicode_Equal and PyUnicode_RichCompare have the same O(1) complexity for strings with different lengths or kinds. PyUnicode_RichCompare adds more O(1) overhead.
Now, from your comparison of _PyUnicode_Equal and PyUnicode_Equal, the latter is 0.4-0.9 ns slower. Comparing it with your results for strings with different length or kind, PyUnicode_Equal should be 10-30% slower than _PyUnicode_Equal. This is why I complain about type checks. If we maximize performance, this is too high a cost.
As for undefined behavior, all functions in the C API have undefined behavior if passed pointers to freed memory. Many (including all functions mentioned here) have undefined behavior if passed NULL pointers (PyUnicode_Check(NULL) usually crashes). We should not be afraid of undefined behavior if the function is used improperly. It is unavoidable.
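To make the fast path discussed above concrete, here is a hedged sketch that mirrors the general shape of CPython's internal unicode_eq(), not its exact code. Because CPython keeps strings in canonical form, two equal strings always have the same kind, so both early returns below are valid short-circuits:

```c
#include <Python.h>
#include <string.h>

/* O(1) rejections before any character data is touched. */
static int
unicode_equal_sketch(PyObject *a, PyObject *b)
{
    if (PyUnicode_GET_LENGTH(a) != PyUnicode_GET_LENGTH(b)) {
        return 0;  /* e.g. "1"*10 vs "1": different lengths */
    }
    if (PyUnicode_KIND(a) != PyUnicode_KIND(b)) {
        return 0;  /* e.g. "$" stays UCS1, "\u20ac" forces UCS2 */
    }
    /* Only now compare the buffers: O(n) via memcmp. */
    return memcmp(PyUnicode_DATA(a), PyUnicode_DATA(b),
                  (size_t)PyUnicode_GET_LENGTH(a) * PyUnicode_KIND(a)) == 0;
}
```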
I made a small study of PyUnicode functions in the limited C API: 1/3 of the functions don't check types, whereas 2/3 check types. PyUnicode_GetLength(), which is commonly used, checks its argument type.
Functions that don't check argument types: 7. Functions that check argument types: 14.
> Now, from your comparison of _PyUnicode_Equal and PyUnicode_Equal, the latter is 0.4-0.9 ns slower.
Compared to PyUnicode_Compare(), the difference with PyUnicode_Equal() is that it is O(1) if the string lengths are different or if the string kinds are different.
I'm not sure that 0.4-0.9 ns is a big deal for such an API. It's already faster than PyUnicode_Compare() and PyUnicode_RichCompare() in all tested cases.
@erlend-aasland @zooba: Would you mind voting?
@erlend-aasland: What's your opinion on checking the argument types? Do you prefer raising TypeError, or undefined behavior to avoid a little performance overhead?
Voted in favour (still begrudging the name, but I suspect I'm outvoted on adding Is), but I would like to see this addition suggested by Petr:

> The docs should mention that the function works for str subclasses, but does not honor custom __eq__/tp_richcompare. (We don't have a lot of this kind of info in the docs, but we should start adding it.)
> I would like to see this addition suggested by Petr:
Oh right, I just added it to the PR to not forget.
> [Serhiy] suggests declaring that comparing objects which are not strings is undefined behavior, for best performance. I suggest checking types and raising TypeError in this case. What's your call on this question?
New API should take PyObject*, type-check, and raise TypeError if necessary; it should not check for freed pointers or NULL. It's not about avoiding undefined behaviour in general, but about what kind of safety users can expect. (In new API, passing any object as a PyObject* arg is safe; passing NULL or freed memory is not.)
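A minimal sketch of that contract, using the thread's note that the public function delegates to the internal _PyUnicode_Equal() helper; this is an illustration, not CPython's actual source:

```c
#include <Python.h>

int
PyUnicode_Equal(PyObject *str1, PyObject *str2)
{
    /* Type-check PyObject* arguments and raise TypeError; do not
       defend against NULL or freed pointers. */
    if (!PyUnicode_Check(str1)) {
        PyErr_Format(PyExc_TypeError,
                     "first argument must be str, not %.100s",
                     Py_TYPE(str1)->tp_name);
        return -1;
    }
    if (!PyUnicode_Check(str2)) {
        PyErr_Format(PyExc_TypeError,
                     "second argument must be str, not %.100s",
                     Py_TYPE(str2)->tp_name);
        return -1;
    }
    /* Never fails once both arguments are known to be str. */
    return _PyUnicode_Equal(str1, str2);
}
```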
We had this conversation around PyLong_Sign, and in api-evolution before that, and I don't think anything has changed since then. It wasn't my first choice, but that's what we decided; we should stick to it and not revisit it with each new API.
(If necessary, we can add PyUnicode_Equal_Unchecked, with a warning in the name.)
> (If necessary, we can add PyUnicode_Equal_Unchecked, with a warning in the name.)
Honestly, for less than a nanosecond, I don't think that it's worth it to have two functions.
Well, it was not a principled objection.
> @erlend-aasland @zooba: Would you mind voting?
✅
> @erlend-aasland: What's your opinion on checking the argument types? Do you prefer raising TypeError, or undefined behavior to avoid a little performance overhead?
I share Steve's view: I prefer safe APIs over UB.
All members voted in favor of PyUnicode_Equal(): the API is approved. Thanks everybody. I close the issue.
I propose to add a public PyUnicode_Equal(a, b) function to the limited C API 3.14 to replace the private _PyUnicode_EQ() function.

API:

```c
int PyUnicode_Equal(PyObject *a, PyObject *b)
```

- Return 1 if a is equal to b.
- Return 0 if a is not equal to b.
- Raise a TypeError exception and return -1 if a or b is not a Python str object.

Python 3.13 moved the private _PyUnicode_EQ() function to the internal C API. mypy and Pyodide are using it.

The existing PyUnicode_Compare() isn't enough and has an issue: PyUnicode_Compare() returns -1 for "less than" but also for the error case. The caller must call PyErr_Occurred(), which is inefficient. It causes an ambiguous return value: https://github.com/capi-workgroup/problems/issues/1

PyUnicode_Equal() has no such ambiguous return value (-1 only means error). Moreover, PyUnicode_Equal() may be a little bit faster than PyUnicode_Compare(), but I'm not sure about that.

Vote to add this API:
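To illustrate the ambiguity described above, here is a hedged sketch of equality via each API; the helper names are mine:

```c
#include <Python.h>

/* Equality via PyUnicode_Compare(): -1 is ambiguous (less-than or
 * error), so a correct caller must also consult PyErr_Occurred().
 * Returns 1/0, or -1 with an exception set. */
static int
equal_via_compare(PyObject *a, PyObject *b)
{
    int cmp = PyUnicode_Compare(a, b);
    if (cmp == -1 && PyErr_Occurred()) {
        return -1;  /* a real error, not "less than" */
    }
    return cmp == 0;
}

/* Equality via the proposed PyUnicode_Equal(): -1 only ever means
 * error, so no PyErr_Occurred() call is needed. */
static int
equal_via_equal(PyObject *a, PyObject *b)
{
    return PyUnicode_Equal(a, b);
}
```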