Open gvanrossum opened 3 months ago
FWIW, we already do this with list.append (`CALL_LIST_APPEND`). The same pattern could be followed for any other builtin methods without too much hassle.
This is in the debug build, but `startswith` is actually faster:
$ ./python -m timeit -s 's = "abcdef"' 's.startswith("abc")'
2000000 loops, best of 5: 189 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[:3] == "abc"'
1000000 loops, best of 5: 283 nsec per loop
I think @JelleZijlstra showed that in a fully optimized build it is still slower. I repeated his experiments and got the same results:
(.venv) ~$ python3.13 -m timeit -s 'x = "......"' 'x[:3] == "..."'
10000000 loops, best of 5: 32 nsec per loop
(.venv) ~$ python3.13 -m timeit -s 'x = "......"' 'x[:3].startswith("...")'
5000000 loops, best of 5: 64.4 nsec per loop
(.venv) ~$ python3.13 -V
Python 3.13.0a5
EDIT: The second timing is wrong, see corrected timing below.
Your second timing does both the slice and method call. :)
> Your second timing does both the slice and method call. :)
Oof. :-( But it's still slower: 40 vs. 32:
(.venv) ~$ python3.13 -m timeit -s 'x = "......"' 'x.startswith("...")'
5000000 loops, best of 5: 40.3 nsec per loop
Yes, for comparison with https://github.com/faster-cpython/ideas/issues/671#issuecomment-2030202497, on the same computer:
$ ./python -m timeit -s 's = "abcdef"' 's.startswith("abc")'
5000000 loops, best of 5: 84.3 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[:3] == "abc"'
5000000 loops, best of 5: 65.9 nsec per loop
`startswith` is 25-30% slower.
The difference is larger for a 1-character prefix (because the overhead of making a 1-character substring is much smaller):
$ ./python -m timeit -s 's = "abcdef"' 's.startswith("a")'
5000000 loops, best of 5: 83.6 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[:1] == "a"'
5000000 loops, best of 5: 51 nsec per loop
$ ./python -m timeit -s 's = "abcdef"' 's[0] == "a"'
10000000 loops, best of 5: 22.3 nsec per loop
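One caveat worth keeping in mind alongside these numbers (not raised in the thread itself): the three spellings are not semantically interchangeable on edge cases. A small sketch:

```python
# The slicing and startswith idioms agree even on an empty string,
# but the indexing idiom raises.
s = ""

assert (s[:1] == "a") is False   # slicing never raises
assert s.startswith("a") is False

try:
    s[0] == "a"                  # indexing an empty string fails
except IndexError:
    pass
else:
    raise AssertionError("expected IndexError")
```

So `s[0] == "a"` is only a drop-in replacement when `s` is known to be non-empty.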
There is also some overhead in the argument parsing for `startswith`. On my system:
(env312) python -m timeit -s "s = 'abcdef'" "s.startswith('abc')"
5000000 loops, best of 5: 63.8 nsec per loop
(env312) python -m timeit -s "s = 'abcdef'" "s[:3] == 'abc'"
5000000 loops, best of 5: 47.6 nsec per loop
Adding a simple fast path for the single-argument case of `startswith` results in:
(env312) python -m timeit -s "s = 'abcdef'" "s.startswith('abc')"
5000000 loops, best of 5: 42.3 nsec per loop
The fast path I implemented is https://github.com/python/cpython/compare/main...eendebakpt:cpython:startswith. That might not be the best approach, but this was just a quick test. The `asciilib_parse_args_finds` helper could perhaps also be improved, for example by eliminating the copy at https://github.com/python/cpython/blob/c741ad3537193c63fe697a8f0316aecd45eeb9ba/Objects/stringlib/find.h#L97
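The shell one-liners in this thread can also be reproduced programmatically with the `timeit` module, which makes it easier to rerun both idioms in one script (a sketch, not from the thread; absolute numbers will differ per build and machine):

```python
import timeit

setup = "s = 'abcdef'"

# Best-of-N, matching what `python -m timeit` reports.
t_method = min(timeit.repeat("s.startswith('abc')", setup=setup,
                             number=100_000, repeat=3))
t_slice = min(timeit.repeat("s[:3] == 'abc'", setup=setup,
                            number=100_000, repeat=3))

print(f"startswith: {t_method:.4f}s  slice+eq: {t_slice:.4f}s")
```

Which idiom wins depends on the interpreter version and build flags, as the differing results in this thread show.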
> The reason is that startswith requires a name lookup.
This is not true. It is the call that is slow, not the attribute lookup.
The attribute lookup gets specialized to `LOAD_ATTR_METHOD_NO_DICT`, which is as fast as an attribute lookup can be (in tier 1).
The call is slow for two reasons.
`str.startswith` is a `METH_VARARGS` function. It should be `METH_FASTCALL`.
def foo(s):
for _ in range(1000):
s.startswith("foo")
>>> dis.dis(foo, adaptive=True)
1 RESUME_CHECK 0
2 LOAD_GLOBAL 1 (range + NULL)
LOAD_CONST 1 (1000)
CALL 1
GET_ITER
L1: FOR_ITER_RANGE 20 (to L2)
STORE_FAST 1 (_)
3 LOAD_FAST 0 (s)
LOAD_ATTR_METHOD_NO_DICT 3 (startswith + NULL|self)
LOAD_CONST 2 ('foo')
CALL 1
POP_TOP
JUMP_BACKWARD 22 (to L1)
2 L2: END_FOR
POP_TOP
RETURN_CONST 0 (None)
We should probably replace all `METH_VARARGS` with `METH_FASTCALL`, in the long run.
In the short term, we should change the methods on builtin objects.
git grep METH_VARARGS | grep "Objects/"
gives me 49 results, so it wouldn't be that much work.
> We should probably replace all METH_VARARGS with METH_FASTCALL, in the long run.
FYI: With python/cpython#107984, Argument Clinic got the ability to deprecate keyword use of parameters. Oh, they're already positional-only. Sorry for the noise.
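The positional-only behaviour mentioned here is easy to check from Python (a quick sketch, not from the thread): `str.startswith` rejects keyword arguments outright.

```python
# startswith's parameters are positional-only.
assert "abcdef".startswith("abc", 0, 6)   # prefix, start, end by position

try:
    "abcdef".startswith(prefix="abc")     # keyword use raises
except TypeError:
    pass
else:
    raise AssertionError("expected TypeError")
```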
`METH_FASTCALL` gives a significant speedup[^1]:
$ ./python.exe -m timeit -s "s = 'abcdef'" "s[:3] == 'abc'"
2000000 loops, best of 5: 177 nsec per loop
$ ./python.exe -m timeit -s "s = 'abcdef'" "s.startswith('abc')"
5000000 loops, best of 5: 58.3 nsec per loop
[^1]: debug build
@eendebakpt Can you turn that into a PR then?
> @eendebakpt Can you turn that into a PR then?
Yes, I will pick that up.
@erlend-aasland Did you use argument clinic in testing the METH_FASTCALL?
@eendebakpt: yes. See my draft PR.
Resolved by:
Argument Clinic (and METH_FASTCALL) FTW.
@eendebakpt is also working on further optimisations in stringlib 🚀
With the recent conversion to vectorcall by Erlend there is not much to gain in the implementation of `startswith` (and `endswith`). The computation time for `x.startswith('a')` can be split into the following components (all numbers are rough estimates, details in the section below):

- `startswith` on `x`: 2.7 ns
- `'a'`: 1.5 ns
- `start` and `end` parameters, type checking: 2.6 ns

The implementation of the actual string comparison can be improved a bit, for the single-character case a few ns. See #117782
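One way to separate the attribute-lookup component from the call component at the Python level is to pre-bind the method, so the lookup is paid once in setup rather than on every iteration (a rough sketch, not from the thread; the numbers here are illustrative only):

```python
import timeit

setup = "x = 'abcdef'"

# Full expression: attribute lookup + call on every iteration.
t_full = min(timeit.repeat("x.startswith('a')", setup=setup,
                           number=100_000, repeat=3))

# Pre-bound method: the lookup happens once in setup,
# so the loop measures only the call itself.
t_bound = min(timeit.repeat("m('a')", setup=setup + "\nm = x.startswith",
                            number=100_000, repeat=3))

print(f"lookup+call: {t_full:.4f}s  call only: {t_bound:.4f}s")
```

The difference between the two timings gives a crude estimate of the per-iteration lookup cost.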
Performance of `x.startswith('a', 0, 10)` is about 54.3 ns, which seems high. A large part of that time is spent preparing the three arguments. A small part is in the argument parsing, in particular the calls to `_PyEval_SliceIndex`. That function can be improved, since in the common path (e.g. indices of type `int`) we can avoid an incref/decref and some type checks.
With this https://github.com/python/cpython/compare/main...eendebakpt:cpython:pynumber performance improves:
x.startswith('a', 0, 10): Mean +- std dev: [main] 54.4 ns +- 1.1 ns -> [p] 52.3 ns +- 0.3 ns: 1.04x faster
The performance gain is not very large, but `PyNumber_AsSsize_t` is used in many places, so it might be worthwhile. The implementation is short and localized, but not very pretty, as it adds an ad-hoc variable `do_decref`. I am not yet sure how to implement this better, so feedback is welcome.
People keep discovering that, embarrassingly, checking whether a string starts with a given substring is slower with `s.startswith("foo")` than with `s[:3] == "foo"`. The reason is that `startswith` requires a name lookup. I think we now have the machinery to special-case selected built-in methods in the specializer? This would be a good candidate.
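For completeness, a note on why the two idioms are not always interchangeable even when the slice version is faster (a side observation, not from the thread): `startswith` also accepts a tuple of prefixes, which the slicing idiom cannot express in a single comparison.

```python
s = "foobar"

# For a single fixed prefix the two idioms agree.
assert s.startswith("foo") == (s[:3] == "foo")

# But startswith also takes a tuple of candidate prefixes...
assert s.startswith(("foo", "bar"))
assert not s.startswith(("baz", "qux"))
```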