faster-cpython / ideas


Speeding up Python-to-Python calls. #661

Open markshannon opened 4 months ago

markshannon commented 4 months ago

There are a number of things we can do to speed up Python-to-Python calls without changing the stack layout.

Faster creation of interpreter frames

Our fastest Python-to-Python call is _INIT_CALL_PY_EXACT_ARGS, which is reasonably efficient but could definitely be made faster. There are a few issues with it.
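For context, here is a minimal plain-C model (illustrative names and layout, not CPython's actual code) of the kind of work frame initialization does: copy the arguments into the new frame's localsplus array, then NULL-fill the remaining local slots. Both loops have variable trip counts, which limits how well the C compiler can optimize them.

```c
#include <stddef.h>

/* Hypothetical model of interpreter frame initialization, not CPython's
 * actual code: arguments are copied into the frame's localsplus array
 * and the remaining local slots are cleared. */
typedef struct {
    void *localsplus[16];   /* locals + eval stack, simplified */
} fake_frame;

static void
init_frame(fake_frame *f, void **args, size_t nargs, size_t nlocals)
{
    for (size_t i = 0; i < nargs; i++) {        /* copy arguments */
        f->localsplus[i] = args[i];
    }
    for (size_t i = nargs; i < nlocals; i++) {  /* clear remaining locals */
        f->localsplus[i] = NULL;
    }
}
```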

We can make the loops fixed length by:

Better optimization of other Py-to-Py calls in tier 2

We currently specialize the remaining Py-to-Py calls for the "with defaults" case and do not specialize calls with "complex parameters". We should treat both the same in tier 1 as "CALL_PY", and expand the call sequence in tier 2 to produce an optimal sequence of instructions.

This will probably make no difference to T1 performance: the "with defaults" case will get a tiny bit slower and the other cases might be a bit faster.

Remove f_globals and f_builtins from the interpreter frame

In tier 2, we have largely eliminated access to f_globals and f_builtins. If we were to remove these fields, we could speed up calls without slowing down access to globals in tier 2. Doing this will slow down access to globals in tier 1, however.

In order to get an overall speedup, the ratio of tier 2 to tier 1 code will need to increase. Once the ratio of T2 to T1 code is 3:1 or better, it should be profitable to remove these fields.

Faster checking of stack space

https://github.com/faster-cpython/ideas/issues/620

markshannon commented 4 months ago

We can replace _CHECK_FUNCTION_EXACT_ARGS with _CHECK_FUNCTION if we add function watchers, or use code watchers to watch for the code object being replaced.
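To illustrate the watcher idea with a hedged plain-C model (nothing here is CPython's actual watcher API): when a function's code object is replaced, a callback invalidates any call sites that were specialized against the old code, so a cheaper guard like _CHECK_FUNCTION can suffice at call time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of code-watcher invalidation; the real mechanism
 * would use CPython's function/code watcher C APIs. Each specialized
 * call site records the code object it assumed. */
typedef struct {
    const void *code;   /* code object the site was specialized for */
    bool valid;         /* cleared when that code object is replaced */
} call_site;

/* Watcher callback: fired when old_code is replaced on some function;
 * invalidates every call site that assumed it. */
static void
on_code_modified(call_site *sites, size_t n, const void *old_code)
{
    for (size_t i = 0; i < n; i++) {
        if (sites[i].code == old_code) {
            sites[i].valid = false;   /* force re-specialization */
        }
    }
}
```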

markshannon commented 4 months ago

Also https://github.com/faster-cpython/ideas/issues/577

gvanrossum commented 4 months ago

@markshannon Would using memcpy() and memset() help at all? Presumably for small constant sizes this will generate optimal inlined code, and for large or variable sizes it will fall back to optimized assembly. I find it easier to read too (when I see a for loop I have to read it to see that it's just copying data -- when I see memcpy I already know).
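As a hedged sketch of this suggestion (using an illustrative frame struct, not CPython's real one), the two variable-length loops of frame setup collapse to a memcpy plus a memset:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative sketch, not CPython's actual code: argument copying and
 * local clearing expressed via memcpy/memset. For small constant sizes
 * compilers typically inline these; for variable sizes they fall back
 * to the optimized library routines. */
typedef struct {
    void *localsplus[16];
} fake_frame;

static void
init_frame_mem(fake_frame *f, void **args, size_t nargs, size_t nlocals)
{
    memcpy(f->localsplus, args, nargs * sizeof(void *));
    /* memset writes all-zero bytes, which represents NULL on any
     * platform CPython supports. */
    memset(f->localsplus + nargs, 0, (nlocals - nargs) * sizeof(void *));
}
```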

Another thought is that we already know during specialization whether self_or_null is definitively null or non-null. The specializer checks this and emits different code. It would be simple to have two variants of _CHECK_FUNCTION_EXACT_ARGS and _INIT_CALL_PY_EXACT_ARGS, one with self_or_null != NULL, one with self_or_null == NULL. Then you wouldn't have to split _INIT_CALL_PY_EXACT_ARGS.
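A hedged sketch of that split (illustrative names, not real uop implementations): since specialization already knows whether self_or_null is NULL, it can emit one of two variants and drop the run-time branch entirely.

```c
#include <stddef.h>

/* Hypothetical sketch of emitting two specialized variants instead of
 * one instruction that branches on self_or_null at run time. */
typedef struct {
    void *localsplus[16];
} fake_frame;

/* Variant for self_or_null != NULL: self occupies slot 0. */
static void
init_call_with_self(fake_frame *f, void *self, void **args, size_t nargs)
{
    f->localsplus[0] = self;
    for (size_t i = 0; i < nargs; i++) {
        f->localsplus[i + 1] = args[i];
    }
}

/* Variant for self_or_null == NULL: arguments start at slot 0. */
static void
init_call_no_self(fake_frame *f, void **args, size_t nargs)
{
    for (size_t i = 0; i < nargs; i++) {
        f->localsplus[i] = args[i];
    }
}
```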

I like the idea of generating optimal sequences for calls with defaults and/or "complex parameters" (within reason -- the spec is fiendishly complex).

markshannon commented 3 months ago

Would using memcpy() and memset() help at all?

The compiler already inserts calls to memcpy and memset, but that is trading one inefficiency for another.

Another thought is that we already know during specialization whether self_or_null is definitively null or non-null.

During tier 2 optimization we sometimes know, but never in tier 1.

gvanrossum commented 3 months ago

During tier 2 optimization we sometimes know, but never in tier 1.

But we almost know it in tier 1 -- _SPECIALIZE_CALL knows it, but doesn't pass the info explicitly to _Py_Specialize_Call() -- it just adds it to the arg count. If it passed that as a separate flag, we could specialize based on that bit.
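A hedged sketch of that signature change (names are illustrative, not CPython's real specializer): pass the self bit alongside the count instead of folding it in, so the specializer can pick a variant from it directly.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative specializer signatures, not CPython's actual API. */
typedef enum {
    SPEC_CALL_PY_NO_SELF,
    SPEC_CALL_PY_WITH_SELF,
} spec_opcode;

/* Current style (modeled): self-ness is folded into the count, so the
 * specializer can no longer tell 2 args + self from 3 plain args. */
static size_t
pack_total_args(size_t nargs, bool has_self)
{
    return nargs + (has_self ? 1 : 0);
}

/* Suggested style: the bit travels as a separate flag, letting the
 * specializer emit a variant based on it. */
static spec_opcode
specialize_call(size_t nargs, bool has_self)
{
    (void)nargs;
    return has_self ? SPEC_CALL_PY_WITH_SELF : SPEC_CALL_PY_NO_SELF;
}
```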