Open markshannon opened 4 months ago
We can replace _CHECK_FUNCTION_EXACT_ARGS
with _CHECK_FUNCTION
if we add function watchers, or use code watchers to watch for the code object being replaced.
@markshannon Would using memcpy()
and memset()
help at all? Presumably for small constant sizes this will generate optimal inlined code, and for large or variable sizes it will fall back to optimized assembly. I find it easier to read too (when I see a for
loop I have to read it to see that it's just copying data -- when I see memcpy
I already know).
Another thought is that we already know during specialization whether self_or_null
is definitively null or non-null. The specializer checks this and emits different code. It would be simple to have two variants of _CHECK_FUNCTION_EXACT_ARGS
and _INIT_CALL_PY_EXACT_ARGS
, one with self_or_null != NULL
, one with self_or_null == NULL
. Then you wouldn't have to split _INIT_CALL_PY_EXACT_ARGS
.
I like the idea of generating optimal sequences for calls with defaults and/or "complex parameters" (within reason -- the spec is fiendishly complex).
Would using memcpy() and memset() help at all?
The compiler already inserts calls to memcpy
and memset
, but that is trading one inefficiency for another.
Another thought is that we already know during specialization whether self_or_null is definitively null or non-null.
During tier 2 optimization we sometimes know, but never in tier 1.
During tier 2 optimization we sometimes know, but never in tier 1.
But we almost know in in tier 1 -- _SPECIALIZE_CALL
knows it, but doesn't pass the info explicitly to _Py_Specialize_Call()
-- it just adds it to the arg count. If it passed that as a separate flag, we could specialize based on that bit.
There a number of things we can do to speedup Python-to-Python calls, without changing the stack layout.
Faster creation of interpreter frames
Our fastest Python-to-Python call is
_INIT_CALL_PY_EXACT_ARGS
which is reasonably efficient, but could definitely be made faster. There are few issues with it._PyFrame_PushUnchecked
also contains a variable length loop.We can make the loops fixed length by:
self_or_null
self
is not NULL. Ifself_or_null
isNULL
it will then be overwritten._INIT_CALL_PY_EXACT_ARGS
into two parts, one to initialize the arguments and one toNULL
out the remaining locals. Both can be markedreplicate
to avoid the loop.Better optimization of other Py-to-Py calls in tier 2
We currently specialize the remaining Py-to-Py calls into "with defaults" and do not specialize for "code complex parameters". We should treat both the same in tier 1 as "CALL_PY", and expand the call sequence in tier2 to produce an optimal sequence of instructions.
This will probably make no difference to T1 performance, the "with defaults" case will get a tiny bit slower and the other cases might be a bit faster.
Remove
f_globals
andf_builtins
from the interpreter frameIn tier 2, we have largely eliminated access to
f_globals
andf_builtins
. We can speedup calls, without slowing down access to globals, in tier 2 if we were to remove these fields. Doing this will slowdown access to globals in tier 1, however.In order to get an overall speedup the ratio of tier 2 to tier 1 code will need to increase. Once the ratio of T2 to T1 code is 3:1 or better, it should be profitable to remove these fields.
Faster checking of stack space
https://github.com/faster-cpython/ideas/issues/620