faster-cpython / ideas

Can we split `_INIT_CALL_PY_EXACT_ARGS` further in Tier 2? #666

Open gvanrossum opened 3 months ago

gvanrossum commented 3 months ago

`_INIT_CALL_PY_EXACT_ARGS` is already quite streamlined, but we may be able to squeeze an extra bit out of it in the abstract interpreter. In many cases the abstract interpreter can know that the `self_or_null` input is either always NULL or never NULL. In those cases we could simplify to one of the following:

        // always NULL
        replicate(5) pure op(_INIT_CALL_PY_EXACT_ARGS_ALWAYS_NULL, (callable, null, args[oparg] -- new_frame: _PyInterpreterFrame*)) {
            assert(null == NULL);
            (void)null;
            STAT_INC(CALL, hit);
            PyFunctionObject *func = (PyFunctionObject *)callable;
            new_frame = _PyFrame_PushUnchecked(tstate, func, oparg);
            for (int i = 0; i < oparg; i++) {
                new_frame->localsplus[i] = args[i];
            }
        }
        // never NULL
        replicate(5) pure op(_INIT_CALL_PY_EXACT_ARGS_NEVER_NULL, (callable, self, args[oparg] -- new_frame: _PyInterpreterFrame*)) {
            assert(self != NULL);
            STAT_INC(CALL, hit);
            PyFunctionObject *func = (PyFunctionObject *)callable;
            new_frame = _PyFrame_PushUnchecked(tstate, func, oparg + 1);
            new_frame->localsplus[0] = self;
            for (int i = 0; i < oparg; i++) {
                new_frame->localsplus[i+1] = args[i];
            }
        }
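
To make the saving concrete, here is a toy model in plain C (not CPython code) comparing a combined form that tests `self_or_null` at run time against the two split forms above. The `toy_frame` type, `TOY_MAX_LOCALS`, and all `toy_*` functions are made up for illustration, and the combined form is only my reading of what the existing uop does. The point is that the split forms drop the `has_self` computation and the variable offset while filling the same slots.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a frame's locals array; NOT the real _PyInterpreterFrame. */
#define TOY_MAX_LOCALS 8
typedef struct {
    const void *localsplus[TOY_MAX_LOCALS];
    int nlocals;
} toy_frame;

/* Combined form (assumed shape): decides at run time whether self is present. */
static void toy_init_generic(toy_frame *f, const void *self_or_null,
                             const void **args, int oparg) {
    int has_self = (self_or_null != NULL);
    assert(oparg + has_self <= TOY_MAX_LOCALS);
    f->nlocals = oparg + has_self;
    f->localsplus[0] = self_or_null;       /* overwritten below if no self */
    for (int i = 0; i < oparg; i++) {
        f->localsplus[i + has_self] = args[i];
    }
}

/* Split form for the "always NULL" case: no self slot, fixed offset 0. */
static void toy_init_always_null(toy_frame *f, const void **args, int oparg) {
    assert(oparg <= TOY_MAX_LOCALS);
    f->nlocals = oparg;
    for (int i = 0; i < oparg; i++) {
        f->localsplus[i] = args[i];
    }
}

/* Split form for the "never NULL" case: self in slot 0, fixed offset 1. */
static void toy_init_never_null(toy_frame *f, const void *self,
                                const void **args, int oparg) {
    assert(self != NULL);
    assert(oparg + 1 <= TOY_MAX_LOCALS);
    f->nlocals = oparg + 1;
    f->localsplus[0] = self;
    for (int i = 0; i < oparg; i++) {
        f->localsplus[i + 1] = args[i];
    }
}

/* Returns 1 if both frames hold the same pointers in the same live slots. */
static int toy_frames_equal(const toy_frame *a, const toy_frame *b) {
    if (a->nlocals != b->nlocals) {
        return 0;
    }
    for (int i = 0; i < a->nlocals; i++) {
        if (a->localsplus[i] != b->localsplus[i]) {
            return 0;
        }
    }
    return 1;
}
```

Filling one frame with `toy_init_generic` and another with the matching split form, then comparing them with `toy_frames_equal`, should give identical frames in both the NULL and non-NULL cases.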

It does cost about 10 extra uop instructions, but we seem to have about 77 left (more, if we lower the starting point below 300). It also costs extra special-casing in the abstract interpreter. But the JIT templates ought to become even smaller.
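
The extra special-casing in the abstract interpreter could look roughly like the sketch below. Everything in it is hypothetical: the `self_nullness` enum, the `uop_id` values, and `specialize_init_call` are stand-ins I made up to show the shape of the decision, not the optimizer's real data structures or API. The generic uop stays as the fallback whenever nothing is known about `self_or_null`.

```c
/* Hypothetical three-state fact the abstract interpreter might track
 * for the self_or_null stack entry; not CPython's actual representation. */
typedef enum {
    SELF_UNKNOWN,      /* nothing proven: keep the generic uop */
    SELF_ALWAYS_NULL,  /* proven NULL on every path */
    SELF_NEVER_NULL    /* proven non-NULL on every path */
} self_nullness;

/* Hypothetical uop identifiers, named after the ops in this issue. */
typedef enum {
    INIT_CALL_PY_EXACT_ARGS,
    INIT_CALL_PY_EXACT_ARGS_ALWAYS_NULL,
    INIT_CALL_PY_EXACT_ARGS_NEVER_NULL
} uop_id;

/* Pick the most specific uop that the known fact allows. */
static uop_id specialize_init_call(self_nullness known) {
    switch (known) {
    case SELF_ALWAYS_NULL:
        return INIT_CALL_PY_EXACT_ARGS_ALWAYS_NULL;
    case SELF_NEVER_NULL:
        return INIT_CALL_PY_EXACT_ARGS_NEVER_NULL;
    case SELF_UNKNOWN:
    default:
        return INIT_CALL_PY_EXACT_ARGS;
    }
}
```

Keeping the unknown case mapped to the existing uop means the rewrite is purely opportunistic: traces where nullness cannot be proven are unchanged.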

@markshannon @brandtbucher

brandtbucher commented 3 months ago

Template sizes:

So a 10-20% reduction in size.