Performance of accessing consts from localsplus

iritkatriel commented 1 year ago

I created this branch which:

copies the consts to the end of the localsplus array as we plan to do for the regmachine
makes LOAD_CONST and KW_NAMES take the consts from there
does not have a consts local in the ceval loop

The benchmark results came back with a 2% slowdown. Did I run them correctly? Can we understand/improve that result?

iritkatriel commented 1 year ago

It seems that about 11-12% of the code objects never access any consts. Maybe the compiler can detect this and not copy the consts in that case?

iritkatriel commented 1 year ago

Note that we often have a consts array with just the docstring in this case. We considered in the past to move the docstring off the co_consts into a separate field.

gvanrossum commented 1 year ago

We're going to replace PyTuple_GetSize() with PyTuple_GET_SIZE() and re-run the benchmarks.

We also have a few other ideas:

We could reduce the size of the constants array by e.g. having a RETURN_NONE opcode. We speculate that this is often the only constant used in a function (needs checking, all we know is that None is by far the most common constant).
We could lazily copy the constants array, e.g. the compiler could emit a INITIALIZE_CONSTANTS instruction just before the first instruction that needs a constant. This requires some control flow analysis. It would be useful in functions that have a fast path that doesn't use constants and a slow path (e.g. cache miss or error) that does.
After that we could try to remove the docstring from the constants tuple (we had a discussion about that in the past, it was possible but a bit complex and we decided not to do it, maybe it's worth it now).

iritkatriel commented 1 year ago

With PyTuple_GET_SIZE it's 3% slower.

gvanrossum commented 1 year ago

Odd. That sounds like benchmark noise.

Is there an option to force re-running the "reference" benchmark (i.e. where your branch forked off main)? @mdboom

Anyway, we can talk tomorrow about implications of this negative result.

mdboom commented 1 year ago

Is there an option to force re-running the "reference" benchmark (i.e. where your branch forked off main)? @mdboom

It doesn't ever re-run a base commit if it already has that commit hash. However, if you specify the base's commit hash directly in its own run, it will overwrite the old results. (I'm not saying that's good UI design, but it would work with what we have today).

iritkatriel commented 1 year ago

To isolate the cost of copying the consts to the frame locals array, I ran the benchmarks over a branch where we do that, but LOAD_CONST and KW_NAMES still take the value from co_consts as before. This showed a 2% slowdown.

To check if this is benchmarking noise, I am now running the benchmarks on a branch in my repo with no changes compared to main.

iritkatriel commented 1 year ago

I think I've been looking at this in the wrong way.

Copying the consts in bulk at the beginning of the function is supposed to replace all the LOAD_CONST instructions. The performance gain comes from (1) copying each const once, no matter how many times it is accessed and (2) copying them in bulk.

So while it is perhaps not realistic to think that copying the consts will somehow be free, it is cheaper than all the individual LOAD_CONSTs, and we will see a gain once we remove them.

iritkatriel commented 1 year ago

The idea of a bytecode that lazily initialises the consts before the first const is loaded is not easy to get right/efficient. Suppose that the first const access is in a loop. Maybe the loop won't execute at all, so we don't want an INITIALIZE_CONSTANTS before the loop. But we also don't want to execute INITIALIZE_CONSTANTS inside the loop.

gvanrossum commented 1 year ago

Couldn’t it have a flag on the frame that it’s done it already?

But agreed we should put that off until we need it.

iritkatriel commented 1 year ago

Couldn’t it have a flag on the frame that it’s done it already?

Sure, it won't memcpy again but you still pay for dispatch and checking the flag.

gvanrossum commented 1 year ago

Yup, so better lift it out of the loop, even if sometimes it doesn’t iterate.

faster-cpython / ideas

Performance of accessing consts from localsplus #542