faster-cpython / ideas

1.67k stars 49 forks source link

Performance of accessing consts from localsplus #542

Closed iritkatriel closed 1 year ago

iritkatriel commented 1 year ago

I created this branch which:

The benchmark results came back with a 2% slowdown. Did I run them correctly? Can we understand/improve that result?

iritkatriel commented 1 year ago

It seems that about 11-12% of the code objects never access any consts. Maybe the compiler can detect this and not copy the consts in that case?

iritkatriel commented 1 year ago

Note that we often have a consts array with just the docstring in this case. We considered in the past to move the docstring off the co_consts into a separate field.

gvanrossum commented 1 year ago

We're going to replace PyTuple_GetSize() with PyTuple_GET_SIZE() and re-run the benchmarks.

We also have a few other ideas:

iritkatriel commented 1 year ago

With PyTuple_GET_SIZE it's 3% slower.

gvanrossum commented 1 year ago

Odd. That sounds like benchmark noise.

Is there an option to force re-running the "reference" benchmark (i.e. where your branch forked off main)? @mdboom

Anyway, we can talk tomorrow about implications of this negative result.

mdboom commented 1 year ago

Is there an option to force re-running the "reference" benchmark (i.e. where your branch forked off main)? @mdboom

It doesn't ever re-run a base commit if it already has that commit hash. However, if you specify the base's commit hash directly in its own run, it will overwrite the old results. (I'm not saying that's good UI design, but it would work with what we have today).

image

iritkatriel commented 1 year ago

To isolate the cost of copying the consts to the frame locals array, I ran the benchmarks over a branch where we do that, but LOAD_CONST and KW_NAMES still take the value from co_consts as before. This showed a 2% slowdown.

To check if this is benchmarking noise, I am now running the benchmarks on a branch in my repo with no changes compared to main.

iritkatriel commented 1 year ago

I think I've been looking at this in the wrong way.

Copying the consts in bulk at the beginning of the function is supposed to replace all the LOAD_CONST instructions. The performance gain comes from (1) copying each const once, no matter how many times it is accessed and (2) copying them in bulk.

So while it is perhaps not realistic to think that copying the consts will somehow be free, it is cheaper than all the individual LOAD_CONSTs, and we will see a gain once we remove them.

iritkatriel commented 1 year ago

The idea of a bytecode that lazily initialises the consts before the first const is loaded is not easy to get right/efficient. Suppose that the first const access is in a loop. Maybe the loop won't execute at all, so we don't want an INITIALIZE_CONSTANTS before the loop. But we also don't want to execute INITIALIZE_CONSTANTS inside the loop.

gvanrossum commented 1 year ago

Couldn’t it have a flag on the frame that it’s done it already?

But agreed we should put that off until we need it.

iritkatriel commented 1 year ago

Couldn’t it have a flag on the frame that it’s done it already?

Sure, it won't memcpy again but you still pay for dispatch and checking the flag.

gvanrossum commented 1 year ago

Yup, so better lift it out of the loop, even if sometimes it doesn’t iterate.