faster-cpython / ideas

Investigation: Why are pystats changing in unexpected ways? #652

Closed · mdboom closed this 4 months ago

mdboom commented 4 months ago

We are still seeing unexpected results in the pystats diffs.

@markshannon suggested I look at a recent PR to add a globals-to-constants pass, where there should be some changes, but not to the level that we are seeing. The original stats diff for that PR didn't include the per-benchmark results, so I re-ran it.

These two sets of results (Mark's run, and my later run of the same commits) are in strong agreement, so the differences don't seem attributable to randomness or anything that changes between runs. I also ruled out problems with summation (i.e. the overall totals not matching the sum of the per-benchmark results). And I don't think there is cross-benchmark contamination -- each benchmark is run with a separate invocation of pyperformance, and the /tmp/py_stats directory is empty in between (I added some asserts to the run to confirm this).

Drilling down on the numbers, the most changed uop in terms of execution count is TO_BOOL_ALWAYS_TRUE:

| Name | Base Count | Head Count | Change |
|---|---|---|---|
| TO_BOOL_ALWAYS_TRUE | 12,145,706 | 30,824,186 | 153.8% |

This difference is entirely attributable to two benchmarks:

| Benchmark | Base | Head |
|---|---|---|
| go | 5,840 | 129,400 |
| pycparser | 11,120,400 | 29,675,320 |

The go one is nice to work with because it has no dependencies. Running that benchmark 10 times against the head and base branches produces these numbers exactly every time, so I don't think there is anything non-deterministic in the benchmark.

The other thing that I think @markshannon mentioned should be completely unchanged by the PR is the number of optimization attempts.

There are many more benchmarks that contribute to this change:

| Benchmark | Base | Head |
|---|---|---|
| async_generators | 1,060 | 1,260 |
| asyncio_websockets | 420 | 480 |
| concurrent_imap | 4,462 | 4,465 |
| dask | 4,274 | 4,249 |
| deltablue | 440 | 18,900 |
| docutils | 11,920 | 11,980 |
| genshi | 35,560 | 35,640 |
| go | 860 | 74,920 |
| html5lib | 1,020 | 1,040 |
| mypy2 | 16,536 | 16,597 |
| pycparser | 1,200 | 3,320 |
| regex_v8 | 1,560 | 2,340 |
| sqlglot | 3,280 | 3,320 |
| sqlglot_optimize | 5,160 | 5,220 |
| sqlglot_parse | 380 | 440 |
| sqlglot_transpile | 1,340 | 1,400 |
| sympy | 13,798 | 13,903 |
| tornado_http | 1,080 | 1,140 |
| typing_runtime_protocols | 700 | 780 |

Again, looking at the go benchmark, I can reproduce these numbers exactly locally in isolation.

Since "optimization attempts" are counted in "JUMP_BACKWARD" (when reaching a threshold), I also compared that, and I get the following Tier 1 counts for "JUMP_BACKWARD":

| | Base | Head |
|---|---|---|
| Optimization attempts | 860 | 74,920 |
| JUMP_BACKWARD | 14,860 | 28,402,880 |

These numbers are not proportional, but they do at least move in the same direction.
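
For what it's worth, my mental model of the relationship (a simplified Python sketch, not the actual implementation in the interpreter; the names and the threshold value below are made up) is that each Tier 1 execution of JUMP_BACKWARD bumps a per-site warmup counter, and an optimization attempt only happens when that counter crosses a threshold:

```python
# Toy model of how "Optimization attempts" relate to JUMP_BACKWARD executions.
# Names and the threshold value are illustrative, not CPython's actual code.
from collections import defaultdict

JUMP_BACKWARD_THRESHOLD = 16              # made-up warmup threshold

jump_backward_count = 0                   # the Tier 1 "JUMP_BACKWARD" stat
optimization_attempts = 0                 # the "Optimization attempts" stat
warmup = defaultdict(int)                 # per-site counter (in CPython this
                                          # would live in the instruction's cache)

def jump_backward(site):
    """Executed every time the backward jump at `site` is taken in Tier 1."""
    global jump_backward_count, optimization_attempts
    jump_backward_count += 1
    warmup[site] += 1
    if warmup[site] >= JUMP_BACKWARD_THRESHOLD:
        optimization_attempts += 1        # try to project and attach a trace
        warmup[site] = 0
```

Under that model, a site whose executor keeps getting invalidated falls back to Tier 1 and has to warm up all over again, which would explain both counts rising together without being strictly proportional.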

I did confirm the obvious: the benchmark is doing the same amount of work and running the same number of times in both cases (just by adding prints and counting).

I'm completely stumped as to why that PR changes the number of JUMP_BACKWARD and thus optimization attempts -- it doesn't seem like that should be affected at all. But it does seem like that could be the cause of a lot of changes "downstream".

I've created a gist to reproduce this that may be helpful. Given the path to a CPython checkout with a --enable-pystats build, it runs the go benchmark and reports the number of optimization attempts and executions of JUMP_BACKWARD.
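
For reference, a rough sketch of the same idea (an approximation, not the gist itself; the key names it matches, the `python` binary location, and the way the benchmark is invoked are assumptions):

```python
#!/usr/bin/env python3
"""Rough reproduction sketch: run the go benchmark on a --enable-pystats
build and sum the interesting counters from /tmp/py_stats.  Assumes the
raw files there are plain "key : value" lines (the format consumed by
Tools/scripts/summarize_stats.py)."""
import pathlib
import subprocess
import sys
from collections import Counter

STATS_DIR = pathlib.Path("/tmp/py_stats")
python = pathlib.Path(sys.argv[1]) / "python"     # the --enable-pystats build

# Start from a clean slate so only this run's stats are counted.
STATS_DIR.mkdir(exist_ok=True)
for f in STATS_DIR.iterdir():
    f.unlink()

# Run the benchmark directly (no pyperf) so nothing else writes stats.
# "run_go.py" is a stand-in for however you invoke the go benchmark.
subprocess.run([str(python), "run_go.py"], check=True)

# Each exiting interpreter dumps one stats file; sum matching counters.
counters = Counter()
for stats_file in STATS_DIR.iterdir():
    for line in stats_file.read_text().splitlines():
        key, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] += int(value)

for key, value in sorted(counters.items()):
    if "Optimization attempts" in key or "JUMP_BACKWARD" in key:
        print(f"{key}: {value:,}")
```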

mdboom commented 4 months ago

Ripping pyperf out of the mix, the difference in the branches is still visible.

Head:

Optimization attempts: 2700 jump backwards: 336462

Base:

Optimization attempts: 43 jump backwards: 734

gvanrossum commented 4 months ago

Assuming the diff is just between Mark's PR and its immediate base revision, doesn't this point in the direction of some unintended effect in the PR?

mdboom commented 4 months ago

> Assuming the diff is just between Mark's PR and its immediate base revision, doesn't this point in the direction of some unintended effect in the PR?

I think it does, but it's not obvious to me where that is. I'm hoping Mark has an idea or can suggest something to look at next.

brandtbucher commented 4 months ago

It looks like go modifies globals a lot (specifically, TIMESTAMP and MOVES). That would interact badly with Mark’s PR, which imposes limits on how many global value modifications tier two will tolerate before “giving up”.
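
For reference, the pattern looks roughly like this (simplified; the full diff of the benchmark appears later in the thread):

```python
# Simplified version of the pattern in go: the hot path mutates
# module-level globals on every single move.
TIMESTAMP = 0
MOVES = 0

def make_move():            # stand-in for the benchmark's real move method
    global TIMESTAMP, MOVES
    TIMESTAMP += 1          # each increment is a modification of the
    MOVES += 1              # module's globals dict, which the PR watches
```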

mdboom commented 4 months ago

Thanks so much for that, @brandtbucher. That makes a lot of sense.

I think the following would probably be useful:

  1. A pystats counter for executors being invalidated when globals change (and for any other events that cause executors to be invalidated).

  2. A new version of the go benchmark that doesn't modify globals to compare to. (It's a separate question whether we ship that in pyperformance).

  3. It would be interesting (if possible) to measure the amount of "churn of the same trace". In other words, we measure the number of traces identified ("Optimizations attempted"), but we don't measure when the same JUMP_BACKWARD is repeatedly re-identified. This probably requires adding a cache entry to JUMP_BACKWARD to store this data (when pystats is on); a rough sketch of the bookkeeping follows below.
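
A very rough Python-level sketch of that bookkeeping (purely illustrative; in CPython it would live in the instruction's inline cache and be dumped with the rest of pystats):

```python
# Toy illustration of measuring "churn of the same trace": how often the
# same JUMP_BACKWARD site is re-selected for optimization.
from collections import Counter

attempts_per_site = Counter()

def record_optimization_attempt(code, offset):
    """Call each time a trace is projected starting at (code, offset)."""
    attempts_per_site[(code, offset)] += 1

def churn_report():
    total = sum(attempts_per_site.values())
    unique_sites = len(attempts_per_site)
    # Sites attempted more than once are being re-identified over and over.
    churned = sum(n - 1 for n in attempts_per_site.values() if n > 1)
    return {
        "Optimizations attempted": total,
        "Unique JUMP_BACKWARD sites": unique_sites,
        "Repeat attempts (churn)": churned,
    }
```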

mdboom commented 4 months ago

Some interesting things to report:

I modified the go benchmark to not mutate globals.

Diff to the `go` benchmark so it does not modify globals:

```diff
20,21d19
< TIMESTAMP = 0
< MOVES = 0
32a31,38
> class State:
>     def __init__(self):
>         self.timestamp = 0
>         self.moves = 0
>
> STATE = State()
>
>
38,39c44,45
<         self.timestamp = TIMESTAMP
<         self.removestamp = TIMESTAMP
---
>         self.timestamp = STATE.timestamp
>         self.removestamp = STATE.timestamp
52,54c58,59
<         global TIMESTAMP, MOVES
<         TIMESTAMP += 1
<         MOVES += 1
---
>         STATE.timestamp += 1
>         STATE.moves += 1
79c84
<         self.removestamp = TIMESTAMP
---
>         self.removestamp = STATE.timestamp
88c93
<         if neighbour.color != EMPTY and neighbour.removestamp != TIMESTAMP:
---
>         if neighbour.color != EMPTY and neighbour.removestamp != STATE.timestamp:
208,209c213
<     global TIMESTAMP
<     TIMESTAMP += 1
---
>     STATE.timestamp += 1
222c226
<         if neighbour_ref.timestamp != TIMESTAMP:
---
>         if neighbour_ref.timestamp != STATE.timestamp:
227c231
<             neighbour_ref.timestamp = TIMESTAMP
---
>             neighbour_ref.timestamp = STATE.timestamp
```

When I run this, the number of optimization attempts, traces created and JUMP_BACKWARD count (in Tier 1) exactly matches the behavior of the base of the PR in question. In other words, if you remove global modification from the benchmark, the PR has no effect on the number of traces created/invalidated, which makes a lot of sense.

When I add a new counter for executors being invalidated (check my work -- I think I put the counter in the right places), I see the following:

| Benchmark | Traces created | Executors invalidated |
|---|---|---|
| go, mutating globals | 59,080 | 58,860 |
| go, not mutating globals | 860 | 0 |

When the benchmark modifies globals in this worst-case scenario, a huge fraction of the traces created are ultimately invalidated. There might be some benefit to turning off optimization attempts at that call site or something after a certain threshold of retries (I think @markshannon already has ideas there). But of course, I haven't measured timings there.
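
One possible shape for that (a toy sketch only, with made-up names and numbers, not a proposal for the actual implementation) would be a per-site retry budget plus a growing re-warmup threshold:

```python
# Toy sketch of "back off, then give up, after a site's executor keeps
# getting invalidated".  Names and numbers are illustrative only.
MAX_RETRIES = 4

class JumpSite:
    def __init__(self):
        self.invalidations = 0
        self.threshold = 16          # initial warmup threshold
        self.counter = 0

    def on_executor_invalidated(self):
        self.invalidations += 1
        # Back off: make the site much harder to re-optimize each time.
        self.threshold *= 4

    def should_attempt_optimization(self):
        if self.invalidations >= MAX_RETRIES:
            return False             # give up on this site entirely
        self.counter += 1
        if self.counter >= self.threshold:
            self.counter = 0
            return True
        return False
```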

One last surprising (to me) observation: the new guards _CHECK_GLOBALS and _CHECK_BUILTINS added by the PR never deopt over the entire benchmarking suite. I confirmed this locally to make sure the stats weren't just somehow broken. Does this mean, perhaps, that the watchers always "get there first" to invalidate the executor and thus these guards may not be necessary? I could totally be wrong about that, of course, or maybe there's some future in which the watchers won't catch all cases and these guards are the belt-and-suspenders.

gvanrossum commented 4 months ago

I wonder if this means that we don't disable the watcher-based optimization after the watcher fires too many times? IIRC there's supposed to be a counter that switches to a somewhat more conservative optimization strategy that doesn't depend on watchers for that particular globals dict. Maybe there's something wrong with how that counter is managed? @markshannon ?

gvanrossum commented 4 months ago

> One last surprising (to me) observation: the new guards _CHECK_GLOBALS and _CHECK_BUILTINS added by the PR never deopt over the entire benchmarking suite.

That doesn't surprise me. They are to detect rare edge cases where someone actually replaces the globals or builtins dict with a whole new dict object. That may not even be possible -- I can't think of any Python-level code that allows this, since a module's __dict__ attribute is read-only. I suppose it might be possible using the C API though.
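
A quick Python-level check of that restriction (the exact exception type and message may vary, so this catches either):

```python
import types

mod = types.ModuleType("example")
try:
    mod.__dict__ = {}            # try to swap in a whole new globals dict
except (AttributeError, TypeError) as exc:
    # A module's __dict__ is a read-only attribute, so the assignment is
    # rejected; replacing it would have to go through the C API (if at all).
    print("rejected:", exc)
```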

markshannon commented 4 months ago

I messed up the logic for generating the next dict version, so it resets the watched-modification count to zero when the dict is modified 😞

On the plus side it is easy to fix.
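
In other words (a toy restatement of the bug; the field names and the limit below are made up, not the actual CPython data structures): the per-dict modification count was supposed to keep growing until tier two gives up on that globals dict, but the version-bump path zeroed it, so the limit was never reached and executors kept being created and invalidated:

```python
# Toy restatement of the bug; names and the limit are illustrative only.
GIVE_UP_LIMIT = 8

class WatchedGlobals:
    def __init__(self):
        self.version = 0
        self.modification_count = 0

    def on_modified_intended(self):
        self.version += 1
        self.modification_count += 1   # grows until tier two gives up

    def on_modified_buggy(self):
        # The bug: generating the next version also reset the count,
        # so should_give_up() never became true.
        self.version += 1
        self.modification_count = 0

    def should_give_up(self):
        return self.modification_count >= GIVE_UP_LIMIT
```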

mdboom commented 4 months ago

Also, on the plus side I learned a lot more about how all this works...

mdboom commented 4 months ago

Closing, as @markshannon's fix is in.