faster-cpython / ideas

Investigation: Why are pystats changing in unexpected ways? #652

Closed · mdboom closed this 4 months ago

mdboom commented 4 months ago

We are still seeing unexpected results in the pystats diffs.

@markshannon suggested I look at a recent PR to add a globals-to-constants pass, where there should be some changes, but not to the level that we are seeing. The original stats diff for that PR didn't include the per-benchmark results, so I re-ran it.

These two sets of results (Mark's run, and my later run of the same commits) are in strong agreement, so the differences don't seem attributable to randomness or anything that changes between runs. I also ruled out problems with summation (i.e. the overall totals not matching the sum of the per-benchmark results). And I don't think there is cross-benchmark contamination -- each benchmark is run with a separate invocation of pyperformance, and the /tmp/py_stats directory is empty in between (I added some asserts to the run to confirm this).

Drilling down on the numbers, the most changed uop in terms of execution count is TO_BOOL_ALWAYS_TRUE:

| Name | Base Count | Head Count | Change |
|---|---|---|---|
| TO_BOOL_ALWAYS_TRUE | 12,145,706 | 30,824,186 | 153.8% |

This difference is entirely attributable to two benchmarks:

| Benchmark | Base | Head |
|---|---|---|
| go | 5,840 | 129,400 |
| pycparser | 11,120,400 | 29,675,320 |

The go one is nice to work with because it has no dependencies. Running that benchmark 10 times against the head and base branches produces these numbers exactly every time, so I don't think there is anything non-deterministic in the benchmark.

The other thing that I think @markshannon mentioned should be completely unchanged by the PR is the number of optimization attempts.

There are many more benchmarks that contribute to this change:

| Benchmark | Base | Head |
|---|---|---|
| async_generators | 1,060 | 1,260 |
| asyncio_websockets | 420 | 480 |
| concurrent_imap | 4,462 | 4,465 |
| dask | 4,274 | 4,249 |
| deltablue | 440 | 18,900 |
| docutils | 11,920 | 11,980 |
| genshi | 35,560 | 35,640 |
| go | 860 | 74,920 |
| html5lib | 1,020 | 1,040 |
| mypy2 | 16,536 | 16,597 |
| pycparser | 1,200 | 3,320 |
| regex_v8 | 1,560 | 2,340 |
| sqlglot | 3,280 | 3,320 |
| sqlglot_optimize | 5,160 | 5,220 |
| sqlglot_parse | 380 | 440 |
| sqlglot_transpile | 1,340 | 1,400 |
| sympy | 13,798 | 13,903 |
| tornado_http | 1,080 | 1,140 |
| typing_runtime_protocols | 700 | 780 |

Again, looking at the go benchmark, I can reproduce these numbers exactly locally in isolation.

Since "optimization attempts" are counted in "JUMP_BACKWARD" (when reaching a threshold), I also compared that, and I get the following Tier 1 counts for "JUMP_BACKWARD":

| | Base | Head |
|---|---|---|
| Optimization attempts | 860 | 74,920 |
| JUMP_BACKWARD | 14,860 | 28,402,880 |

These numbers are not proportional, but they do at least move in the same direction.
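
For what it's worth, my mental model of the relationship (a simplified Python sketch, not the actual implementation in the interpreter; the names and the threshold value below are made up) is that each Tier 1 execution of JUMP_BACKWARD bumps a per-site warmup counter, and an optimization attempt only happens when that counter crosses a threshold:

```python
# Toy model of how "Optimization attempts" relate to JUMP_BACKWARD executions.
# Names and the threshold value are illustrative, not CPython's actual code.
from collections import defaultdict

JUMP_BACKWARD_THRESHOLD = 16              # made-up warmup threshold

jump_backward_count = 0                   # the Tier 1 "JUMP_BACKWARD" stat
optimization_attempts = 0                 # the "Optimization attempts" stat
warmup = defaultdict(int)                 # per-site counter (in CPython this
                                          # would live in the instruction's cache)

def jump_backward(site):
    """Executed every time the backward jump at `site` is taken in Tier 1."""
    global jump_backward_count, optimization_attempts
    jump_backward_count += 1
    warmup[site] += 1
    if warmup[site] >= JUMP_BACKWARD_THRESHOLD:
        optimization_attempts += 1        # try to project and attach a trace
        warmup[site] = 0
```

Under that model, a site whose executor keeps getting invalidated falls back to Tier 1 and has to warm up all over again, which would explain both counts rising together without being strictly proportional.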

I did confirm the obvious: the benchmark is doing the same amount of work and running the same number of times in both cases (just by adding prints and counting).

I'm completely stumped as to why that PR changes the number of JUMP_BACKWARD and thus optimization attempts -- it doesn't seem like that should be affected at all. But it does seem like that could be the cause of a lot of changes "downstream".

I've created a gist to reproduce this that may be helpful. Given the path to a CPython checkout with a --enable-pystats build, it runs the go benchmark and reports the number of optimization attempts and executions of JUMP_BACKWARD.
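
For reference, a rough sketch of the same idea (an approximation, not the gist itself; the key names it matches, the `python` binary location, and the way the benchmark is invoked are assumptions):

```python
#!/usr/bin/env python3
"""Rough reproduction sketch: run the go benchmark on a --enable-pystats
build and sum the interesting counters from /tmp/py_stats.  Assumes the
raw files there are plain "key : value" lines (the format consumed by
Tools/scripts/summarize_stats.py)."""
import pathlib
import subprocess
import sys
from collections import Counter

STATS_DIR = pathlib.Path("/tmp/py_stats")
python = pathlib.Path(sys.argv[1]) / "python"     # the --enable-pystats build

# Start from a clean slate so only this run's stats are counted.
STATS_DIR.mkdir(exist_ok=True)
for f in STATS_DIR.iterdir():
    f.unlink()

# Run the benchmark directly (no pyperf) so nothing else writes stats.
# "run_go.py" is a stand-in for however you invoke the go benchmark.
subprocess.run([str(python), "run_go.py"], check=True)

# Each exiting interpreter dumps one stats file; sum matching counters.
counters = Counter()
for stats_file in STATS_DIR.iterdir():
    for line in stats_file.read_text().splitlines():
        key, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            counters[key.strip()] += int(value)

for key, value in sorted(counters.items()):
    if "Optimization attempts" in key or "JUMP_BACKWARD" in key:
        print(f"{key}: {value:,}")
```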

mdboom commented 4 months ago

Ripping pyperf out of the mix, the difference in the branches is still visible.

Head:

Optimization attempts: 2700 jump backwards: 336462

Base:

Optimization attempts: 43 jump backwards: 734

gvanrossum commented 4 months ago

Assuming the diff is just between Mark's PR and its immediate base revision, doesn't this point in the direction of some unintended effect in the PR?

mdboom commented 4 months ago

> Assuming the diff is just between Mark's PR and its immediate base revision, doesn't this point in the direction of some unintended effect in the PR?

I think it does, but it's not obvious to me where that is. I'm hoping Mark has an idea or can suggest something to look at next.

brandtbucher commented 4 months ago

It looks like go modifies globals a lot (specifically, TIMESTAMP and MOVES). That would interact badly with Mark’s PR, which imposes limits on how many global value modifications tier two will tolerate before “giving up”.
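
For reference, the pattern looks roughly like this (simplified; the full diff of the benchmark appears later in the thread):

```python
# Simplified version of the pattern in go: the hot path mutates
# module-level globals on every single move.
TIMESTAMP = 0
MOVES = 0

def make_move():            # stand-in for the benchmark's real move method
    global TIMESTAMP, MOVES
    TIMESTAMP += 1          # each increment is a modification of the
    MOVES += 1              # module's globals dict, which the PR watches
```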

mdboom commented 4 months ago

Thanks so much for that, @brandtbucher. That makes a lot of sense.

I think the following would probably be useful:

  1. A pystats counter for executors being invalidated when globals change (and for any other events that cause executors to be invalidated).

  2. A new version of the go benchmark that doesn't modify globals to compare to. (It's a separate question whether we ship that in pyperformance).

  3. It would be interesting (if possible) to measure the amount of "churn of the same trace". In other words, we measure the number of traces identified ("Optimizations attempted"), but we don't measure when the same JUMP_BACKWARD is repeatedly re-identified. This probably requires adding a cache entry to JUMP_BACKWARD to store this data (when pystats is on); a rough sketch of the bookkeeping follows below.
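
A very rough Python-level sketch of that bookkeeping (purely illustrative; in CPython it would live in the instruction's inline cache and be dumped with the rest of pystats):

```python
# Toy illustration of measuring "churn of the same trace": how often the
# same JUMP_BACKWARD site is re-selected for optimization.
from collections import Counter

attempts_per_site = Counter()

def record_optimization_attempt(code, offset):
    """Call each time a trace is projected starting at (code, offset)."""
    attempts_per_site[(code, offset)] += 1

def churn_report():
    total = sum(attempts_per_site.values())
    unique_sites = len(attempts_per_site)
    # Sites attempted more than once are being re-identified over and over.
    churned = sum(n - 1 for n in attempts_per_site.values() if n > 1)
    return {
        "Optimizations attempted": total,
        "Unique JUMP_BACKWARD sites": unique_sites,
        "Repeat attempts (churn)": churned,
    }
```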

mdboom commented 4 months ago

Some interesting things to report:

I modified the go benchmark to not mutate globals.

Diff to the `go` benchmark so it does not modify globals:

```diff
20,21d19
< TIMESTAMP = 0
< MOVES = 0
32a31,38
> class State:
>     def __init__(self):
>         self.timestamp = 0
>         self.moves = 0
>
> STATE = State()
>
>
38,39c44,45
<         self.timestamp = TIMESTAMP
<         self.removestamp = TIMESTAMP
---
>         self.timestamp = STATE.timestamp
>         self.removestamp = STATE.timestamp
52,54c58,59
<         global TIMESTAMP, MOVES
<         TIMESTAMP += 1
<         MOVES += 1
---
>         STATE.timestamp += 1
>         STATE.moves += 1
79c84
<         self.removestamp = TIMESTAMP
---
>         self.removestamp = STATE.timestamp
88c93
<         if neighbour.color != EMPTY and neighbour.removestamp != TIMESTAMP:
---
>         if neighbour.color != EMPTY and neighbour.removestamp != STATE.timestamp:
208,209c213
<     global TIMESTAMP
<     TIMESTAMP += 1
---
>     STATE.timestamp += 1
222c226
<         if neighbour_ref.timestamp != TIMESTAMP:
---
>         if neighbour_ref.timestamp != STATE.timestamp:
227c231
<             neighbour_ref.timestamp = TIMESTAMP
---
>             neighbour_ref.timestamp = STATE.timestamp
```

When I run this, the number of optimization attempts, traces created and JUMP_BACKWARD count (in Tier 1) exactly matches the behavior of the base of the PR in question. In other words, if you remove global modification from the benchmark, the PR has no effect on the number of traces created/invalidated, which makes a lot of sense.

When I add a new counter for executors being invalidated (check my work -- I think I put the counter in the right places), I see the following:

| Benchmark | Traces created | Executors invalidated |
|---|---|---|
| go, mutating globals | 59,080 | 58,860 |
| go, not mutating globals | 860 | 0 |

When the benchmark modifies globals in this worst-case scenario, a huge fraction of the traces created are ultimately invalidated. There might be some benefit to turning off optimization attempts at that call site or something after a certain threshold of retries (I think @markshannon already has ideas there). But of course, I haven't measured timings there.
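
One possible shape for that (a toy sketch only, with made-up names and numbers, not a proposal for the actual implementation) would be a per-site retry budget plus a growing re-warmup threshold:

```python
# Toy sketch of "back off, then give up, after a site's executor keeps
# getting invalidated".  Names and numbers are illustrative only.
MAX_RETRIES = 4

class JumpSite:
    def __init__(self):
        self.invalidations = 0
        self.threshold = 16          # initial warmup threshold
        self.counter = 0

    def on_executor_invalidated(self):
        self.invalidations += 1
        # Back off: make the site much harder to re-optimize each time.
        self.threshold *= 4

    def should_attempt_optimization(self):
        if self.invalidations >= MAX_RETRIES:
            return False             # give up on this site entirely
        self.counter += 1
        if self.counter >= self.threshold:
            self.counter = 0
            return True
        return False
```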

One last surprising (to me) observation: the new guards _CHECK_GLOBALS and _CHECK_BUILTINS added by the PR never deopt over the entire benchmarking suite. I confirmed this locally to make sure the stats weren't just somehow broken. Does this mean, perhaps, that the watchers always "get there first" to invalidate the executor and thus these guards may not be necessary? I could totally be wrong about that, of course, or maybe there's some future in which the watchers won't catch all cases and these guards are the belt-and-suspenders.

gvanrossum commented 4 months ago

I wonder if this means that we don't disable the watcher-based optimization after the watcher fires too many times? IIRC there's supposed to be a counter that switches to a somewhat more conservative optimization strategy that doesn't depend on watchers for that particular globals dict. Maybe there's something wrong with how that counter is managed? @markshannon ?

gvanrossum commented 4 months ago

> One last surprising (to me) observation: the new guards _CHECK_GLOBALS and _CHECK_BUILTINS added by the PR never deopt over the entire benchmarking suite.

That doesn't surprise me. They are to detect rare edge cases where someone actually replaces the globals or builtins dict with a whole new dict object. That may not even be possible -- I can't think of any Python-level code that allows this, since a module's __dict__ attribute is read-only. I suppose it might be possible using the C API though.
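
A quick Python-level check of that restriction (the exact exception type and message may vary, so this catches either):

```python
import types

mod = types.ModuleType("example")
try:
    mod.__dict__ = {}            # try to swap in a whole new globals dict
except (AttributeError, TypeError) as exc:
    # A module's __dict__ is a read-only attribute, so the assignment is
    # rejected; replacing it would have to go through the C API (if at all).
    print("rejected:", exc)
```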

markshannon commented 4 months ago

I messed up the logic for generating the next dict version, so it resets the watched-modification count to zero when the dict is modified 😞

On the plus side it is easy to fix.
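
In other words (a toy restatement of the bug; the field names and the limit below are made up, not the actual CPython data structures): the per-dict modification count was supposed to keep growing until tier two gives up on that globals dict, but the version-bump path zeroed it, so the limit was never reached and executors kept being created and invalidated:

```python
# Toy restatement of the bug; names and the limit are illustrative only.
GIVE_UP_LIMIT = 8

class WatchedGlobals:
    def __init__(self):
        self.version = 0
        self.modification_count = 0

    def on_modified_intended(self):
        self.version += 1
        self.modification_count += 1   # grows until tier two gives up

    def on_modified_buggy(self):
        # The bug: generating the next version also reset the count,
        # so should_give_up() never became true.
        self.version += 1
        self.modification_count = 0

    def should_give_up(self):
        return self.modification_count >= GIVE_UP_LIMIT
```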

mdboom commented 4 months ago

Also, on the plus side I learned a lot more about how all this works...

mdboom commented 4 months ago

Closing, as @markshannon's fix is in.