Ripping `pyperf` out of the mix, the difference in the branches is still visible.
| | Optimization attempts | Jump backwards |
|---|---|---|
| Head | 2,700 | 336,462 |
| Base | 43 | 734 |
Assuming the diff is just between Mark's PR and its immediate base revision, doesn't this point in the direction of some unintended effect in the PR?
> Assuming the diff is just between Mark's PR and its immediate base revision, doesn't this point in the direction of some unintended effect in the PR?
I think it does, but it's not obvious to me where that is. I'm hoping Mark has an idea or can suggest something to look at next.
It looks like `go` modifies globals a lot (specifically, `TIMESTAMP` and `MOVES`). That would interact badly with Mark's PR, which imposes limits on how many global-value modifications tier two will tolerate before "giving up".
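For anyone following along, the problematic shape is roughly this (a minimal sketch, not the benchmark's actual code; the names `TIMESTAMP` and `MOVES` come from the discussion above, but the bodies are invented):

```python
# A minimal sketch of the pattern: the hot loop writes module-level names
# on every iteration, so the module's globals dict is mutated constantly
# while tier two is trying to trace the loop.
TIMESTAMP = 0
MOVES = 0

def play_move():              # hypothetical hot function
    global TIMESTAMP, MOVES
    TIMESTAMP += 1            # each assignment writes the globals dict,
    MOVES += 1                # firing its watcher / bumping its version

def game(n):
    for _ in range(n):        # a trace formed at this backward jump keeps
        play_move()           # getting invalidated by the writes above
```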
Thanks so much for that, @brandtbucher. That makes a lot of sense.
I think the following would probably be useful:

- A pystats counter for when an executor is invalidated because globals changed (and for any other events that cause executors to be invalidated).
- A new version of the `go` benchmark that doesn't modify globals, to compare against. (Whether we ship that in pyperformance is a separate question.)
- It would be interesting (if possible) to measure the amount of "churn of the same trace". In other words, we measure the number of traces identified ("Optimization attempts"), but we don't measure when the same `JUMP_BACKWARD` is repeatedly re-identified over and over. This probably requires adding a cache entry to `JUMP_BACKWARD` to store this data (when pystats is on); a sketch of the bookkeeping follows this list.
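As a rough model of that last idea (Python-level only, with invented names; the real counter would be a cache entry in `JUMP_BACKWARD` behind `Py_STATS` in C):

```python
# Hypothetical bookkeeping for "churn of the same trace": count attempts
# per backward-jump site, so re-identifications stand out from fresh ones.
from collections import Counter

optimization_attempts = Counter()

def record_optimization_attempt(code, offset):
    """Called each time a trace is (re-)identified at this JUMP_BACKWARD."""
    optimization_attempts[(code, offset)] += 1

def churn_report():
    """Attempts beyond the first at a given site are re-identifications."""
    total = sum(optimization_attempts.values())
    rechurn = sum(n - 1 for n in optimization_attempts.values() if n > 1)
    return f"{rechurn} of {total} attempts were re-identifications"
```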
Some interesting things to report:
I modified the `go` benchmark to not mutate globals. When I run this, the number of optimization attempts, traces created, and `JUMP_BACKWARD` count (in Tier 1) exactly matches the behavior of the base of the PR in question. In other words, if you remove global modification from the benchmark, the PR has no effect on the number of traces created/invalidated, which makes a lot of sense.
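(A sketch of the kind of change involved, not the actual patch: move the mutable counters off the module so the hot loop never writes to the globals dict.)

```python
# Sketch only: keep the counters on an object instead of at module level,
# so incrementing them writes instance attributes rather than the module's
# globals dict -- no watcher fires, no dict version bump.
class GameState:
    def __init__(self):
        self.timestamp = 0
        self.moves = 0

    def play_move(self):
        self.timestamp += 1
        self.moves += 1

def game(n):
    state = GameState()
    for _ in range(n):
        state.play_move()
```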
When I add a new counter for executors being invalidated (check my work -- I think I put the counter in the right places), I see the following:
| Benchmark | Traces created | Executors invalidated |
|---|---|---|
| go, mutating globals | 59,080 | 58,860 |
| go, not mutating globals | 860 | 0 |
When the benchmark modifies globals in this worst-case scenario, a huge fraction of the traces created are ultimately invalidated. There might be some benefit to turning off optimization attempts at that call site (or something similar) after a certain threshold of retries; I think @markshannon already has ideas there. But of course, I haven't measured timings.
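To make the "giving up" idea concrete, one possible (purely hypothetical) mechanism is exponential backoff on the re-optimization threshold:

```python
# Purely hypothetical sketch of "give up after a threshold of retries",
# modeled in Python; the real mechanism would live in the interpreter's
# JUMP_BACKWARD counter in C. All names here are invented.
class OptimizationSite:
    def __init__(self):
        self.threshold = 16          # executions before the next attempt
        self.invalidations = 0

    def on_executor_invalidated(self):
        self.invalidations += 1
        # Each invalidation doubles the wait before re-tracing, so a
        # pathological site (like go's hot loop) stops churning in practice.
        self.threshold = min(self.threshold * 2, 1 << 20)
```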
One last surprising (to me) observation: the new guards `_CHECK_GLOBALS` and `_CHECK_BUILTINS` added by the PR never deopt over the entire benchmarking suite. I confirmed this locally to make sure the stats weren't just somehow broken. Does this mean, perhaps, that the watchers always "get there first" to invalidate the executor and thus these guards may not be necessary? I could totally be wrong about that, of course, or maybe there's some future in which the watchers won't catch all cases and these guards are the belt-and-suspenders.
I wonder if this means that we don't disable the watcher-based optimization after the watcher fires too many times? IIRC there's supposed to be a counter that switches to a somewhat more conservative optimization strategy that doesn't depend on watchers for that particular globals dict. Maybe there's something wrong with how that counter is managed? @markshannon ?
> One last surprising (to me) observation: the new guards `_CHECK_GLOBALS` and `_CHECK_BUILTINS` added by the PR never deopt over the entire benchmarking suite.
That doesn't surprise me. They are there to detect rare edge cases where someone actually replaces the globals or builtins dict with a whole new dict object. That may not even be possible -- I can't think of any Python-level code that allows it, since a module's `__dict__` attribute is read-only. I suppose it might be possible using the C API, though.
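The Python-level half of that claim is easy to check:

```python
# A module's __dict__ attribute can't be rebound from Python code:
import types

mod = types.ModuleType("example")
try:
    mod.__dict__ = {}        # try to swap in a whole new dict object
except AttributeError as exc:
    print(exc)               # "readonly attribute"
```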
I messed up the logic for generating the next dict version, so it resets the watched-modification count to zero when the dict is modified 😞
On the plus side it is easy to fix.
Also, on the plus side I learned a lot more about how all this works...
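Purely as a guess at the shape of such a bug (invented bit layout, not the actual code): if the watched-modification count lives in the low bits of the dict version, a version-bump helper that clears those bits resets the count on every modification instead of incrementing it.

```python
# Invented illustration only -- not CPython's actual dict-version code.
COUNT_BITS = 4
COUNT_MASK = (1 << COUNT_BITS) - 1

def next_version_buggy(version):
    # Bumps the version but zeroes the packed modification count.
    return (version & ~COUNT_MASK) + (1 << COUNT_BITS)

def next_version_fixed(version):
    # Bumps the version and increments the count (saturating at the mask).
    count = min((version & COUNT_MASK) + 1, COUNT_MASK)
    return ((version & ~COUNT_MASK) + (1 << COUNT_BITS)) | count
```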
Closing, as @markshannon's fix is in.
We are still seeing unexpected results in the pystats diffs.
@markshannon suggested I look at a recent PR that adds a globals-to-constants pass, where some changes are expected, but not at the level we are seeing. The original stats diff for that PR didn't include the per-benchmark results, so I re-ran it.
These two sets of results (Mark's run, and my later run of the same commits) are in strong agreement, so there doesn't seem to be anything attributable to randomness or to things that change between runs. I also ruled out problems with summation (i.e. the totals across all benchmarks not being equal to the sum of all benchmarks). I also don't think there is cross-benchmark contamination -- each benchmark is run with a separate invocation of `pyperformance`, and the `/tmp/py_stats` directory is empty in between (I added some asserts to the run to confirm this).

Drilling down on the numbers, the most changed uop in terms of execution count is `TO_BOOL_ALWAYS_TRUE`. This difference is entirely attributable to two benchmarks.
The `go` one is nice to work with because it has no dependencies. Running that benchmark 10 times against the head and base branches produces these numbers exactly every time, so I don't think there is anything non-deterministic in the benchmark.

The other thing that I think @markshannon mentioned should be completely unchanged by the PR is the optimization attempts.
There are many more benchmarks that contribute to this change:
Again, looking at the `go` benchmark, I can reproduce these numbers exactly locally in isolation.

Since "optimization attempts" are counted in `JUMP_BACKWARD` (when reaching a threshold), I also compared that, and I get the following Tier 1 counts for `JUMP_BACKWARD`:
These numbers are not proportional, but they do at least move in the same direction.
I did confirm the obvious: the benchmark is doing the same amount of work and running the same number of times in both cases (just by adding `print`s and counting).

I'm completely stumped as to why that PR changes the number of `JUMP_BACKWARD` executions and thus optimization attempts -- it doesn't seem like that should be affected at all. But it does seem like that could be the cause of a lot of changes "downstream".
I've created a gist to reproduce this that may be helpful. Provided a path to a CPython checkout with an `--enable-pystats` build, it runs the `go` benchmark and reports on the optimization attempts and the number of executions of `JUMP_BACKWARD`.
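In rough outline, that flow looks something like the following (a sketch, not the gist itself: the benchmark script name is a placeholder, and it assumes stats land in `/tmp/py_stats` and are aggregated by `Tools/scripts/summarize_stats.py`):

```python
# A sketch, not the gist itself. Assumptions: an --enable-pystats build of
# CPython, stats written to /tmp/py_stats (the Linux default), and
# Tools/scripts/summarize_stats.py to aggregate them. The benchmark script
# name below is a placeholder for however you invoke the go benchmark.
import shutil
import subprocess
import sys
from pathlib import Path

cpython = Path(sys.argv[1])            # path to the --enable-pystats checkout
built_python = cpython / "python"      # the built interpreter binary

# Start from a clean stats directory so only this run is counted.
stats_dir = Path("/tmp/py_stats")
shutil.rmtree(stats_dir, ignore_errors=True)
stats_dir.mkdir(parents=True)

# Run the go benchmark under the instrumented interpreter.
subprocess.run([str(built_python), "run_go_benchmark.py"], check=True)

# Summarize the stats and pull out the two numbers of interest.
summary = subprocess.run(
    [sys.executable, str(cpython / "Tools/scripts/summarize_stats.py")],
    capture_output=True, text=True, check=True,
).stdout
for line in summary.splitlines():
    if "Optimization attempts" in line or "JUMP_BACKWARD" in line:
        print(line)
```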