Closed · gvanrossum closed this issue 3 years ago
Some notes about where things are at:
I should have some rough benchmark numbers up tomorrow.
Thanks for cleaning up some of my technical debt! After discussing things with Mark this is probably not something we should pursue further in the short term (though I'd like to see those benchmark numbers!). Some issues:
Here are the benchmark results. Note that I ran them in a Linux VM on Windows, so the numbers should be considered very rough.
For the most part the disassembler branch improved performance, while a few benchmarks were slower; the mean was slightly faster. However, I don't put much trust in the results, given that I used a VM. (Both `python3 -m pyperf stats` and `python3 -m pyperf hist` show a relatively high standard deviation.) The nice thing is that there were no massive outliers and the outcome wasn't clearly negative. 🙂
The directory structure I used is the following:

```
./cpython-perf/
    cpython/                 # repo
        .perf/               # data dir for benchmark runs
    pyperformance/           # repo
    results-disassembler/
    ms-perf.ini
```
The config file looks like this:
To run the suite I did the following:

```
py3 -m pyperformance compile ../ms-perf.ini cc12888f9b master
mv ../cpython/.perf/*.json.gz ../results-disassembler
py3 -m pyperformance compile ../ms-perf.ini disassembler
mv ../cpython/.perf/*.json.gz ../results-disassembler
py3 -m pyperformance compare --csv ../disassembler-delta.csv ../results-disassembler/*.json.gz
py3 -m pyperformance compare -O table ../results-disassembler/*.json.gz
```
Thanks! I agree that the variance is a bit high to put much faith in these. One conclusion is that the existing LOAD_ATTR optimization for slots isn't so bad! Another is that probably most of the benchmark suite doesn't use slots a lot. The one that I know is sensitive to slots performance ("float") also enjoyed the highest speedup, validating that we're indeed seeing the effect of the work on slots in this branch.
See my disassembler branch
If we're going to be specializing bytecode, at some point we're going to need to fix up jumps. Dino's Shadowcode (see #3) solves this by only ever replacing a single opcode with another, inserting NOPs when the new opcode is shorter, so all jump targets are unchanged. But I expect that at some point we're going to need to replace a single opcode with several others.
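The NOP-padding trick can be sketched in a few lines. This is an illustrative toy, not Dino's actual Shadowcode implementation: it overwrites a span of raw bytecode with a (shorter or equal-length) replacement and pads with NOPs, so every byte offset, and therefore every jump target, stays exactly where it was.

```python
import dis

NOP = dis.opmap["NOP"]

def replace_in_place(code_bytes, offset, old_len, new_instr):
    """Overwrite `old_len` bytes at `offset` with `new_instr` plus NOP padding.

    `code_bytes` is a mutable bytearray of raw bytecode. Since the total
    length never changes, no jump fixups are needed afterwards.
    """
    assert len(new_instr) <= old_len, "replacement may not grow the code"
    padded = bytearray(new_instr)
    # Each instruction is 2 bytes (opcode, oparg), so pad in 2-byte units.
    while len(padded) < old_len:
        padded += bytes([NOP, 0])
    code_bytes[offset:offset + old_len] = padded
    return code_bytes
```

The limitation mentioned above falls out of the `assert`: the moment a specialized sequence needs *more* bytes than the original, this trick no longer works and you need real reassembly with jump fixups.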
So I did some work on disassembling bytecode back into the data structures used by the compiler/assembler. Given a bytecode array you'll get a series of basic blocks back, which you can modify (e.g. insert or delete instructions or add new basic blocks), and then present them to the assembler which will construct a new bytecode object for you, with jumps fixed.
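To make the round trip concrete, here is a toy sketch of the disassemble → edit → reassemble workflow. The names (`Instr`, `BasicBlock`, `rewrite_load_attr`) are illustrative only, not the actual data structures used by the compiler/assembler; the point is that once jumps target blocks rather than byte offsets, a rewrite pass never has to fix up jumps itself.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    opname: str
    arg: int = 0

@dataclass
class BasicBlock:
    instrs: list = field(default_factory=list)
    next: "BasicBlock | None" = None  # fallthrough successor

def rewrite_load_attr(blocks):
    """Replace LOAD_ATTR with a hypothetical LOAD_ATTR_SLOT, in place.

    Inserting, deleting, or replacing instructions here needs no jump
    fixups: the assembler recomputes byte offsets when it emits the
    final bytecode from the blocks.
    """
    for block in blocks:
        for instr in block.instrs:
            if instr.opname == "LOAD_ATTR":
                instr.opname = "LOAD_ATTR_SLOT"
    return blocks
```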
As a demonstration I added some opcodes for indexing slots (LOAD_ATTR_SLOT and STORE_ATTR_SLOT -- the latter was added by @ericsnowcurrently), but I suspect that the true value here is in the infrastructure for disassembling. (That particular optimization is questionable, since we already have an inline cache for slots in LOAD_ATTR.)
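For readers unfamiliar with why slot access is a specialization target: attributes declared in `__slots__` live at fixed offsets inside the instance (exposed as slot descriptors on the class), rather than in a per-instance `__dict__` that must be searched. A small illustration:

```python
class Point:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
# No instance dict exists -- a specialized opcode can fetch x and y
# directly by offset instead of doing a dict lookup.
assert not hasattr(p, "__dict__")
assert type(Point.x).__name__ == "member_descriptor"
```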
There are some unresolved issues, e.g. I assume that the constants table remains the same (and there's an implicit assumption that `None` is always present). We could probably refactor more of the assembler to check this.