faster-cpython / ideas


Run PGO on actual typical usage code #547

Open rhpvorderman opened 1 year ago

rhpvorderman commented 1 year ago

Currently, profile-guided optimization (PGO) is run on the test suite. This has always struck me as a bit odd. For instance, when decompressing a 100 GB gzipped file (quite typical in my work), this calls zlib.ZlibDecompressor.decompress a lot, without any errors.
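For concreteness, the hot loop I have in mind looks something like this sketch (the file name and chunk size are just illustrative):

```python
import zlib

# Illustrative sketch: stream-decompress a large .gz file.  The hot path is
# the decompressor's .decompress() call, invoked many times and never raising.
def stream_decompress_gz(path, chunk_size=1 << 20):
    d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # | 16: accept gzip framing
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield d.decompress(chunk)
    yield d.flush()

# e.g.  for block in stream_decompress_gz("huge.fastq.gz"): process(block)
```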

If I look at the test_zlib code, I see a lot of code that deliberately causes errors so that correct behavior can be guaranteed. This holds true for almost all test code. Corner cases and strange errors should all be tested, but they are hardly encountered in any typical scenario.
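To illustrate the contrast (this is a paraphrase of the flavour of test_zlib, not a verbatim excerpt), much of that test code has this shape:

```python
import unittest
import zlib

# Deliberately feed bad input so that the error paths are exercised and verified.
class BadInputTests(unittest.TestCase):
    def test_decompress_garbage(self):
        self.assertRaises(zlib.error, zlib.decompress, b"this is not a zlib stream")

    def test_bad_compression_level(self):
        self.assertRaises(zlib.error, zlib.compress, b"data", 100)
```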

As a result, PGO is run on a lot of atypical use cases, and I wonder how that affects speed. I took a dive into the CPython repo to check how the PGO setup came to be, and from what I can see the regrtest suite was chosen because it gave the best results (https://github.com/python/cpython/issues/69103). But I was not at all involved in this, so maybe various things have been tried already; in that case the issue can be closed.

As an alternative: there is also a benchmark suite that exercises the most commonly used libraries, is there not? Wouldn't that be a better basis for PGO than a test suite? I currently do not have time to run all the benchmarks, but I want to at least make sure the idea is out there.

mdboom commented 1 year ago

Also related: https://github.com/faster-cpython/ideas/issues/99

itamaro commented 1 year ago

> Also related: https://github.com/faster-cpython/ideas/issues/99

The linked issue is closed, but I'm not sure what the conclusion was, if any.

As far as I can tell, the PGO profile is currently collected on the test suite, so it likely still suffers from overrepresentation of uncommon paths.

I'm curious whether there's anything that can practically be done here that would be a net positive for most users, given that PGO heavily depends on the workload and it's not practical to produce a single profile that is representative of all the diverse ways Python is used in the wild.

gpshead commented 1 year ago

CPython execution consists primarily of the PyEval_Frame interpreter loop, which runs over very predictably constructed sequences of bytecodes performing the same underlying basic operations, regardless of the actual code being executed. The particular Python workload you run does not change the bulk of what the interpreter loop is asked to do. So... don't go into this expecting any major wins over the existing PGO. Always feel free to try, but you may not find anything.
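As a toy illustration of that point (the two functions here are arbitrary examples), unrelated pieces of Python code compile down to a heavily overlapping set of opcodes, so the evaluation loop takes much the same C-level paths either way:

```python
import dis

# Two unrelated functions, arbitrary examples.
def parse_lines(text):
    return [line.split(",") for line in text.splitlines() if line]

def total(prices, rate):
    return sum(p * rate for p in prices)

ops_a = {i.opname for i in dis.Bytecode(parse_lines)}
ops_b = {i.opname for i in dis.Bytecode(total)}
# The interpreter loop dispatches on these opcodes; the sets overlap heavily,
# so the branches exercised inside the loop are similar for either workload.
print(sorted(ops_a & ops_b))
```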

What matters from a PGO perspective is that you've exercised a reasonable total set of code, including performance-critical extension module code (re, json, codecs, the compression libraries, hashlib, ssl, etc.), so that the non-main-loop areas whose performance people also care about get their obvious hot paths exercised.
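As a purely hypothetical sketch (the module choices and sizes are arbitrary, not CPython's actual training task), a hand-rolled training script touching those obvious points could look like:

```python
"""Hypothetical PGO training script; workloads and sizes are arbitrary."""
import hashlib
import json
import re
import zlib

data = b"some moderately repetitive payload " * 10_000

# Compression round trip (zlib).
assert zlib.decompress(zlib.compress(data)) == data

# Hashing (hashlib).
hashlib.sha256(data).hexdigest()

# Regular expression matching (re).
pattern = re.compile(rb"payload")
sum(1 for _ in pattern.finditer(data))

# JSON encode/decode (json).
obj = {"numbers": list(range(1000)), "text": "x" * 1000}
json.loads(json.dumps(obj))
```

For experimentation, the training command used by an --enable-optimizations build is controlled by the PROFILE_TASK make variable (which, if I recall correctly, defaults to roughly `-m test --pgo`), so it can in principle be pointed at a script like the above or at a pyperformance run instead.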

We've so far not found anything to be generally better than simply running the subset of the test suite we run today.

If you want a specific pyperformance benchmark to run faster, you probably want to run that pyperformance benchmark as your PGO training workload. But I wouldn't expect the change to be dramatic, or for it to help much beyond that benchmark. Hence the diverse, non-slow parts of the test suite that we train PGO on by default today.

People with specific, huge-CPU-consumption Python applications are presumed to use their own profiles, including custom-building and linking all of their transitive extension-module dependencies that way. Google and Meta have done this for some of theirs. It isn't feasible for the CPython project to offer, and it isn't something the binary-wheel-focused PyPI packaging ecosystem can offer either.

kmod commented 1 year ago

Empirically, we found that running PGO on the entire test suite (as opposed to the PGO subset) gave a noticeable increase in performance on macrobenchmarks (maybe 0.5%?). I'm pretty sure the Debian packages do this as well, for the same reason. I think the premise we disagree on, which might explain our different understandings, is the role of _PyEval_EvalFrameDefault in C-level performance profiles.

My experience is a few years out of date at this point, and it's entirely possible that the situation has changed since then. On the bright side, this is a pretty straightforward experiment to run, so the question could just be settled empirically.
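A rough sketch of what that experiment could look like (all build flags, paths, and PROFILE_TASK values below are my assumptions about the build system, to be adapted, not verified commands):

```python
# Rough driver for the experiment, run from a CPython source checkout:
# build two PGO'd interpreters with different training workloads, then
# compare them on the pyperformance macrobenchmarks.
import subprocess

def build(prefix, profile_task):
    subprocess.run(["./configure", f"--prefix={prefix}", "--enable-optimizations"], check=True)
    subprocess.run(["make", "-j8", f"PROFILE_TASK={profile_task}"], check=True)
    subprocess.run(["make", "install"], check=True)

build("/opt/py-pgo-subset", "-m test --pgo")   # roughly today's default training task
subprocess.run(["make", "distclean"], check=True)
build("/opt/py-pgo-full", "-m test")           # train on the entire test suite

# Then compare on the macrobenchmarks, e.g.:
#   pyperformance run --python=/opt/py-pgo-subset/bin/python3 -o subset.json
#   pyperformance run --python=/opt/py-pgo-full/bin/python3 -o full.json
#   python -m pyperf compare_to subset.json full.json
```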