faster-cpython / ideas


Use read-optimized ZIP files to store PYC files #500

Closed lpereira closed 1 year ago

lpereira commented 1 year ago

On platforms where opening files may be an inefficient operation (e.g. those with nosy antivirus software), starting applications that import a lot of modules can be significantly slow. This can be a problem especially when tools written in Python are repeatedly called from batch files as the import time can add up pretty quickly.

One idea that might be worth considering is, instead of generating PYC files individually on the file system, to generate them inside a ZIP file (with some other extension, maybe PYZ?). If this file is organized in a way that respects the order in which the modules are loaded, we can also hint to the operating system to read ahead the whole archive, since we know its contents are going to be needed; this strategy is used by Firefox when reading ZIPs.
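A minimal sketch of that readahead hint for POSIX systems (the helper name is mine; Windows would need a different prefetch mechanism):

```python
import os

def hint_readahead(path):
    """Ask the kernel to start pulling the whole file into the page
    cache ahead of use (POSIX only; silently does nothing elsewhere)."""
    if not hasattr(os, "posix_fadvise"):
        return  # e.g. Windows: no fadvise available
    fd = os.open(path, os.O_RDONLY)
    try:
        # A length of 0 means "advise to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)
```

The advice is asynchronous, so the kernel can overlap the reads with whatever the interpreter does next.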

I'm suggesting a ZIP file here because we already have zipimporter and this could be prototyped rather quickly, but this does not necessarily need to be implemented as a ZIP file. I'm not even considering compression at this point (temporary memory would need to be allocated to decompress the PYC files before they can be unmarshalled, which could negate some of the performance benefits, especially on systems with fast disks), although that's certainly an option in an archive format such as ZIP.
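As a sanity check that the machinery already exists, here is a small self-contained demo (the module name greet and the .pyz layout are illustrative only) that packs a byte-compiled module into a stored ZIP and imports it through the built-in zipimport support:

```python
import py_compile
import sys
import tempfile
import zipfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# Write and byte-compile a throwaway module ("greet" is an arbitrary name).
src = tmp / "greet.py"
src.write_text("def hello():\n    return 'hello from the archive'\n")
pyc = tmp / "greet.pyc"
py_compile.compile(str(src), cfile=str(pyc))

# Pack only the .pyc, stored rather than deflated, mimicking the
# hypothetical .pyz cache layout.
archive = tmp / "cache.pyz"
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.write(pyc, arcname="greet.pyc")

# Zip archives on sys.path are handled by zipimport automatically.
sys.path.insert(0, str(archive))
import greet
print(greet.hello())  # prints: hello from the archive
```

Note that the archive's extension doesn't matter to zipimport; it only checks that the file is a valid zip.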

One issue that might arise is that, with a PYZ file, invalidating a single PYC file when the respective PY file changes (especially if the PYZ file is organized in import order) becomes difficult, so this might not be an option for someone who is developing the tools written in Python rather than just using them.

arhadthedev commented 1 year ago

By coincidence, there is a PR that could make such a thing possible (edit: only for the stdlib, a library set that changes just with relatively rare Python updates):

gvanrossum commented 1 year ago

Could the zip be memory mapped? That might make unmarshal even faster.

lpereira commented 1 year ago

Memory-mapping the whole thing is what I would do on systems where that's supported, yes.

oraluben commented 1 year ago

I'd like to re-raise our project here for reference. It memory-maps code objects from a single file, avoiding multiple I/O operations and even unmarshaling entirely.

It faces issues similar to this proposal's (tricky when code changes) and sometimes worse ones (only one memory-mapped file can be loaded at a time, a limitation the zip approach doesn't fundamentally have), but for stable deployments I'd expect a memory-mapped file to give a better speedup.

Thus I think this is better than the zip proposal in some circumstances, and I’m willing to draft a PEP if needed.

gvanrossum commented 1 year ago

Yeah, I was thinking of the Alibaba project too, but the difference here (unless I misunderstand @lpereira) is that here we're just considering putting the existing PYC files in a zip file. This makes the tooling much simpler. However, it solves fewer problems -- basically this issue only really solves the problems caused by aggressive virus scanners like Windows Defender.

When I brought up mmap I was thinking of just mmap'ping the zip file. Presumably the central index of the zip file will make it simple to find each individual item in the zip file, and then we can just call PyMarshal_ReadObjectFromString.
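A rough pure-Python sketch of that idea, assuming a stored (uncompressed) .pyc member and the 16-byte pyc header used since CPython 3.7 (the helper name is hypothetical; a C implementation would call PyMarshal_ReadObjectFromString instead of marshal.loads):

```python
import marshal
import mmap
import struct
import zipfile

def load_code_from_mapped_zip(archive_path, member):
    """Unmarshal the code object for a stored .pyc member straight out
    of a memory-mapped archive (no per-module open() or read())."""
    with open(archive_path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        info = zipfile.ZipFile(f).getinfo(member)
        assert info.compress_type == zipfile.ZIP_STORED
        # The local file header is 30 fixed bytes, then the file name
        # and an optional extra field; their lengths sit at offsets
        # 26 and 28 within the header.
        name_len, extra_len = struct.unpack_from(
            "<HH", mm, info.header_offset + 26)
        start = info.header_offset + 30 + name_len + extra_len
        # Skip the 16-byte pyc header (magic, flags, mtime, size).
        return marshal.loads(mm[start + 16 : start + info.file_size])
```

A real implementation would avoid the slice copy (e.g. via memoryview) and actually validate the pyc header rather than blindly skipping it.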

oraluben commented 1 year ago

On the other hand, I doubt that mmapping a zip file will make a significant difference compared to reading the file from disk. In our project, we have to preload the mmap into memory to avoid page faults, otherwise importing is even slower. For the zip approach, I think it will be the same without preloading; but preloading one file is no different from just reading all of its contents into memory.
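For reference, the preloading described above can be sketched in pure Python (madvise is available on mmap objects since Python 3.8 on POSIX; the page-touching loop is the portable fallback):

```python
import mmap

def prefault(mm, page=mmap.PAGESIZE):
    """Pull every page of a read-only mapping into memory up front, so
    later accesses don't stall on page faults mid-import."""
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        mm.madvise(mmap.MADV_WILLNEED)  # asynchronous readahead hint
    # Touching each page pays the fault cost here instead of later.
    for offset in range(0, len(mm), page):
        mm[offset]
```

As the comment above points out, doing this for a single file is functionally equivalent to reading the whole file into memory.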

gvanrossum commented 1 year ago

That's an interesting aspect of mmap -- I have to admit I have little experience with it personally so maybe we should implement the simpler version (without mmap) first, if we're going to do anything like this.

lpereira commented 1 year ago

The readahead that Firefox does (as mentioned in the linked blog post) is roughly the same as pre-faulting the whole mapped space, exactly for the reason that @oraluben mentioned. This would avoid (or at the very least diminish) latency spikes while unmarshaling when the kernel is handling a fault. (Laying out the file in import order also helps prefetching and reduces the chance of the ZIP reader bouncing around the mapped space, which can be bad if the data hasn't been paged in.)

In any case, I think that patching zipimporter just enough so that it understands what a PYZ file is, and generating PYC files inside a PYZ file instead of on disk (even if only when given a command-line parameter), would be enough to validate that this approach works well on Windows with antivirus software. Optimization strategies can come later. Since @gvanrossum mentioned importing from ZIP files isn't that common (I remember using it in an embedded Linux system a good while back, though!), I'm sure there is low-hanging optimization fruit there that could be looked at.

mkbosmans commented 1 year ago

This would be very welcome for those of us having to deal with corporate environments, where an operating system that already has slow process startup is handicapped further by a virus scanner (and there are definitely more aggressive virus scanners than Windows Defender out there).

I just want to caution against too much optimism here. As far as I can tell, most virus scanners go out of their way to inspect the contents of zip files too, so this whole endeavor might not be a net win.

2-5 commented 1 year ago

How about storing the PYC in one SQLite database per venv? In one machine learning venv I have 10k PYC files (100 MB) spread over 1000 __pycache__ directories.

SQLite is well optimized and debugged, and already implements all the tricky parts: how do you update a PYC inside the zip? Do you reclaim/reuse freed space? What if two Python processes want to operate on the ZIP at the same time?
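A hypothetical sketch of what that single-database layout might look like, with WAL journaling as one answer to the concurrent-process question (the schema and helper names are mine, not from any prototype):

```python
import sqlite3

def open_pyc_db(path):
    """Open (or create) a per-venv bytecode cache database."""
    conn = sqlite3.connect(path)
    # WAL mode lets many reader processes proceed while one writes.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pyc ("
        "  path  TEXT PRIMARY KEY,"  # the would-be .pyc file location
        "  mtime INTEGER,"           # source mtime, for invalidation
        "  data  BLOB)"              # raw .pyc bytes
    )
    return conn

def put(conn, path, mtime, data):
    conn.execute("INSERT OR REPLACE INTO pyc VALUES (?, ?, ?)",
                 (path, mtime, data))
    conn.commit()

def get(conn, path):
    row = conn.execute("SELECT data FROM pyc WHERE path = ?",
                       (path,)).fetchone()
    return row[0] if row else None
```

Space reclamation then falls out of SQLite's own free-list handling (plus VACUUM when desired), which is exactly the bookkeeping a hand-rolled zip updater would have to reinvent.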

Caching parsed information in one SQLite database is already used by many IDEs to speed up project loading.

ericsnowcurrently commented 1 year ago

How about storing the PYC in one SQLite database per venv?

@zooba

gvanrossum commented 1 year ago

I really like the lateral thinking in the idea of using a SQLite database. I hope someone will attempt a prototype. I expect some bootstrapping pains (the .py files in the sqlite3 directory would need to be (deep)frozen) but I don't think that will be a big problem.

2-5 commented 1 year ago

One alternative to freezing the .py sqlite3 package is for the C import code to directly call the C API of SQLite. I'm not sure what implications that would bring for importlib. Maybe importlib could use the sqlite3 package.

zooba commented 1 year ago

I have a prototype of packing into a SQLite DB, but it doesn't seem to provide any real advantage over a zip file. Besides bypassing the AV, the biggest win I got was preloading the DB into RAM, and caching qualname->path mapping to bypass all the built-in importers (essentially, their search functionality). And on WSL/Ubuntu I only broke even after adding all these tricks and testing compression/CPU/IO/etc. tradeoffs.

My prototype was based around the idea that you'd install all your packages, then run a caching process to generate the DB, then never look at the real filesystem again. (You don't even need the source files at that point, though having them means you get better tracebacks, etc.) So not a great general-purpose approach, but I think anything that allows more dynamic behaviour is going to be a regression. As a way to make using pyz files more attractive (as we discussed at https://discuss.python.org/t/allow-uploading-pyz-zipapp-files-to-pypi/19263), it might help, but probably isn't a significant improvement over zip files.

Incidentally, we should have some updates coming on the Windows/Windows Defender side to help improve performance in Python-like scenarios. How widely they get deployed is going to depend on the security implications, but anyone who gets them should see the import overhead go away almost completely.

2-5 commented 1 year ago

And on WSL/Ubuntu I only broke even after adding all these tricks and testing compression/CPU/IO/etc. tradeoffs.

Did you enable mmap for SQLite? In their 2017 tests, enabling that gives a 30% improvement over the filesystem even on Ubuntu.

Further performance improvements can be made by using the memory-mapped I/O feature of SQLite. In the next chart, the entire 1GB database file is memory mapped and blobs are read (in random order) using the sqlite3_blob_read() interface. With these optimizations, SQLite is twice as fast as Android or MacOS-X and over 10 times faster than Windows.

https://www.sqlite.org/fasterthanfs.html
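For what it's worth, memory-mapped I/O is off by default in SQLite but can be enabled per connection; a minimal sketch (the cache.db filename is illustrative, and the effective limit may be capped by the build's SQLITE_MAX_MMAP_SIZE):

```python
import sqlite3

conn = sqlite3.connect("cache.db")  # hypothetical cache database
# Let SQLite serve reads through a memory map of up to 1 GiB of the
# file instead of read() calls; the default mmap_size of 0 disables it.
conn.execute("PRAGMA mmap_size = 1073741824")
effective = conn.execute("PRAGMA mmap_size").fetchone()[0]
```

Reading the pragma back is the way to confirm what limit actually took effect.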

gvanrossum commented 1 year ago

I wouldn't use WSL to make any claims about performance (either positive or negative). Is your prototype available publicly somewhere, so people could at least play with it?

zooba commented 1 year ago

Did you enable mmap for SQLite?

I only used what's in the sqlite3 module, though I did use the dump method to load it all into a :memory: connection rather than continuously reading from the disk. It certainly wasn't using memory mapping on Windows by default, so doing one straight read was a huge improvement over the default.

Is your prototype available publicly somewhere, so people could at least play with it?

No, because I've been tweaking it into something useful. It's in a mid-state right now where it neither works nor can it easily be reverted back to the prototype form - that'll teach me for not keeping it in version control ;)

2-5 commented 1 year ago

I did a quick proof-of-concept of SQLite .pyc caching - https://github.com/2-5/cpython/tree/sqlite-pyc

It involved intercepting the 3 places in importlib._bootstrap_external where .pyc files are read from/written to disk, and adding one new builtin module, _sqlitepyc, with 3 functions: init(sqlite_path), get(bytecode_path), and set(bytecode_path, data).

I added a new -f CLI flag and associated env variable (PYTHONUSESQLITEPYCACHE) so you can easily switch without rebuilding anything.

Active Windows Defender - 25% faster


Excluded from Windows Defender - 7% faster


What is measured in the graphs above

The following script is run from another runner script, with or without the new -f argument, and the whole process time is measured (so including startup/teardown).

import sys

import json
import asyncio
import unittest
import email.policy

import networkx

def main():
    return 99

if __name__ == "__main__":
    sys.exit(main())

The script imports a number of packages which generate large numbers of .pyc files. networkx is external, so it needs installing. In total about 400 .pyc files. Unfortunately I couldn't test with large machine learning packages since they all have C-extensions and don't yet have 3.12 binaries.

There are 400 runs of each kind, and no .pyc/.sqlite files are deleted during the test, so all the OS file caches are extremely warm.

The test files are in lib/test/sqlitepyc.
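A simplified stand-in for such a runner, timing whole-process duration including startup/teardown (the -c payload here is just a placeholder for the benchmark script above):

```python
import subprocess
import sys
import time

def time_process(argv, runs=3):
    """Median wall-clock time of a full interpreter run, in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

# Compare a plain run against one with the prototype's flag enabled.
baseline = time_process([sys.executable, "-c", "import json"])
```

Measuring the full process from outside, as done here, is what captures the antivirus overhead on every file open; an in-process timer would miss the interpreter's own startup.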

markshannon commented 1 year ago

Experimenting with zip shows ~60% compression on a few pyc files. We can, however, get similar compression by skipping caches and by better removing redundancy in the consts and names of code objects. Custom compression also gives us improved unmarshalling performance, whereas unzipping is relatively slow.
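One quick way to get a feel for such ratios is to deflate a freshly compiled .pyc with zlib, which is what zipfile's ZIP_DEFLATED mode uses under the hood (the sample module here is synthetic, so its exact ratio will differ from real modules):

```python
import py_compile
import tempfile
import zlib
from pathlib import Path

# Compile a synthetic module and see how well its .pyc deflates.
tmp = Path(tempfile.mkdtemp())
src = tmp / "sample.py"
src.write_text("\n".join(
    f"def f{i}(x):\n    return x + {i}" for i in range(50)))
pyc = tmp / "sample.pyc"
py_compile.compile(str(src), cfile=str(pyc))

raw = pyc.read_bytes()
packed = zlib.compress(raw)  # zlib's default level, as zipfile uses
ratio = len(packed) / len(raw)
```

Marshal output is repetitive (interned names, similar code object layouts), which is why general-purpose deflate already does fairly well on it.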