Open martis42 opened 10 months ago
Yeah, the caching can only do so well. The stdlib is a decent size. It's just a lot of files that have to be symlinked.
I wonder if we'd be able to zip up the stdlib. I think that can Just Work.
Otherwise, maybe declare_symlink or symlink(target_path=...) (which are now out of experimental in Bazel 7, I think) could be used to add one file to the runtime's set of files that just points back to the location of the runtime. I haven't had a chance to try those out, though, so I'm not sure about any downsides they have.
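For what it's worth, a minimal sketch of that idea, assuming Bazel 7 with non-experimental unresolved symlinks; the rule and attribute names here are made up for illustration and are not rules_python APIs:

```starlark
# Hypothetical rule: emit one unresolved symlink that points back at the
# runtime's location instead of symlinking every stdlib file individually.
# The target path has to be supplied by the user, since a rule can't
# reliably compute the correct absolute path itself.
def _runtime_symlink_impl(ctx):
    link = ctx.actions.declare_symlink(ctx.attr.link_name)
    ctx.actions.symlink(
        output = link,
        target_path = ctx.attr.runtime_path,  # arbitrary, possibly dangling path
    )
    return [DefaultInfo(files = depset([link]))]

runtime_symlink = rule(
    implementation = _runtime_symlink_impl,
    attrs = {
        "link_name": attr.string(mandatory = True),
        "runtime_path": attr.string(mandatory = True),
    },
)
```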
I really wish we had directories as a first-class concept so Bazel knew to materialize a single symlink while monitoring its contents recursively.
A few of us were experimenting with symlinks and declare_directory (TreeArtifacts). The results were not really promising.
For local strategy actions it works very well: TreeArtifacts create a single symlink to the directory; very cheap! Unfortunately, sandboxed actions don't do that; they copy the whole directory and/or symlink each individual file. We tried to trick Bazel into creating a symlink, but no luck.
The declare_symlink option has similar problems. It can create an arbitrary symlink, but a rule can't really know the correct absolute path to the directory.
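To make the experiment concrete, here is a rough sketch of the TreeArtifact approach, with hypothetical names; a real implementation would also have to preserve the stdlib's directory layout:

```starlark
# Hypothetical rule: copy a set of files into a single declared directory
# (TreeArtifact) so downstream actions see one directory entry. With the
# local strategy this materializes as a single symlink; sandboxed actions
# still copy or symlink each contained file, which is the problem above.
def _stdlib_tree_impl(ctx):
    out = ctx.actions.declare_directory(ctx.label.name)
    ctx.actions.run_shell(
        inputs = ctx.files.srcs,
        outputs = [out],
        # Note: this flattens the files into one directory; preserving the
        # original layout is left out to keep the sketch short.
        command = "cp {srcs} {out}/".format(
            srcs = " ".join([f.path for f in ctx.files.srcs]),
            out = out.path,
        ),
    )
    return [DefaultInfo(files = depset([out]))]

stdlib_tree = rule(
    implementation = _stdlib_tree_impl,
    attrs = {"srcs": attr.label_list(allow_files = True)},
)
```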
Some maintainers were talking, and zipping the stdlib is probably the best option available. That would greatly reduce the number of files. A few companies already do this. A PR implementing this would be welcome.
Digging around the files in the python install, I also think that maybe 25-50% of them aren't necessary for running a program. e.g., all the vendored pip dependencies and interpreter headers don't need to be in the runfiles to run a binary.
On the Bazel side, there is https://github.com/bazelbuild/bazel/issues/8230, which is roughly about the cost of creating large sets of files and is one of the better issues demonstrating that symlinking to a directory would be a big performance benefit; this issue also affects js/node.
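As a rough illustration of the zipping idea (not a statement about how rules_python would implement it), a genrule along these lines collapses the runtime into a single archive; the source label is hypothetical, and path handling inside the archive would need more care in practice:

```starlark
# Hypothetical target: pack the runtime's files into one zip so the runfiles
# tree carries a single file instead of thousands. CPython can import pure
# Python modules from a zip via zipimport, but wiring sys.path up correctly
# is the real work and is not shown here.
genrule(
    name = "stdlib_zip",
    srcs = ["@python_3_11//:files"],  # hypothetical label for the runtime files
    outs = ["python_stdlib.zip"],
    cmd = "zip -q -r $@ $(SRCS)",
)
```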
This issue has been automatically marked as stale because it has not had any activity for 180 days. It will be closed if no further activity occurs in 30 days. Collaborators can add an assignee to keep this open indefinitely. Thanks for your contributions to rules_python!
This is still relevant
Question: is using a locally installed platform runtime, instead of a remotely downloaded in-build runtime, an appealing option for this problem?
I ask because I've merged some of the basics for using a locally installed runtime (one equivalent to what our hermetic runtimes do) in PR #2000. But we're trying to figure out some of the use cases for it to better understand what public APIs it needs.
... an appealing option for this problem?
No, it is not. Technically it would solve the problem, at least on Linux (I am not working with other platforms). However, besides managing a polyglot workspace, Bazel's appeal and big selling point are, in my view, hermeticity and thus reproducibility of actions, which enables the awesome caching performance. A host interpreter toolchain, to my understanding, defeats hermeticity.
That much said, I am still very much interested in rules_python offering better support for host interpreter toolchains. There are some known quirks with the hermetic toolchain, and switching to the host interpreter is an easy workaround for those (e.g. during debugging). While it is easy enough to set the host interpreter itself, what is missing are the toolchain files defined here, e.g. python_headers. Because of this, in the project I work on we have our own host toolchains defined and a build setting to opt into using the host toolchain instead of the hermetic one.
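For illustration, a rough sketch of what such a hand-written host toolchain fragment can look like on an Ubuntu host; the repository name, file locations, and Python version are assumptions, not rules_python APIs:

```starlark
# WORKSPACE: expose the host's Python installation as an external repository
# so its headers can be referenced from a BUILD file.
new_local_repository(
    name = "host_python",
    path = "/usr",
    build_file = "//toolchains:host_python.BUILD",  # hypothetical location
)
```

```starlark
# toolchains/host_python.BUILD: a headers target playing the role of the
# hermetic toolchain's python_headers. Assumes Ubuntu with python3-dev
# installed; only the version number in the paths changes across versions.
cc_library(
    name = "python_headers",
    hdrs = glob(["include/python3.10/**/*.h"]),
    includes = ["include/python3.10"],
    visibility = ["//visibility:public"],
)
```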
what is missing are the toolchain definitions, headers ... we have our own host toolchains defined
Oh that's perfect. That's exactly the sort of case I had in mind. PR #2000 added the basics to do all that, though the APIs are approximately private. I created https://github.com/bazelbuild/rules_python/issues/2070 to collect user feedback.
I'd be particularly interested in whether you're doing anything to automatically locate e.g. headers and libraries to wire them into the toolchain. Inspecting the Python install to figure that out was quite a mystery, and I'm really uncertain what edge cases I missed.
We are not dynamically discovering headers and libraries. Currently we only support host toolchains for Linux (Ubuntu) and a subset of Python versions. For this use case it was easier to have a dedicated BUILD file per toolchain and let users select them explicitly. In my experience the BUILD files differ only in the version number in the paths to the toolchain files, not structurally. At least for Ubuntu hosts a generic solution should be trivial.
One caveat is that multiple Debian packages are involved in a full Python toolchain. Not everybody might have python3-dev installed. To make sure users don't experience hard-to-understand errors, it might help to search for some key file in a repository rule and provide a verbose error if it is missing.
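A minimal sketch of that check, assuming Ubuntu paths and a hypothetical rule name; the point is only to fail early with an actionable message:

```starlark
# Hypothetical repository rule: verify that a key file from python3-dev is
# present and fail with a verbose, actionable error if it is not.
def _host_python_check_impl(rctx):
    header = rctx.path("/usr/include/python3.10/Python.h")
    if not header.exists:
        fail(
            "Host Python development headers not found at {}. ".format(header) +
            "Install them (e.g. `sudo apt install python3-dev`) or keep using " +
            "the hermetic toolchain.",
        )
    rctx.file("BUILD.bazel", "")  # the rule only validates, nothing to generate

host_python_check = repository_rule(
    implementation = _host_python_check_impl,
)
```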
(back to OP issue)
I was told that the 7.2 release of Bazel contained some performance improvements to sandboxing, which should help with the overhead of creating files. There's also a flag that might make things faster: --experimental_inmemory_sandbox_stashes
I used my artificial benchmark to look into this issue again.
--experimental_inmemory_sandbox_stashes does not improve performance for me. Maybe the flag only improves performance in specific use cases which do not fit my setup. However, when comparing Bazel versions, one sees that Bazel has indeed become more efficient at setting up and executing sandboxed tests. For my test setup I saw the following numbers:
Bazel version | Elapsed time
---|---
6.5.0 | 54.3 s
7.0.0 | 48.0 s
7.2.1 | 36.4 s
8.0.0-pre.20240718.2 | 36.9 s
This is great, as it is a free improvement for everybody simply by using a recent Bazel version. Still, compared to using a host interpreter, the overhead is significant. I still believe Bazel should support an optimized concept for linking toolchain artifacts into the runfiles tree.
🐞 bug report
Affected Rule
py_binary and py_test
Is this a regression?
No
Description
Hermetic toolchains incur a significant overhead when executing Python targets due to the large number of files that are added to the runfiles tree. You can find an analysis of the issue here: https://github.com/martis42/show_hermetic_python_overhead
Bazel caching of course negates much of this issue in daily work. Still, whenever a test has to rerun, we pay this additional overhead.
I don't know if there is a "perfect" solution possible given how Bazel sandboxing works. However, it would be great if rules_python added a note to its documentation for hermetic toolchains recommending that --nolegacy_external_runfiles be added to the .bazelrc file. Maybe the rules_python maintainers can even convince the Bazel maintainers to prioritize flipping --legacy_external_runfiles in Bazel 8?
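For concreteness, the suggested .bazelrc addition would be a single line (whether it is still needed depends on the Bazel version in use):

```
# .bazelrc: avoid materializing the legacy external/ runfiles layout,
# which otherwise duplicates runfiles entries for external repositories.
build --nolegacy_external_runfiles
```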
🔬 Minimal Reproduction
You can reproduce this with this small example workspace: https://github.com/martis42/show_hermetic_python_overhead
🔥 Exception or Error
NA
🌍 Your Environment
Operating System:
Output of bazel version:
Rules_python version:
Anything else relevant?
https://github.com/bazelbuild/rules_python/issues/1624 might be related. It might be Windows-specific, however.