Symlinked sandbox is slow

lberki commented 2 years ago

Description of the bug:

The symlinked sandbox is slow when there is a large number of input files (I have seen reports of actions with up to 300K input files).

There are a number of ways one could improve this:

1. Creating the input directories and symlinks on multiple threads (SandboxHelpers currently does this on one thread)
2. Traversing the Java -> C++ boundary less frequently
3. Using one symlink per large tree artifact instead of symlinking each file in it separately
4. Using io_uring on Linux for more efficient data transfer to the kernel
5. Keeping the file system created for an action around and re-using it if the same action (or a similar one) is executed again
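
As a rough illustration of the first idea in the list above (creating directories and symlinks on multiple threads), here is a minimal sketch in plain Java. It is not Bazel's SandboxHelpers code; the class name `ParallelSymlinks`, the method `createSymlinks`, and the link-to-target map are all hypothetical, and the sketch assumes every link path has a parent directory. Whether this actually helps would need measuring, since symlink creation may be limited by the file system rather than the CPU.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Bazel.
class ParallelSymlinks {
  /** Creates every link -> target pair in the map; the links are created in parallel. */
  static void createSymlinks(Map<Path, Path> linkToTarget) {
    // Create each distinct parent directory once, before any link is made.
    // Assumption: every link path has a non-null parent.
    linkToTarget.keySet().stream()
        .map(Path::getParent)
        .distinct()
        .forEach(dir -> {
          try {
            Files.createDirectories(dir);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
    // Each symlink is independent, so the work can fan out over the common fork-join pool.
    linkToTarget.entrySet().parallelStream().forEach(entry -> {
      try {
        Files.createSymbolicLink(entry.getKey(), entry.getValue());
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    });
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("sandbox-demo");
    Path inputs = Files.createDirectories(root.resolve("inputs"));
    Path sandbox = root.resolve("sandbox");
    Map<Path, Path> links = new LinkedHashMap<>();
    for (int i = 0; i < 1000; i++) {
      Path target = Files.createFile(inputs.resolve("f" + i));
      links.put(sandbox.resolve("d" + (i % 10)).resolve("f" + i), target);
    }
    createSymlinks(links);
    System.out.println("Created " + links.size() + " symlinks under " + sandbox);
  }
}
```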

matthewjh commented 2 years ago

Yes, please. It is known to be particularly slow on macOS:

https://github.com/bazelbuild/bazel/issues/8230

In my (albeit limited) experience, the biggest barrier to integrating Bazel into local developer workflows (where quick feedback is paramount) is the sandboxing time, which in many cases far outstrips action execution time.

brentleyjones commented 2 years ago

> Using one symlink per large tree artifact instead of symlinking each file in it separately

This one would really help for Apple platform builds where they produce bundles of bundles.
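
To make the difference concrete, the sketch below contrasts the two strategies using only `java.nio.file`. It is an illustration, not Bazel's implementation: `TreeSymlinkDemo`, `linkEachFile`, and `linkWholeTree` are hypothetical names. The per-file variant creates one symlink per regular file, while the whole-tree variant creates exactly one, regardless of how many files the tree contains. One trade-off to keep in mind is that a directory symlink exposes the real directory, so anything written through it lands in the original tree.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class, not part of Bazel.
class TreeSymlinkDemo {
  /** Per-file strategy: one symlink per regular file under treeRoot. Returns the number of links created. */
  static int linkEachFile(Path treeRoot, Path sandboxDir) throws IOException {
    List<Path> regularFiles;
    try (Stream<Path> walk = Files.walk(treeRoot)) {
      regularFiles = walk.filter(Files::isRegularFile).collect(Collectors.toList());
    }
    for (Path file : regularFiles) {
      Path link = sandboxDir.resolve(treeRoot.relativize(file).toString());
      Files.createDirectories(link.getParent());
      Files.createSymbolicLink(link, file);
    }
    return regularFiles.size();
  }

  /** Whole-tree strategy: a single symlink that points at the tree's root directory. */
  static int linkWholeTree(Path treeRoot, Path sandboxDir) throws IOException {
    Files.createDirectories(sandboxDir.getParent());
    Files.createSymbolicLink(sandboxDir, treeRoot);
    return 1;
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("tree-demo");
    Path tree = Files.createDirectories(root.resolve("tree/a/b"));
    Files.createFile(tree.resolve("x"));
    Files.createFile(tree.resolve("y"));
    Path treeRoot = root.resolve("tree");
    System.out.println("per-file links: " + linkEachFile(treeRoot, root.resolve("sandbox1/tree")));
    System.out.println("whole-tree links: " + linkWholeTree(treeRoot, root.resolve("sandbox2/tree")));
  }
}
```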

meisterT commented 2 years ago

cc @larsrc-google

@lberki There is `--experimental_reuse_sandbox_directories`. Have you tried that?

larsrc-google commented 2 years ago

`--experimental_reuse_sandbox_directories` is essentially your point #5, and it helped a lot. #3 (tree artifacts) does sound reasonable. For the others, I'd like to have some measurements first, to see how much we can actually save.

lberki commented 2 years ago

Learned today: https://github.com/ikorennoy/jasyncfio, which is io_uring in Java. (I'm not sure if it's useful, and it'd be an extra dependency, but I don't want this nugget of data to get lost.)

larsrc-google commented 2 years ago

We'll need some reproducible examples. I tried compiling Bazel itself with and without sandbox in various worker/non-worker configurations, and the difference was minimal.

lberki commented 2 years ago

Did you try synthetic loads? That would be a much easier avenue than testing on Chrome OS / Kleaf builds (AFAIU they have 300K/80K input files, but I don't know how many and how big TreeArtifacts there are in the former)
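
A synthetic load of the kind suggested here could be as simple as the sketch below, which times plain `Files.createSymbolicLink` calls for a configurable number of empty input files. It is only a rough stand-in: it does not go through Bazel's sandbox code at all, the names `SyntheticSandboxLoad` and `timeSymlinkCreationNanos` are hypothetical, and a single unwarmed run like this gives an order-of-magnitude figure rather than a proper benchmark.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical micro-benchmark, independent of Bazel.
class SyntheticSandboxLoad {
  /** Creates fileCount empty inputs (untimed), then returns the nanoseconds spent symlinking them. */
  static long timeSymlinkCreationNanos(Path inputDir, Path sandboxDir, int fileCount)
      throws IOException {
    Files.createDirectories(inputDir);
    Files.createDirectories(sandboxDir);
    for (int i = 0; i < fileCount; i++) {
      Files.createFile(inputDir.resolve("input" + i));
    }
    long start = System.nanoTime();
    for (int i = 0; i < fileCount; i++) {
      Files.createSymbolicLink(sandboxDir.resolve("input" + i), inputDir.resolve("input" + i));
    }
    return System.nanoTime() - start;
  }

  public static void main(String[] args) throws IOException {
    // Default to 10,000 files; pass a larger count (e.g. 300000) as the first argument.
    int fileCount = args.length > 0 ? Integer.parseInt(args[0]) : 10_000;
    Path root = Files.createTempDirectory("synthetic-sandbox");
    long nanos =
        timeSymlinkCreationNanos(root.resolve("inputs"), root.resolve("sandbox"), fileCount);
    System.out.println(fileCount + " symlinks took " + (nanos / 1_000_000) + " ms");
  }
}
```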

meisterT commented 2 years ago

@lberki Can you share the build that actually triggers the slowness? From there we can work towards a more minimal repro.

lberki commented 2 years ago

Plussed @larsrc-google into the pertinent threads (unfortunately, they are Google internal communications even though they are about the interaction between two Google open source projects...)

jacky8hyf commented 1 year ago

Kleaf uses sandbox builds by default (though we also encourage developers to disable the costly sandboxes for local development). This feature will greatly improve the build time for Kleaf.

I can provide some metrics for the time spent on sandbox creation for Kleaf builds on build bots (ci.android.com) upon request (the data is public but the dashboard is internal only).

lberki commented 1 year ago

Ack. Numbers would be really helpful to aid in our prioritization decisions.

metti commented 9 months ago

I am currently looking into potential improvements for the SymlinkedSandbox. In particular, I am exploring pushing more work, batched together, to JNI, and using io_uring for I/O.

larsrc-google commented 2 months ago

I looked a bunch at io_uring, but dropped it again when I heard it had several bugs, including security-critical ones. Doing a batch API that can be implemented with JNI or io_uring or Loom threads would be good, though.
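
One possible shape for such a batch API is sketched below. Everything in it is hypothetical (`BatchSandboxOps`, `SymlinkRequest`, `SymlinkBatcher`, `SequentialBatcher`); none of these names exist in Bazel. The point is only that callers hand over a whole list of requests in one call, so a backend could cross the JNI boundary once per batch, submit to io_uring, or fan out over threads without the callers changing. Only a sequential baseline using `java.nio.file` is implemented here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical API sketch, not part of Bazel.
class BatchSandboxOps {
  /** One requested symlink: create link so that it points at target. */
  static final class SymlinkRequest {
    final Path link;
    final Path target;

    SymlinkRequest(Path link, Path target) {
      this.link = link;
      this.target = target;
    }
  }

  /** Batch interface: the whole list is passed in one call so a backend can amortize per-call costs. */
  interface SymlinkBatcher {
    void createAll(List<SymlinkRequest> batch) throws IOException;
  }

  /** Baseline backend: handles the batch one request at a time on the calling thread. */
  static final class SequentialBatcher implements SymlinkBatcher {
    @Override
    public void createAll(List<SymlinkRequest> batch) throws IOException {
      for (SymlinkRequest request : batch) {
        Files.createDirectories(request.link.getParent());
        Files.createSymbolicLink(request.link, request.target);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path root = Files.createTempDirectory("batch-demo");
    List<SymlinkRequest> batch = new ArrayList<>();
    for (int i = 0; i < 100; i++) {
      Path target = Files.createFile(root.resolve("input" + i));
      batch.add(new SymlinkRequest(root.resolve("sandbox").resolve("input" + i), target));
    }
    SymlinkBatcher batcher = new SequentialBatcher();
    batcher.createAll(batch);
    System.out.println("Submitted " + batch.size() + " requests in one call");
  }
}
```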

ismell commented 2 months ago

> Using one symlink per large tree artifact instead of symlinking each file in it separately

This one would have made the largest impact for ChromeOS.

lberki commented 2 months ago

😢
