bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Actions fetched from remote cache / executed remotely should not count towards parallelism limit #6394

Open ob opened 5 years ago

ob commented 5 years ago

When a fairly large application is built using the remote cache and all the actions are fully cached, Bazel still keeps the parallelism set to the number of cores in the machine, even though most actions are just waiting on network I/O.

With the default settings on an 8-core machine, I get:

$ time bazel build //...
## output elided
INFO: Elapsed time: 97.567s, Critical Path: 26.44s
INFO: 2000 processes: 2000 remote cache hit.
INFO: Build completed successfully, 2240 total actions

real    1m38.518s
user    0m0.051s
sys 0m0.061s

But if I bump up the number of jobs to a crazy number:

$ time bazel build --jobs 2000 //...
## output elided
INFO: Elapsed time: 39.535s, Critical Path: 31.33s
INFO: 2000 processes: 2000 remote cache hit.
INFO: Build completed successfully, 2240 total actions

real    0m40.483s
user    0m0.048s
sys 0m0.058s

I think the --jobs option should only apply to local actions.
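A rough model of why raising --jobs helps so much here: with the current scheduler, each of the --jobs worker threads blocks for a full round trip per cache hit, so a fully cached build takes roughly (actions / jobs) waves of network latency. This is a back-of-envelope sketch, not Bazel code; the ~0.39 s per-hit latency is an assumed figure inferred from the timings above, and the model only explains the latency-bound portion of the build.

```python
def cached_build_wall_time(actions, jobs, round_trip_s):
    """Estimate wall time for a fully remote-cached build when every
    worker thread blocks on one cache round trip at a time (sketch,
    not Bazel internals)."""
    waves = -(-actions // jobs)  # ceiling division: waves of `jobs` actions
    return waves * round_trip_s

ROUND_TRIP_S = 0.39  # assumed per-hit latency, inferred from the report above

# 2000 cached actions, 8 jobs: ~250 waves of latency, ~97.5 s --
# close to the 97.567 s elapsed time reported above.
print(cached_build_wall_time(2000, 8, ROUND_TRIP_S))

# With --jobs=2000 the latency term collapses to a single wave;
# the remaining ~39 s in the report is other overhead the model ignores.
print(cached_build_wall_time(2000, 2000, ROUND_TRIP_S))
```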

buchgr commented 5 years ago

Thanks @ob! A colleague of mine has a prototype of this; however, I still expect it to be at least 3-6 months away from landing in a released Bazel version!

philwo commented 5 years ago

FWIW, we "fix" this in Google by actually running with --jobs=200 (or higher). Local actions are still limited by local resources, so this is fine in general.
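As a config sketch, the workaround described above might look like this in a .bazelrc (the 200 figure is the example value mentioned here, not a recommendation; tune it for your cache latency):

```
# .bazelrc sketch: raise --jobs well above the core count so threads
# blocked on remote cache I/O don't starve the build. Local actions
# remain bounded by local resource limits.
build --jobs=200
```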

buchgr commented 5 years ago

@philwo @ulfjack has a prototype patch to no longer be limited by --jobs for remote execution.

ittaiz commented 5 years ago

@philwo

Local actions are still limited by local resources, so this is fine in general.

Does that mean you run --jobs together with --local_resources?

buchgr commented 5 years ago

@ulfjack what's the status of your work in this area? is there a tracking bug on github?

ulfjack commented 5 years ago

Parts of the prototype have been submitted, but I haven't even sent out some critical parts. Making it work also requires rewriting the RemoteSpawnRunner to be async and use ListenableFuture, for which I do not currently have plans. I am not aware of a tracking bug on GitHub apart from this one.

buchgr commented 5 years ago

Making it work also requires rewriting the RemoteSpawnRunner to be async and use ListenableFuture

That I'd be happy to take over :-)

ulfjack commented 5 years ago

Happy for you to start working on the RSR in parallel to me landing the Skyframe changes that are also required.

buchgr commented 5 years ago

Broke this out into https://github.com/bazelbuild/bazel/issues/7182.

ulfjack commented 5 years ago

Commit 9beabe09bab5c371d353cca3c77c4e57de555ac0 is related.

ulfjack commented 5 years ago

Also related: 47d29eb99b6df063cecd791ddf197b0a6a78ea69, 57f2f586bde98adc519731a354884140aeac5437, and 14f8b109b9f987f1b0c69c8cf399326740749382 (rolled back as 68fc46b7ac2a015cbbd4e6602f2310a935783866 due to increased memory consumption).

Globegitter commented 5 years ago

While this seems to be a great idea in general, would there still be a separate way to limit these async operations?

ulfjack commented 5 years ago

What for?

Globegitter commented 5 years ago

We are currently running in a quite restricted CI environment with 4 GB of memory. That will hopefully change soon, but with the current setup we are running at the very edge of memory usage, leading to crashes and frequent flag tweaking. So I am worried that increasing the amount of concurrent work in CI will make this worse, even if we are just running further cache downloads in the background.

ulfjack commented 5 years ago

I assume you mean Bazel's memory consumption, not the remote execution system's. Let's first look at how it goes. It's not clear at this point that async execution will increase Bazel's memory consumption. There is reason to believe that the current code is not ideal with respect to memory consumption, with data being retained although it could be garbage collected.

Globegitter commented 5 years ago

Yes exactly, bazel's memory consumption. We currently do not use remote execution, we just use a remote cache. Looking forward to seeing this come in and test it out!

ulfjack commented 5 years ago

It looks like there is an increase in memory consumption with async execution. I also want to add that Bazel support is primarily blocked on #7182.

RNabel commented 4 years ago

@buchgr Could you comment on @ittaiz's question for future reference:

Means you run with jobs together with local_resources?

And if this is a workaround, could you describe how --local_resources interacts with --jobs? It is not clear what would happen if you set --jobs=200 and --local_cpu_resources=4 at the same time.

ulfjack commented 4 years ago

At this time, --jobs determines how many threads Bazel creates internally, and local_cpu_resources determines how many subprocesses Bazel is allowed to run concurrently. However, Bazel threads block on local & remote subprocesses. Therefore, if --jobs is less than --local_cpu_resources, then --local_cpu_resources is effectively ignored, and Bazel runs at most --jobs subprocesses.

For remote builds, however, --jobs determines how many remote processes can run in parallel, whereas --local_cpu_resources is ignored. That means if you use remote caching or remote execution, you must increase --jobs to get a speedup.

However, changes are afoot, although I suspect they won't be finished before the end of the year and might stretch into next year. Specifically, we're working on decoupling --jobs: the plan is for Bazel to manage both local and remote execution without blocking threads, so that --jobs no longer implicitly limits the number of local subprocesses (leaving that to --local_cpu_resources), and similarly for remote execution. That should remove the need to tweak --jobs when you use remote execution, improve scaling when you have a lot of remote executors, and still allow you to limit local CPU consumption.
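The semantics described above can be summarized as a small sketch (the function and its name are illustrative, not Bazel internals): local parallelism is the minimum of --jobs and --local_cpu_resources, while remote parallelism is capped by --jobs alone.

```python
def effective_parallelism(jobs, local_cpu_resources):
    """Sketch of the pre-async scheduling semantics described above.

    Bazel creates `jobs` threads, and each thread blocks on one
    subprocess, so local subprocesses are bounded by both limits,
    while remote processes ignore local_cpu_resources entirely.
    """
    return {
        "local": min(jobs, local_cpu_resources),
        "remote": jobs,
    }

# --jobs=200 with --local_cpu_resources=4: at most 4 concurrent local
# subprocesses, but up to 200 concurrent remote actions.
print(effective_parallelism(jobs=200, local_cpu_resources=4))

# --jobs=2 with --local_cpu_resources=4: --local_cpu_resources is
# effectively ignored; only 2 subprocesses of any kind can run.
print(effective_parallelism(jobs=2, local_cpu_resources=4))
```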

ianoc-stripe commented 4 years ago

@ulfjack should --local_cpu_resources affect build-runfiles? It doesn't seem to, unless I'm doing this incorrectly. When we have a lot of remote cache hits, the local machine currently gets overwhelmed unpacking runfiles if we have high --jobs parallelism.

ulfjack commented 4 years ago

I think it doesn’t right now. I can change that if it’s a problem.


ianoc-stripe commented 4 years ago

It does for us; I'm not sure whether others see it too. For tests and binaries across a repo, if you pull and get a lot of cache hits with --jobs set to, say, 1000, the machine more or less locks up unpacking runfiles. Having the runfiles step count toward local CPU usage would be a great way to avoid that. (Right now we tell folks to Ctrl+C and re-run with --jobs 10 when they see it, which isn't ideal.)

So I would love if you could change it, thank you

benjaminp commented 4 years ago

45a9bc2ca456e76b82bd6c479cacd6081d79e9f5 was a change to the resource requirements of runfiles trees, which allowed more parallelism. Probably this should be discussed in another issue, though.

github-actions[bot] commented 1 year ago

Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.

meisterT commented 1 year ago

We are working on this again, in a different way. The next step is to upgrade to a modern JDK, which will allow us to use Loom/virtual threads.

Ryang20718 commented 1 year ago

Reading through this thread.

If I'm using local execution with a remote cache and almost all tests are cached, then with --jobs set to 4 I see the following (currently on Bazel 6.0.0):

21s remote-cache, linux-sandbox ... (8 actions running)
29s remote-cache, linux-sandbox ... (8 actions running)

With --jobs set to 8, I see a maximum of 16 actions. This would mean jobs fetching from the remote cache are not counting towards the parallelism limit. Is there a way we can configure it to do so?