@ob opened this issue 6 years ago.
Thanks @ob! A colleague of mine has a prototype of this; however, I still expect it to be at least 3-6 months away from landing in a released Bazel version!
FWIW, we "fix" this in Google by actually running with --jobs=200 (or higher). Local actions are still limited by local resources, so this is fine in general.
@philwo @ulfjack has a prototype patch to no longer be limited by --jobs for remote execution.
@philwo

> Local actions are still limited by local resources, so this is fine in general.

Means you run with --jobs together with --local_resources?
@ulfjack what's the status of your work in this area? Is there a tracking bug on GitHub?
Parts of the prototype have been submitted, but I haven't even sent out some critical parts. Making it work also requires rewriting the RemoteSpawnRunner to be async and use ListenableFuture, for which I do not currently have plans. I am not aware of a tracking bug on GitHub apart from this one.
> Making it work also requires rewriting the RemoteSpawnRunner to be async and use ListenableFuture

That I'd be happy to take over :-)
Happy for you to start working on the RSR in parallel to me landing the Skyframe changes that are also required.
Broke this out into https://github.com/bazelbuild/bazel/issues/7182.
Commit 9beabe09bab5c371d353cca3c77c4e57de555ac0 is related.
Also related: 47d29eb99b6df063cecd791ddf197b0a6a78ea69, 57f2f586bde98adc519731a354884140aeac5437, and 14f8b109b9f987f1b0c69c8cf399326740749382 (rolled back as 68fc46b7ac2a015cbbd4e6602f2310a935783866 due to increased memory consumption).
While this seems to be a great idea in general, would there still be a separate way to limit these async operations?
What for?
We are currently running in a quite restricted CI environment with 4GB of memory, which is hopefully going to change soon, but with that setup we are already running at the very edge of memory usage, leading to crashes and frequent flag tweaking. I am therefore worried that increasing the amount of work going on in CI will make this worse, even if it is just further cache downloads running in the background.
I assume you mean Bazel's memory consumption, not the remote execution system's. Let's first see how it goes; it's not clear at this point that async execution will increase Bazel's memory consumption. There is reason to believe that the current code is not ideal with respect to memory consumption, with data being retained although it could be garbage collected.
Yes, exactly: Bazel's memory consumption. We currently do not use remote execution, just a remote cache. Looking forward to seeing this land and testing it out!
It looks like there is an increase in memory consumption with async execution. I also want to add that Bazel support is primarily blocked on #7182.
@buchgr Could you comment on @ittaiz's question for future reference:

> Means you run with --jobs together with --local_resources?

And if this is a workaround, could you describe how --local_resources interacts with --jobs? It is not clear what would happen if you set --jobs=200 and --local_cpu_resources=4 at the same time.
At this time, --jobs determines how many threads Bazel creates internally, and --local_cpu_resources determines how many subprocesses Bazel is allowed to run concurrently. However, Bazel threads block on local and remote subprocesses. Therefore, if --jobs is less than --local_cpu_resources, then --local_cpu_resources is effectively ignored and Bazel runs at most --jobs subprocesses.
For remote builds, however, --jobs determines how many remote processes can run in parallel, whereas --local_cpu_resources is ignored. That means if you use remote caching or remote execution, you must increase --jobs to get a speedup.
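Concretely, under the semantics just described, the combination asked about earlier behaves like this (values taken from the question above; illustrative only):

```sh
# Up to 200 actions may be in flight at once (useful when most of them are
# waiting on the remote cache or remote executors), while at most 4 local
# subprocesses run at any one time.
bazel build //... --jobs=200 --local_cpu_resources=4
```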
However, changes are afoot, although I suspect they won't be finished before the end of the year and might stretch into next year. Specifically, we're working on decoupling --jobs: the plan is for Bazel to manage both local and remote execution without blocking threads, so that --jobs no longer implicitly limits the number of local subprocesses; --local_cpu_resources takes over that role, and similarly for remote execution. That should remove the need to tweak --jobs when you use remote execution, improve scaling when you have a lot of remote executors, and still let you limit Bazel's local CPU consumption.
@ulfjack should --local_cpu_resources affect build-runfiles? It doesn't seem to, unless I'm doing this incorrectly. When we have a lot of remote cache hits, the local machine currently gets overwhelmed unpacking runfiles if we have high --jobs parallelism.
I think it doesn’t right now. I can change that if it’s a problem.
It does for us; I'm not sure whether others don't see it, but it happens for tests and binaries across a repo. If you pull and get a lot of cache hits with jobs set to, say, 1000, the machine more or less locks up unpacking runfiles. So having the runfiles step counted towards local CPU usage would be great as a means to avoid that. (For now we tell folks to ctrl+c and re-run with --jobs 10 when they see it, which isn't ideal.)
So I would love it if you could change it. Thank you!
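For reference, the stopgap described in the comment above amounts to re-running with a much lower jobs count, e.g.:

```sh
# Heavy-handed workaround: lowering overall parallelism also throttles
# runfiles unpacking, at the cost of slowing everything else down.
bazel test //... --jobs=10
```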
45a9bc2ca456e76b82bd6c479cacd6081d79e9f5 was a change to the resource requirements of runfiles trees, which allowed more parallelism. Probably this should be discussed in another issue, though.
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 1+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.
We are working on this again, in a different way. The next step is to upgrade to a modern JDK, which will allow us to use Loom/virtual threads.
Reading through this thread: if I'm using local execution with a remote cache and almost all tests are cached, then with --jobs set to 4 I see the following (currently on Bazel 6.0.0):
21s remote-cache, linux-sandbox ... (8 actions running)
29s remote-cache, linux-sandbox ... (8 actions running)
With --jobs set to 8, I see a maximum of 16 actions. This would mean that jobs fetching from the remote cache do not count towards parallelism. Is there a way we can configure it to do so?
When a fairly large application is built using the remote cache and all the actions are fully cached, Bazel still keeps the parallelism at the number of cores in the machine, even though most actions are just waiting on network I/O.
With the default settings on an 8-core machine, I get:

[screenshot of Bazel progress output]

But if I bump up the number of jobs to a crazy number:

[screenshot of Bazel progress output]
I think the --jobs option should only apply to local actions.