cameron-martin opened 1 year ago
Is this still an issue? And does https://github.com/bazelbuild/bazel/commit/0725711d76f8738be0f57cf210efaea0e0a32742 or https://github.com/bazelbuild/bazel/pull/20398 change anything?
Looks like it still happens on both 7.0.0 and a build from that PR, unfortunately. I guess that 0725711d76f8738be0f57cf210efaea0e0a32742 means it now happens for memory and CPU if it didn't already!
The commit / PR mentioned doesn't change anything; it just consolidates the flags `--local_{extra,ram,cpu}_resources` into a single flag `--local_resources`. Under the hood, the previous flags are all managed by the `ResourceManager`, so this can happen for memory / CPU as well, just not as pronounced, likely because no one really keeps count.
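For concreteness, here is roughly how the old flags would map onto the consolidated one, as I understand the linked commit (the values are illustrative and the exact syntax should be checked against the commit itself):

```
# Old-style flags (pre-consolidation):
bazel test //... \
  --local_cpu_resources=16 \
  --local_ram_resources=8192 \
  --local_extra_resources=foo=1

# Consolidated flag from the linked commit (assumed syntax, illustrative values):
bazel test //... \
  --local_resources=cpu=16 \
  --local_resources=memory=8192 \
  --local_resources=foo=1
```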
The issue is connected not with the resource `foo` but with the CPU resource. By default we use 1 CPU for every action, so in your example all 200 jobs are contending for 16 CPUs (or whatever your limit is on the build) and 100 jobs are contending for 1 `foo` resource.
I ran an experiment and saw that there are a lot of tests which are trying to get a CPU but can't.
To summarize, I want to say that this is intended behaviour because of contention on CPU.
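To make those numbers concrete, the situation described above roughly corresponds to a setup like the following (the flag values are assumptions based on this comment, not taken from the reproducer):

```
# Illustrative only: 200 execution threads, 16 local CPUs, and a single unit of
# the custom resource "foo". Tests that need "foo" are assumed to request it via
# a tag such as resources:foo:1; every action also implicitly requests 1 CPU.
bazel test //... \
  --jobs=200 \
  --local_cpu_resources=16 \
  --local_extra_resources=foo=1
```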
If you run that example for a while, you'll see that the number of concurrently running jobs decreases below the number of CPUs available, even though there are still actions available that do not depend on the resource `foo`.
Essentially, scheduled actions that depend on the resource `foo` block actions that don't depend on `foo` from running. Please re-open this, it is a real issue.
I can confirm that this does happen (and have also discussed it with @wilwell). When an action execution thread attempts to execute an action that requires local resources, it blocks the thread and waits until they are available, so it blocks other actions that don't require any resources from running. The bottleneck here becomes the number of `--jobs` specified.
The ideal solution would be to have the `ResourceManager` be smart enough to figure out when to block and when not to, and return the thread to Skyframe to execute another action instead (which requires some additional work to pipe through). However, we likely don't want to always return the thread in the absence of resources, as that incurs the cost of a Skyframe restart - so the sweet spot is somewhere in the middle. Implementation of the solution would likely be along the lines of a heuristic analysis to decide which path to take.
That being said, this could be mitigated by increasing the number of `--jobs` used, and eventually through the use of virtual threads when that becomes available in Bazel (but that's a story for later).
How much does this impact the performance of your builds (I assume the repro you gave is a more extreme example)? And does increasing `--jobs` help?
The repro is a somewhat extreme example, but our build is bottlenecked around a comparatively small (compared to CPU) number of resources. How much this affects our build I'm not sure, since it's hard to measure the case where this behaviour doesn't exist.
We only have one local resource in high contention, so I imagine it wouldn't have a huge impact, since you need to wait for the actions that require that resource to finish anyway. I can imagine this is a larger issue if you have multiple resources in high contention (e.g. `foo` and `bar`), since actions that are blocked waiting for `foo` would block those that could be using `bar`. However, we don't have that situation yet.
I think that increasing `--jobs` could potentially help if the limited concurrency is affecting the critical path of your build - this would reduce the ratio of blocked actions. I'd suggest trying a higher `--jobs` value and seeing if it actually makes a difference (I'd be interested to see if this actually causes a regression).
While the issue is present, I don't think we have sufficient reasons to justify the implementation of a new feature to combat this ATM unless we see significant impacts on build performance in non-niche cases. This is especially since async execution with virtual threads is on the horizon, which would probably mitigate this issue.
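As a concrete (hypothetical) version of that experiment, one could raise `--jobs` well above the CPU count and compare wall time and observed concurrency between runs:

```
# Hypothetical experiment: same build, progressively larger thread pools.
bazel test //... --local_extra_resources=foo=1 --jobs=64
bazel test //... --local_extra_resources=foo=1 --jobs=256
```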
Right, yes. I was thinking that increasing `--jobs` would cause more jobs to run than the number of CPUs, but I guess `--local_cpu_resources` will still limit that. Sounds like a reasonable workaround for now.
Actually, is that true? I seem to remember that the number of concurrent jobs (beyond the number of CPUs) can be increased solely by increasing `--jobs`. Do jobs by default not set `cpus:1`? I'll have to test this out tomorrow.
> Do jobs by default not set `cpus:1`?

Perhaps you were thinking about this?
IIUC, `--local_cpu_resources` only limits local actions that acquire CPU resources, and restricts their concurrency based on your HOST_CPUS (unless you've explicitly specified a different `--local_cpu_resources`). If you have a lot of local actions, increasing `--jobs` will probably increase concurrency up until the CPU resource itself becomes the next bottleneck - which is what you mentioned.
If the other actions are remote, they aren't limited / blocked, because only the local action execution code paths call `ResourceManager#acquireResources`.
`--jobs` just specifies the number of threads used by Blaze to execute concurrent actions; whether or not each thread acquires resources from the `ResourceManager` depends on how the action is run (local / remote).
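As a sketch of how those flags relate (the remote endpoint below is hypothetical, and the stated defaults are my understanding rather than anything confirmed in this thread):

```
# --jobs: number of action execution threads (not a resource by itself).
# --local_cpu_resources: caps only local actions; defaults to HOST_CPUS.
# Remote actions (via --remote_executor) skip local resource acquisition.
bazel test //... \
  --jobs=200 \
  --local_cpu_resources=HOST_CPUS \
  --remote_executor=grpc://remote.example.com:8980   # hypothetical endpoint
```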
Makes sense, thanks!
Description of the bug:
If some tests require extra resources (via `--local_extra_resources`) but others don't, the concurrency of tests that do not require the extra resource is limited by tests that do require the extra resources being scheduled but not starting. Tests that are scheduled but not started count as concurrently running jobs, but sit there doing nothing while a job that does not require the resource could be running.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
A reproducer is available at the following repository: https://github.com/cameron-martin/bazel-extra-resources-scheduling-bug
Tests can be run like so:
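(The exact invocation is in the repository's README; the command below is an assumed approximation.)

```
# Assumed approximation of the reproducer's command; see the repo's README for
# the exact flags. "foo" is the custom resource that half of the tests request.
bazel test //... --local_extra_resources=foo=1 --jobs=200
```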
Half of these tests do not require extra resources, so concurrency should not be limited until these complete. Instead, the number of concurrent jobs drops to way below the maximum since tests that require an unavailable resource are scheduled but cannot yet start.
Which operating system are you running Bazel on?
Ubuntu 22.04
What is the output of `bazel info release`?

release 6.1.2
If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response
What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD`?

No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
https://bazelbuild.slack.com/archives/CA31HN1T3/p1681922669112529