bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Using --local_extra_resources limits concurrency #18153

Open cameron-martin opened 1 year ago

cameron-martin commented 1 year ago

Description of the bug:

If some tests require extra resources (via --local_extra_resources) but others don't, the concurrency of the tests that do not require the extra resource is limited by tests that do require it being scheduled but not starting. Each scheduled-but-not-started test counts as a concurrently running job, yet sits idle while a job that does not require the resource could be running.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

A reproducer is available at the following repository: https://github.com/cameron-martin/bazel-extra-resources-scheduling-bug

Tests can be run like so:

bazel test //:all

Half of these tests do not require extra resources, so concurrency should not be limited until these complete. Instead, the number of concurrent jobs drops well below the maximum, since tests that require an unavailable resource are scheduled but cannot yet start.

Which operating system are you running Bazel on?

Ubuntu 22.04

What is the output of bazel info release?

release 6.1.2

If bazel info release returns development version or (@non-git), tell us how you built Bazel.

No response

What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

https://bazelbuild.slack.com/archives/CA31HN1T3/p1681922669112529

brentleyjones commented 8 months ago

Is this still an issue? And does https://github.com/bazelbuild/bazel/commit/0725711d76f8738be0f57cf210efaea0e0a32742 or https://github.com/bazelbuild/bazel/pull/20398 change anything?

cameron-martin commented 8 months ago

Looks like it still happens on both 7.0.0 and a build from that PR, unfortunately. I guess that 0725711d76f8738be0f57cf210efaea0e0a32742 means it now happens for memory and CPU if it didn't already!

zhengwei143 commented 8 months ago

> Is this still an issue? And does https://github.com/bazelbuild/bazel/commit/0725711d76f8738be0f57cf210efaea0e0a32742 or https://github.com/bazelbuild/bazel/pull/20398 change anything?

> Looks like it still happens on both 7.0.0 and a build from that PR, unfortunately. I guess that https://github.com/bazelbuild/bazel/commit/0725711d76f8738be0f57cf210efaea0e0a32742 means it now happens for memory and CPU if it didn't already!

The commit / PR mentioned doesn't change anything; it just consolidates the flags --local_{extra,ram,cpu}_resources into a single flag, --local_resources. Under the hood, the previous flags are all managed by the ResourceManager, so this can happen for memory / CPU too, just not as noticeably, likely because no one really keeps count.

wilwell commented 8 months ago

The issue is connected not with the resource foo but with the CPU resource. By default we use 1 CPU for every action, so in your example all 200 jobs are competing for 16 CPUs (or whatever your limit for the build is) and 100 jobs are competing for 1 foo resource. I ran an experiment and saw that a lot of tests were trying to acquire CPU but couldn't.

In summary, this is intended behaviour, caused by contention for CPU.

cameron-martin commented 8 months ago

If you run that example for a while, you'll see the number of concurrently-running jobs decreases below the number of cpus available, even though there are still actions available that do not depend on the resource foo.

Essentially the actions scheduled that depend on the resource foo block actions from running that don't depend on foo. Please re-open this, it is a real issue.

zhengwei143 commented 8 months ago

I can confirm that this does happen (and have also discussed with @wilwell). When an action execution thread attempts to execute an action that requires local resources, it blocks the thread and waits until they are available, so it is blocking other actions that could be running that don't require any resources. The bottleneck here becomes the number of --jobs specified.
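
The blocking behaviour described above can be illustrated with a small tick-based simulation. This is only a sketch under simplifying assumptions, not Bazel's actual scheduler: `simulate`, its parameters, and the FIFO queue are hypothetical, CPU limits are ignored, and every action takes a whole number of ticks. Each of the `jobs` threads takes the next queued action and, if that action needs `foo` and none is free, holds its thread while waiting.

```python
from collections import deque


def simulate(actions, jobs, foo_capacity):
    """Return the number of actions actually running at each tick.

    actions: list of (duration_in_ticks, needs_foo) tuples, in queue order.
    jobs: number of execution threads (a stand-in for --jobs).
    foo_capacity: units of the hypothetical extra resource "foo"
        (must be at least 1 if any action needs foo).
    """
    queue = deque(actions)
    threads = [None] * jobs  # None means the thread is idle
    foo_free = foo_capacity
    running_per_tick = []

    while queue or any(t is not None for t in threads):
        # Idle threads take the next queued action, in order.
        for i in range(jobs):
            if threads[i] is None and queue:
                duration, needs_foo = queue.popleft()
                threads[i] = {"remaining": duration,
                              "needs_foo": needs_foo,
                              "started": False}

        # An action starts only if its resource is free; otherwise it
        # keeps occupying its thread while blocked.
        for t in threads:
            if t is not None and not t["started"]:
                if not t["needs_foo"]:
                    t["started"] = True
                elif foo_free > 0:
                    foo_free -= 1
                    t["started"] = True

        # Advance every started action by one tick.
        running = 0
        for i, t in enumerate(threads):
            if t is not None and t["started"]:
                running += 1
                t["remaining"] -= 1
                if t["remaining"] == 0:
                    if t["needs_foo"]:
                        foo_free += 1
                    threads[i] = None
        running_per_tick.append(running)

    return running_per_tick


# Four foo-bound actions queued ahead of four unconstrained ones.
workload = [(2, True)] * 4 + [(2, False)] * 4
print(simulate(workload, jobs=4, foo_capacity=1))  # [1, 1, 2, 2, 3, 3, 2, 2]
print(simulate(workload, jobs=8, foo_capacity=1))  # [5, 5, 1, 1, 1, 1, 1, 1]
```

With 4 threads, only one action runs at first because the other three threads are blocked waiting for `foo`. With 8 threads the unconstrained actions start immediately, although in this toy workload the total duration is the same because the `foo` actions still run one at a time.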

The ideal solution would be to have the ResourceManager be smart enough to figure out when to block and when to instead return the thread to skyframe to execute another action (which requires some additional work to pipe through). However, we likely don't want to always return the thread in the absence of resources, as that incurs the cost of a skyframe restart, so the sweet spot is somewhere in the middle. Implementing the solution would likely involve a heuristic to decide which path to take.

That being said, this could be mitigated by increasing the number of --jobs used, and eventually through the use of virtual threads when that becomes available in Bazel (but that's a story for later).

How much does this impact performance of your builds (I assume the repro you gave is a more extreme example)? And does increasing --jobs help?

cameron-martin commented 8 months ago

The repro is a somewhat extreme example, but our build is bottlenecked on a small number of resources (small compared to CPU). How much this affects our build, I'm not sure, since it's hard to measure the case where this behaviour doesn't exist.

We have only one local resource in high contention, so I imagine it wouldn't have a huge impact, since you need to wait for the actions that require that resource to finish anyway. I can imagine this is a larger issue if you have multiple resources in high contention (e.g. foo and bar), since actions that are blocked waiting for foo would block those that could be using bar. However, we don't have that situation yet.

zhengwei143 commented 8 months ago

I think that increasing --jobs could potentially help if the limited concurrency is affecting the critical path of your build - this would reduce the ratio of blocked actions.

  1. You could also use https://github.com/bazelbuild/bazel-bench to benchmark your build against a higher --jobs and see if it actually makes a difference (I'd be interested to see if this actually causes a regression).
  2. Alternatively, collecting a json trace profile might be a simpler way to look at the critical path of the build and get hints on whether it affects build performance.
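
One way to get a rough concurrency figure out of a trace is to sweep over event start and end times. The sketch below assumes the profile uses the Chrome trace event format, with complete events (`"ph": "X"`) carrying `ts` and `dur` fields. The helper name `peak_concurrency` and the `category` filter are hypothetical, and the category that corresponds to action execution would need to be checked against a real profile, since nested events would otherwise be counted more than once.

```python
def peak_concurrency(events, category=None):
    """Return the largest number of overlapping complete events.

    events: list of dicts in Chrome trace event style.
    category: if given, only events whose "cat" equals it are counted.
        Which category to use for a real profile is an assumption to
        verify against the profile itself.
    """
    points = []
    for event in events:
        if event.get("ph") != "X" or "ts" not in event or "dur" not in event:
            continue
        if category is not None and event.get("cat") != category:
            continue
        points.append((event["ts"], 1))                   # event starts
        points.append((event["ts"] + event["dur"], -1))   # event ends

    # Tuples sort ends (-1) before starts (+1) at the same timestamp,
    # so back-to-back events are not counted as overlapping.
    points.sort()

    current = peak = 0
    for _, delta in points:
        current += delta
        peak = max(peak, current)
    return peak


# Synthetic events for illustration only; "demo" is a made-up category.
sample = [
    {"ph": "X", "ts": 0, "dur": 10, "cat": "demo"},
    {"ph": "X", "ts": 5, "dur": 10, "cat": "demo"},
    {"ph": "X", "ts": 10, "dur": 5, "cat": "demo"},
    {"ph": "M", "name": "thread_name"},
]
print(peak_concurrency(sample, category="demo"))  # 2
```

Comparing this figure between runs with different --jobs values gives a hint, though not proof, of whether blocked threads are limiting concurrency.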

While the issue is present, I don't think we have sufficient reason to justify implementing a new feature to combat this ATM, unless we see a significant impact on build performance in non-niche cases. This is especially so since async execution with virtual threads is on the horizon, which would probably mitigate this issue.

cameron-martin commented 8 months ago

Right, yes. I was thinking that increasing --jobs would cause more jobs to run than the number of CPUs, but I guess --local_cpu_resources will limit that still. Sounds like a reasonable workaround for now.

cameron-martin commented 8 months ago

Actually is that true? I seem to remember that the number of concurrent jobs (beyond the number of CPUs) can be increased solely by increasing --jobs. Do jobs by default not set cpus:1? I'll have to test this out tomorrow.

zhengwei143 commented 8 months ago

> Do jobs by default not set cpus:1?

Perhaps you were thinking about this?

IIUC, --local_cpu_resources only limits local actions that acquire CPU resources, and restricts concurrency of actions based on your HOST_CPUS (unless you've explicitly specified a different --local_cpu_resources). If you have a lot of local actions, increasing --jobs will probably increase concurrency up until the CPU resource itself becomes the next bottleneck, which is what you mentioned.

If the other actions are remote, they aren't limited / blocked because only the local action execution code paths call ResourceManager#acquireResources.

--jobs just specifies the number of threads used by Blaze to execute concurrent actions. Whether or not each thread acquires resources from the ResourceManager depends on how the action is run (local / remote).
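
The relationship between the thread count and the CPU resource can be sketched with a thread pool and a semaphore. This is a loose analogy rather than Bazel's implementation: `peak_running` and its parameters are hypothetical, the pool size stands in for --jobs, the semaphore stands in for the CPU resource that local actions acquire, and every action is assumed to be local and to need 1 CPU.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_running(jobs, cpus, actions):
    """Return the peak number of actions holding a CPU at the same time."""
    cpu = threading.Semaphore(cpus)  # stand-in for the CPU resource
    lock = threading.Lock()
    running = 0
    peak = 0

    def local_action():
        nonlocal running, peak
        # The worker thread blocks here until a CPU unit is free.
        with cpu:
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.01)  # simulated work
            with lock:
                running -= 1

    # The pool size caps how many actions can be in flight at once.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        for _ in range(actions):
            pool.submit(local_action)
    return peak


# One thread: never more than one action runs, however many CPUs exist.
print(peak_running(jobs=1, cpus=4, actions=5))
# Many threads: the CPU semaphore becomes the limit instead.
print(peak_running(jobs=8, cpus=2, actions=16))
```

In this model the peak can never exceed the smaller of `jobs` and `cpus`, which matches the description above: raising the thread count helps only until the CPU resource becomes the bottleneck.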

cameron-martin commented 8 months ago

Makes sense, thanks!