Open alanfalloon opened 3 years ago
Susan, this is at least related to your work on worker resource management.
@alanfalloon Can you say how you acquire the assets?
Bazel could delegate timeout handling to the worker like it does in other cases (e.g., linux-sandbox). That would give you full control over that. The linux-sandbox also has a mechanism to return process timing information to Bazel, which could also be used. What's more difficult is a mechanism to allow the worker to signal that the action is still waiting.
The multiplex worker API might be a better match than the standard one.
I agree that multiplex workers would be the place for a lot of this. Sending back a bit of flow control information with a WorkResponse would be easy; it would then need to be fed back to the WorkerPool to adjust the number of available workers. That's doable as long as we don't get other mechanisms trying to do similar things (e.g. a global CPU usage limit), at which point it gets messier. And this doesn't address which actions get scheduled when at a global level.
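A rough sketch of what that feedback loop might look like, in Python for brevity. Everything here is hypothetical: `capacity_hint` is not a field of the real WorkResponse message, and `Pool` is only a stand-in for the bookkeeping a worker pool would need, not Bazel's WorkerPool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """Stand-in for a worker response; capacity_hint is a hypothetical field."""
    request_id: int
    exit_code: int
    capacity_hint: Optional[int] = None  # how many requests the worker can take


class Pool:
    """Tracks in-flight requests and adjusts its limit from worker hints."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def can_dispatch(self) -> bool:
        return self.in_flight < self.max_in_flight

    def dispatch(self) -> None:
        if not self.can_dispatch():
            raise RuntimeError("no capacity available")
        self.in_flight += 1

    def on_response(self, response: Response) -> None:
        # One request finished, so free its slot first.
        self.in_flight -= 1
        # If the worker reported its capacity, adopt it (never below 1 so the
        # pool cannot stall permanently).
        if response.capacity_hint is not None:
            self.max_in_flight = max(1, response.capacity_hint)
```

The hint only arrives when a response does, so a worker whose capacity grows while it is idle has no way to say so in this scheme.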
@ulfjack In our case there is a service reachable on the network which centrally manages the pool of resources. So you make a request for a resource, and it responds with the information on which one is yours. In cases of contention we generally poll, and there is a specific call to release resources but also a heartbeat that will reclaim them if you don't keep refreshing your claim. In our case there is also a one-time-per-lease setup that has to happen for each resource once we get the lease before it can be used in the actions.
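To make the lease lifecycle concrete, here is a minimal in-memory sketch in Python. `FakeLeaseService`, its method names, and `acquire_with_polling` are all invented for illustration; the real service is reached over the network and its API is not described in this thread. The one-time-per-lease setup step is left out.

```python
class FakeLeaseService:
    """In-memory stand-in for the central resource service (hypothetical API)."""

    def __init__(self, resources, ttl):
        self.free = list(resources)
        self.ttl = ttl
        self.leases = {}  # resource -> (owner, expiry time)

    def _reclaim(self, now):
        # Leases whose owner stopped heartbeating go back to the free list.
        for res, (_owner, expiry) in list(self.leases.items()):
            if expiry <= now:
                del self.leases[res]
                self.free.append(res)

    def acquire(self, owner, now):
        self._reclaim(now)
        if not self.free:
            return None  # contention: caller is expected to poll
        res = self.free.pop(0)
        self.leases[res] = (owner, now + self.ttl)
        return res

    def heartbeat(self, owner, res, now):
        self._reclaim(now)
        lease = self.leases.get(res)
        if lease is None or lease[0] != owner:
            return False  # the claim was lost
        self.leases[res] = (owner, now + self.ttl)
        return True

    def release(self, owner, res):
        lease = self.leases.get(res)
        if lease is not None and lease[0] == owner:
            del self.leases[res]
            self.free.append(res)


def acquire_with_polling(service, owner, clock, sleep, poll_interval, deadline):
    """Poll until a resource is granted or the deadline passes."""
    while clock() < deadline:
        res = service.acquire(owner, clock())
        if res is not None:
            return res
        sleep(poll_interval)
    return None
```

The clock and sleep functions are passed in so the polling loop can be exercised without real waiting; a worker would pass `time.monotonic` and `time.sleep`.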
I agree that the multiplex workers make the most sense. We need to have a single process anyway because it makes it easier to centrally manage the resources. Also, we are managing multiple slightly different resources so having more than one action queued lets us make smarter decisions about which actions can be dispatched immediately and which should be parked for a new resource request (opportunistic batching).
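The dispatch decision described above could be sketched roughly as follows. This is an illustration only: `BatchingDispatcher` and its methods are hypothetical names, and resource "kinds" are plain strings here.

```python
from collections import defaultdict, deque


class BatchingDispatcher:
    """Runs actions right away when their resource kind is already leased,
    and parks the rest until a lease for that kind is ready."""

    def __init__(self):
        self.held = set()                  # kinds with an active, set-up lease
        self.requested = set()             # kinds with an outstanding request
        self.parked = defaultdict(deque)   # kind -> actions waiting for it

    def submit(self, action, kind):
        """Return the actions that can run now (empty if the action is parked)."""
        if kind in self.held:
            return [action]
        self.parked[kind].append(action)
        self.requested.add(kind)  # one request covers every parked action
        return []

    def on_lease_ready(self, kind):
        """Called once the lease is acquired and set up; releases the batch."""
        self.held.add(kind)
        self.requested.discard(kind)
        return list(self.parked.pop(kind, ()))
```

Parking by kind is what lets one acquisition and one setup be shared by every action that queued up while the request was pending.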
One option is to move timeout handling completely to the worker, like @ulfjack suggested. The other option is to update the protocol to allow multiple responses to a request: acknowledgement, started, and completed responses. That might work better with @larsrc-google's suggestion to add the flow-control messages, because then you get more opportunities to communicate your capacity. It does mean you need to start dealing with workers supporting different protocol versions, though, so it might not be worth the additional complexity.
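As a sketch of the second option, the caller could start the timeout clock only when the "started" response arrives, so time spent waiting for a resource does not count against the action. The `Phase` names and `RequestTracker` below are hypothetical and do not correspond to anything in the current worker protocol.

```python
from enum import Enum


class Phase(Enum):
    ACKNOWLEDGED = 1  # worker has the request and is waiting for a resource
    STARTED = 2       # resource acquired, action is running
    COMPLETED = 3     # final response


class RequestTracker:
    """Tracks one request; the timeout only covers the STARTED phase."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.phase = None
        self.started_at = None

    def on_message(self, phase, now):
        # Phases may be skipped but must never go backwards or repeat.
        if self.phase is not None and phase.value <= self.phase.value:
            raise ValueError(f"unexpected {phase.name} after {self.phase.name}")
        self.phase = phase
        if phase is Phase.STARTED:
            self.started_at = now

    def timed_out(self, now):
        if self.phase is not Phase.STARTED:
            return False  # still queued, or already finished
        return now - self.started_at > self.timeout
```

A worker that never sends "started" would wait forever under this scheme, so a separate and much longer queueing limit would probably still be needed.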
This is a great idea. But I'm afraid neither this, nor worker resource management, is going to be on my plate in the next couple of quarters. I'm going to unassign, and put it in the local execution component, so that it can be picked up by our triage process.
cc: @gregestren @juliexxia Some of this might be doable with execution groups.
@aiuto How do execution groups help in this case? IIUC execution groups allow you to configure different platforms and toolchains for actions within a rule, but I don't see how that helps me manage a shared resource.
They don't help with the timeout problems of license acquisition at all. I was thinking about the other problem lumped into this issue - queuing for specific hardware.
They might work well for actions that require specific hardware to run. Let's say you have a pool of 20 build machines but only 4 have a specific attached processor, like a TPU. We could use execution groups to define a need for the TPU, and then have the scheduler run the action only on executors providing that resource.
I see. Thanks for the explanation. That solution assumes that the resources can be mapped to executors, which isn't true in our case. The resource reservation system is not exclusively for Bazel use; the resources are needed for other workflows outside of Bazel as well. It is an entirely separate system that we want to integrate into Bazel.
However, we had briefly considered a solution using a gRPC proxy instead of multiplex workers, which would allow us to define virtual executors that acquire the resources before accepting the actions, and in that case I can see exec groups being helpful. We rejected that idea as being too much work, and kind of a hack. If multiplex workers can be made to support this case, they seem like a much better fit.
Thank you for contributing to the Bazel repository! This issue has been marked as stale since it has not had any activity in the last 2+ years. It will be closed in the next 14 days unless any other activity occurs or one of the following labels is added: "not stale", "awaiting-bazeler". Please reach out to the triage team (@bazelbuild/triage) if you think this issue is still relevant or you are interested in getting the issue resolved.
This issue has been automatically closed due to inactivity. If you're still interested in pursuing this, please reach out to the triage team (@bazelbuild/triage). Thanks!
I'm bringing this back from the dead.
It's still a relevant issue and the team is working on related problems. In our case we have RBE clusters with different types of resources (e.g. TPUs, physical Android devices, extra memory, ...). There is a need to be able to schedule build jobs to a cluster that has the right resources. Controlled access to keys for licensed compilers falls right into this category of issues.
This can be done by defining a custom platform with exec_properties, which populate the platform field in the Command message [1]. The RBE config for Bazel itself does this to schedule some build actions in a separate worker pool [2].
Is this not sufficient? Do you have something else in mind?
[1] https://cs.opensource.google/bazel/bazel/+/master:third_party/remoteapis/build/bazel/remote/execution/v2/remote_execution.proto;l=678;drc=d0cba5507fcb5d636b1a9a3b1f58cf63314781c0
[2] https://cs.opensource.google/bazel/bazel/+/master:BUILD;l=257;drc=b0fc11d8f386141d2c5efd39cbeed316d620888a
Description
Extend multiplex workers to handle the asset management use cases:
I am using "assets" to mean a limited shared resource like software licenses or specialized hardware.
Feature requests: what underlying problem are you trying to solve with this feature?
We have some actions that need access to limited shared resources such as license files or special hardware. In our case, acquiring the asset comes with considerable set-up cost which we wanted to amortize over multiple actions.
It seems like multiplex workers were the perfect tool for this. The idea was to write a worker that:
In our case we need to see the action before we can acquire the resource and set it up (we are actually managing multiple kinds of resources and separate queues for each), so we can't use a simpler strategy like acquiring the resource on worker start-up and not reading stdin until we have it.
However, we hit a couple of snags. The first is that we wanted to do this for test actions, so we are blocked by #7595. There are also a couple of limitations on the workers that make them not a great fit:
What operating system are you running Bazel on?
macOS and Linux
What's the output of bazel info release?
3.5.0
Have you found anything relevant by searching the web?
no