coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

temporarily putting a node on hold #1026

Closed: sigma closed this issue 9 years ago

sigma commented 9 years ago

In its current form, the fleet scheduler only takes into account the number of already running units when identifying the least-loaded node in the cluster.

If I understand correctly, the goal is to keep fleet as simple as possible, while allowing higher-level schedulers to sit on top of it. In that sense, one possibly missing feature is the ability to define whether a node is in a position to accept a job at a given instant. The constraints we currently have access to are more of the spatial kind: they only express where a job could run. The consequence is that if a node can run "something", then it can run all the instances of that "something", which can lead to excessive resource contention.

Use-case: I have a cluster subset that's basically a worker pool. I have N jobs of the same kind to schedule on that pool, and don't need to run them all at the same time (only as many as possible, as per resource limitations). Actual resource consumption can also vary a lot depending on the job instance. Right now I'm unable to tell a node (or even the cluster btw) to stop accepting jobs whenever some resource threshold is met: AFAICT the best I could do is to limit the parallelism to 1 job/node (using Conflicts), which is not fine-grained enough.
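For reference, the Conflicts workaround mentioned above would look roughly like the sketch below, using a hypothetical `worker@.service` template (the unit name and ExecStart are assumptions; the `[X-Fleet]` `Conflicts` option is fleet's existing mechanism):

```ini
# worker@.service -- hypothetical template unit for the worker pool
[Unit]
Description=Worker instance %i

[Service]
ExecStart=/usr/bin/worker --instance %i

[X-Fleet]
# Never co-locate two worker instances: each machine runs at most one,
# which caps parallelism at 1 job/node but cannot express finer limits.
Conflicts=worker@*.service
```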

I tried to find the minimal addition to the API that would allow one to implement resource-aware placement strategies, and decided to go with some kind of locking mechanism: each node can independently be put on hold, temporarily, in such a way that whatever is already running keeps running, but no additional load is accepted. This way, one can either wait for some capacity to be freed (and for the lock to be removed), or add more nodes (hence capacity), whereas in the current situation adding capacity is the only option.
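To make the proposal concrete, here is a purely hypothetical sketch of how such a hold could be toggled if it were stored as a per-machine key in etcd; the key path and value are made up for illustration and are not part of fleet's actual schema:

```sh
# Put a machine on hold: units already running stay up, but the engine
# would skip this machine when placing new units (hypothetical key layout).
etcdctl set /fleet-hold/<machine-id> true

# Release the hold once some capacity has been freed or more nodes added.
etcdctl rm /fleet-hold/<machine-id>
```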

https://github.com/sigma/fleet/commit/d74fe2baa297991f5ae1dd26c2511ca536b05ead provides a very primitive and ugly way of doing that. But essentially, the distinction that I think makes sense is between constraints that express where a job could ever run (the spatial constraints we have today) and state that expresses whether a node is willing to accept additional work right now.

Any feedback/idea welcome :) (I definitely don't expect anybody to merge the above, I'm just looking for the right enabler for that kind of use-case.)

bcwaldon commented 9 years ago

While I agree there are use-cases where being able to place a node on hold would be useful (e.g. taking a node down for maintenance), I do not believe the use-case you've outlined qualifies.

Your use-case simply requires that you define some heuristic for measuring how much work can be done on a given node. I don't see how this differs from the traditional resource-based scheduling approach. In your case, you don't need X amount of memory, you want to fill one of N slots. Maybe it would help to elevate the concept into the API to make the UX feel right, but I don't imagine this needs to be a completely separate mechanism.

sigma commented 9 years ago

@bcwaldon I'm not sure I follow. Even if that problem could be reduced to a slot problem (which is not really the case, since, as I mentioned, the resources needed for an individual job vary a lot from one instance to another, so the number of "slots" cannot be static), fleet would still not help, since I cannot express that one node can have 5 slots and another 7.

And yes, definitely it's not different from resource-based scheduling. Actually my whole purpose is precisely to implement that kind of scheduling on top of fleet since it doesn't provide that in its current state.
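As a rough illustration of what scheduling on top of fleet could look like with today's primitives, an external scheduler could track per-machine load itself and pin each unit to the machine it picks; the `worker@` unit name and the bookkeeping heuristic are assumptions here, while the fleetctl commands and the `MachineID` option are existing fleet features:

```sh
# External scheduler: see which units are currently running where.
fleetctl list-units -no-legend -fields=unit,machine

# Pick a target machine using whatever resource heuristic applies,
# then pin the new instance to it before submitting, e.g. via:
#   [X-Fleet]
#   MachineID=<chosen-machine-id>
fleetctl start worker@1.service
```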

bcwaldon commented 9 years ago

@sigma I clearly missed the piece about your jobs requiring different amounts of resources. I guess I'm not sure what you're asking for here. I'm totally for implementing the "hold" feature, but for a different use-case. Maybe you are more interested in the scheduler discussion going on in https://github.com/coreos/fleet/issues/1055

sigma commented 9 years ago

@bcwaldon absolutely, a pluggable scheduler would definitely be a much better way to implement resource-based scheduling properly; I'll follow that discussion. Something I didn't mention is that I'm also interested in putting a node on hold for another reason: I want the worker part of my cluster to be elastic, so that at some point nodes should be destroyed. Without the ability to stop the incoming stream of units, that would mean killing some of them in flight (since the current scheduler tries its best to use all available nodes). I guess this one is closer to the use-case you're mentioning: maintenance (whether the node comes back at some later point or not is not really relevant).

bcwaldon commented 9 years ago

I think #1069 covers the decommission piece of this issue, while the scheduling piece is being discussed in #1055.