coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

RFC: Requirements for an extensible scheduling system #1055

Open jonboulle opened 9 years ago

jonboulle commented 9 years ago

fleet’s current scheduling engine is rudimentary, and there have been various proposals (1 2 3 4) to either enhance the complexity of the scheduling within fleet, or provide a means for users to extend it without needing to run a custom branch of fleet.

This issue aims to capture the design requirements and restrictions for a solution to these requests, such that it can be implemented in a way that a) keeps with fleet’s architecture and design goals, and b) does not impact existing users of fleet.

(Bear in mind that this is a work-in-progress proposal, not a final set of hard requirements; please provide feedback below.)

The solution should be:

Since comparisons will inevitably arise, we will take this opportunity to draw a distinction between what we’re aiming for and the Mesos framework model. We do not anticipate the API between fleet and the pluggable schedulers becoming arbitrarily complex (ideally it should be limited to the single request-response described above), and we would still consider fleet to be the “entrypoint” for users deploying applications (c.f. Mesos, where the entrypoint is typically the scheduler). To put it another way, schedulers should plug in behind fleet rather than on top of fleet.
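To make that scope concrete, the whole exchange could be as small as the following sketch; the types and field names are purely illustrative assumptions, not part of any existing or planned fleet API:

// Hypothetical sketch of the single request-response exchanged between
// fleet and a pluggable scheduler. All names are illustrative only.
package scheduler

// ScheduleRequest is what fleet could send to a plugged-in scheduler:
// the unit to place plus the machines currently known to the cluster.
type ScheduleRequest struct {
	UnitName    string            `json:"unitName"`
	UnitOptions map[string]string `json:"unitOptions"` // parsed [X-Fleet] options
	Machines    []Machine         `json:"machines"`
}

// Machine describes one candidate host.
type Machine struct {
	ID       string            `json:"id"`
	Metadata map[string]string `json:"metadata"`
}

// ScheduleResponse is the scheduler's answer: the ID of the machine the
// unit should be scheduled to, or an empty string if no machine fits.
type ScheduleResponse struct {
	MachineID string `json:"machineID"`
}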

xh3b4sd commented 9 years ago

Hey guys,

great to see this draft of requirements. I have recently been thinking about different scheduling strategies and how to enable fleet to support them. Since we run various CoreOS clusters, we are really interested in being able to schedule units based on different policies. Reading the requirements above leads to the following brain dump of how such a plugin could be designed and how it could work.

1. scheduler strategies

For different needs there should be different strategies for scheduling units. These strategies could easily be defined in the [X-Fleet] section. At a later point strategies could perhaps be combined, but let's keep it simple in the first step. If no strategy is defined, the current implementation is used.

Here are some possible strategies that come to mind:

strategy      description
lowestCPU     schedules units on the host with the lowest CPU usage
highestCPU    schedules units on the host with the highest CPU usage
lowestCores   schedules units on the host with the fewest CPU cores
highestCores  schedules units on the host with the most CPU cores
lowestMem     schedules units on the host with the lowest memory usage
highestMem    schedules units on the host with the highest memory usage
lowestCount   schedules units on the host with the lowest unit count
highestCount  schedules units on the host with the highest unit count
roundRobin    schedules units round-robin style
default       schedules units based on the current implementation

This is how it could look in a unit file:

[X-Fleet]
SchedulerStrategy=lowestCPU
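To make the table above more concrete, here is a rough sketch of one way strategy names could map to selection functions over per-machine statistics; this is purely an assumption for illustration, and none of these types or names exist in fleet today:

// Illustrative sketch only: mapping strategy names to selection functions
// over per-machine statistics. Only two strategies from the table are shown.
package main

import "fmt"

// MachineStats is an assumed snapshot of the metrics a strategy might need.
type MachineStats struct {
	ID        string
	CPUUsage  float64 // e.g. fraction of CPU busy
	Cores     int
	MemUsage  float64 // fraction of memory in use
	UnitCount int
}

// strategyFuncs picks, for each strategy name, the machine that best
// satisfies it.
var strategyFuncs = map[string]func([]MachineStats) MachineStats{
	"lowestCPU": func(ms []MachineStats) MachineStats {
		best := ms[0]
		for _, m := range ms[1:] {
			if m.CPUUsage < best.CPUUsage {
				best = m
			}
		}
		return best
	},
	"lowestCount": func(ms []MachineStats) MachineStats {
		best := ms[0]
		for _, m := range ms[1:] {
			if m.UnitCount < best.UnitCount {
				best = m
			}
		}
		return best
	},
}

func main() {
	machines := []MachineStats{
		{ID: "a", CPUUsage: 0.7, UnitCount: 5},
		{ID: "b", CPUUsage: 0.2, UnitCount: 9},
	}
	fmt.Println(strategyFuncs["lowestCPU"](machines).ID)   // "b"
	fmt.Println(strategyFuncs["lowestCount"](machines).ID) // "a"
}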

2. configuring fleetd

The scheduler in fleetd needs to handle strategies. Maybe it would be nice to use a feature that already exists. Imagine fleetd could ask some service which MachineID a unit would best be scheduled to. If fleetd detects a SchedulerStrategy in a given unit file, it asks a resource-ruler for the machine ID that best fulfils the given strategy's requirements. That would just add one call to an external resource to get a machine ID during the actual scheduling. To let fleetd know where to ask for a machine ID, one could configure it on startup. Further, it would be necessary to ensure that each fleetd is consistently configured with a resource-ruler or not. There could also be a policy governing whether inconsistent resource-ruler configuration should cause fleetd to throw an error and refuse to start.

This could be placed in a config file:

resource_ruler_endpoint=http://127.0.0.1:7007
resource_ruler_consistency=true
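To illustrate that this really is just one additional call during scheduling, here is a minimal sketch of how fleetd could ask the configured resource-ruler for a machine ID; the /schedule path, query parameter, and JSON shape are assumptions made up for this example:

// Hypothetical sketch of the extra HTTP call fleetd would make when a unit
// carries a SchedulerStrategy option. Endpoint path and response format
// are invented for illustration.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// askResourceRuler asks the configured resource_ruler_endpoint which
// machine best fulfils the given strategy, and returns its machine ID.
func askResourceRuler(endpoint, strategy string) (string, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(endpoint + "/schedule?strategy=" + url.QueryEscape(strategy))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result struct {
		MachineID string `json:"machineID"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	return result.MachineID, nil
}

func main() {
	id, err := askResourceRuler("http://127.0.0.1:7007", "lowestCPU")
	if err != nil {
		fmt.Println("resource-ruler unavailable:", err)
		return
	}
	fmt.Println("schedule unit to machine", id)
}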

3. write a pluggable resource ruler

The ruler could be just another service running alongside fleetd and etcd. It could be configured to collect, at configurable intervals, the data necessary to schedule units, and to store that data in etcd. Think of an HTTP server providing different routes, with rulers plugged in for different strategies.

Starting the resource-ruler could look like this:

resource-ruler --strategy-set=all --collector-interval=10 --host=127.0.0.1 --port=7007
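And a minimal sketch of what the server side could look like, assuming the same made-up /schedule route and response shape as above; in a real ruler the machine choice would of course come from the stats it collects into etcd at the configured interval:

// Hypothetical resource-ruler skeleton: an HTTP service running alongside
// fleetd and etcd that answers "which machine best fits this strategy?".
// The route, flags, and response shape are assumptions for illustration.
package main

import (
	"encoding/json"
	"flag"
	"log"
	"net/http"
)

func main() {
	host := flag.String("host", "127.0.0.1", "address to listen on")
	port := flag.String("port", "7007", "port to listen on")
	flag.Parse()

	http.HandleFunc("/schedule", func(w http.ResponseWriter, r *http.Request) {
		strategy := r.URL.Query().Get("strategy")

		// In a real implementation the ruler would look up the stats it
		// periodically collects into etcd and pick a machine according to
		// the requested strategy. Here we just return a placeholder.
		machineID := pickMachine(strategy)

		json.NewEncoder(w).Encode(map[string]string{"machineID": machineID})
	})

	log.Printf("resource-ruler listening on %s:%s", *host, *port)
	log.Fatal(http.ListenAndServe(*host+":"+*port, nil))
}

// pickMachine is a stand-in for the strategy evaluation described above.
func pickMachine(strategy string) string {
	return "some-machine-id"
}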

As far as I can see, this proposal fulfils nearly every requirement listed above. Still, there will be parts I have missed, because of the complexity of distributed systems. Further, one common disadvantage of additional functionality is increased latency for the given action. In our case this approach should be "good enough", since there is just one additional HTTP call. That HTTP call needs to wait for another HTTP call to etcd, which crunches a little data at the speed of light (because we are using go :rocket:).

So far so good, thanks for coreos <3

dbason commented 9 years ago

@zyndiecate I don't think there is much benefit in scheduling based on core counts - taken in isolation it's not particularly useful. If your cpushares are set correctly, then it's the number of cores divided by the number of units on the machine that influences CPU time.

I think being able to chain a series of simple, composable schedulers (as described in https://github.com/coreos/fleet/issues/1049) is really going to drive the most value - think of a Moog synthesizer for scheduling ;). This keeps the core scheduler stable and allows for community extensions.

epipho commented 9 years ago

Thank you @jonboulle for the update.

I would like one clarification on the requirements. Under Behavior Preserving you call out "No changes to the fleet API". I believe this should be clarified to: "No changes to the user-facing & user APIs", i.e. the API used to schedule, retrieve, and view units should stay as it is today to limit the impact on existing users.

I think it makes sense to expand the API to allow better interaction between fleet and the external schedulers. This would be a separate API (separate port like etcd client vs server ports?) used only by the schedulers to register themselves and publish/retrieve scheduler data.

I like the idea of calling out which schedulers to run, and in which order, via a new parameter in X-Fleet, similar to @zyndiecate's example above with multiple values. This removes the need to specify any sort of priority when registering, and allows different types of units to request specific types of scheduling.

I'd like to propose the following additions to the requirements:

I would also like to (re-)propose that the default, batteries-included scheduler receive a few upgrades. A scheme similar to #945 would allow for more out-of-the-box uses without modifying the default behavior as it stands today.

jonboulle commented 9 years ago

I would like one clarification on the requirements. Under Behavior Preserving you call out "No changes to the fleet API". I believe this should be clarified to: "No changes to the user-facing & user APIs", i.e. the API used to schedule, retrieve, and view units should stay as it is today to limit the impact on existing users.

This is accurate, but it's your follow-on point (a new and expanded scheduler-specific API) that concerns me a little. To quote myself from the end of the OP:

We do not anticipate the API between fleet and the pluggable schedulers becoming arbitrarily complex (ideally it should be limited to the single request-response described above), and we would still consider fleet to be the “entrypoint” for users deploying applications (c.f. Mesos, where the entrypoint is typically the scheduler). To put it another way, schedulers should plug in behind fleet rather than on top of fleet.

If I understand @zyndiecate's proposal correctly, it wouldn't involve such an increase in complexity as the interface would be limited to a single call.

jonboulle commented 9 years ago

I would also like to (re-)propose that the default, batteries-included scheduler receive a few upgrades. A scheme similar to #945 would allow for more out-of-the-box uses without modifying the default behavior as it stands today.

Could you explain a little more on why you feel #945 should be baked into the core rather than implemented as a chainable scheduler?

epipho commented 9 years ago

I don't see the API becoming arbitrarily complex; however, I still maintain that fleet itself should provide some rudimentary persistence. This persistence would consist of single-level key-value pairs per machine, collected and sent to fleet from each machine. Having this persistence would simplify the setup and use of the schedulers and increase performance (a scheduler could request data at registration time, and the appropriate data could be sent along with a scheduling request). I think this is worth the slight complexity it would add to fleet itself.
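As a sketch of how small that persistence could stay, assuming nothing beyond what is described above (one flat set of key-value pairs reported per machine), the data could look roughly like this; the keys and values are invented for illustration:

// Illustrative only: the rudimentary persistence described above could be
// as simple as one flat map of key-value pairs per machine, periodically
// reported by each host and handed to schedulers on request.
package main

import "fmt"

// machineData maps a machine ID to the flat key-value pairs it reported.
type machineData map[string]map[string]string

func main() {
	data := machineData{
		"machine-a": {"mem_total": "16384", "mem_free": "2048", "units": "9"},
		"machine-b": {"mem_total": "16384", "mem_free": "9216", "units": "4"},
	}

	// A scheduler registered for memory data could receive just the keys
	// it asked for at registration time.
	for id, kv := range data {
		fmt.Println(id, "mem_free =", kv["mem_free"])
	}
}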

The following two endpoints would be added to Fleet

Each scheduler would implement just one endpoint

The fleet unit itself specifies the order of the schedulers to run. The list of machines output by one scheduler is fed into the next in the chain, even if there is only one machine, to ensure the machine meets all qualifications. If at any point a scheduler returns zero machines, the unit cannot be scheduled.
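To make the chaining behaviour concrete, here is a small sketch under the assumptions above: each scheduler acts as a filter from a machine list to a (possibly smaller) machine list, the chain order comes from the unit, and an empty result at any point means the unit cannot be scheduled. All names below are invented for illustration:

// Illustrative sketch of chaining schedulers as filters: the machine list
// returned by one scheduler is fed into the next; zero machines at any
// point means the unit cannot be scheduled.
package main

import (
	"errors"
	"fmt"
)

// Scheduler narrows a list of candidate machine IDs for a given unit.
type Scheduler interface {
	Filter(unit string, machines []string) []string
}

// runChain applies the schedulers in the order requested by the unit.
func runChain(unit string, machines []string, chain []Scheduler) ([]string, error) {
	for _, s := range chain {
		machines = s.Filter(unit, machines)
		if len(machines) == 0 {
			return nil, errors.New("unit cannot be scheduled: no machines left")
		}
	}
	return machines, nil
}

// evenMachines is a toy scheduler that keeps every second candidate.
type evenMachines struct{}

func (evenMachines) Filter(unit string, machines []string) []string {
	var out []string
	for i, m := range machines {
		if i%2 == 0 {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	result, err := runChain("web.service", []string{"m1", "m2", "m3"}, []Scheduler{evenMachines{}})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("candidates after chain:", result)
}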

epipho commented 9 years ago

@jonboulle As for #945, I think the default scheduler as it exists today just isn't enough. While I agree that the reservation system could (and probably should) be moved to a separate chained scheduler, the job multiplier feels like an easy, low-complexity addition with a lot of benefit.

Allowing fleet to load balance asymmetric work loads would open it up for a lot more applications out of the box.

camerondavison commented 9 years ago

While I think this RFC is pretty cool, it does not really look like it's getting traction. Is there any way we could have something simpler that deals just with over-provisioning of memory?

Right now I have a problem where we have 10 machines running CoreOS and about 12 services running in multiple versions. Four services can fit on one machine; they are all about the same size. This means we have the capacity to run 40 service copies, much more than the 24 or so that we need. Even so, if a machine or two fails or restarts, a cascading failure occurs as too many services pile onto one machine.

epipho commented 9 years ago

I was hoping to work on this next, but am currently stalled out on #1077. Now that etcd 2.0 has shipped in alpha, I am hoping that @jonboulle and @bcwaldon can give me a bit of direction on both issues.

vassilevsky commented 9 years ago

I see that there was an attempt to bring machine resources into play, but it was deleted in ab275c1d510a72d5ff221c18490efcf0f08f8d01 (although the resource dir is still there). I wonder why you decided to do that.

wuqixuan commented 9 years ago

@jonboulle, it seems you forgot to list the #943 requirement.

jonboulle commented 8 years ago

/cc @htr