jonboulle opened this issue 9 years ago
Hey guys,
Great to see that draft of requirements. I have recently been thinking about different scheduling strategies and how to enable fleet to support them. Since we run various CoreOS clusters, we are very interested in being able to schedule units based on different policies. Reading the requirements above led to the following brain dump of how such a plugin could be designed and how it would work.
Different needs call for different strategies for scheduling units. These strategies could easily be defined in the `[X-Fleet]` section. At a later point these strategies could perhaps be combined, but let's keep it simple in the first step. If no strategy is defined, the current implementation is used.
Here are some possible strategies that come to mind:

| strategy | description |
|---|---|
| lowestCPU | schedules units on the host with the lowest CPU usage |
| highestCPU | schedules units on the host with the highest CPU usage |
| lowestCores | schedules units on the host with the fewest CPU cores |
| highestCores | schedules units on the host with the most CPU cores |
| lowestMem | schedules units on the host with the lowest memory usage |
| highestMem | schedules units on the host with the highest memory usage |
| lowestCount | schedules units on the host with the lowest unit count |
| highestCount | schedules units on the host with the highest unit count |
| roundRobin | schedules units round-robin style |
| default | schedules units based on the current implementation |
This is how it could look in a unit file:

```ini
[X-Fleet]
SchedulerStrategy=lowestCPU
```
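Each strategy in the table above boils down to comparing one metric across candidate machines. A minimal sketch of how the `lowestCPU` strategy could be evaluated, assuming a hypothetical `MachineState` snapshot (the type and field names are illustrative, not fleet's actual machine state):

```go
package main

import "fmt"

// MachineState is a hypothetical per-machine metrics snapshot.
type MachineState struct {
	ID       string
	CPUUsage float64 // fraction of CPU in use, 0.0–1.0
}

// pickLowestCPU implements the lowestCPU strategy from the table:
// return the ID of the machine with the least CPU usage.
func pickLowestCPU(machines []MachineState) string {
	if len(machines) == 0 {
		return ""
	}
	best := machines[0]
	for _, m := range machines[1:] {
		if m.CPUUsage < best.CPUUsage {
			best = m
		}
	}
	return best.ID
}

func main() {
	machines := []MachineState{
		{ID: "aaaa", CPUUsage: 0.82},
		{ID: "bbbb", CPUUsage: 0.15},
		{ID: "cccc", CPUUsage: 0.47},
	}
	fmt.Println(pickLowestCPU(machines)) // bbbb
}
```

The `highest*` and `lowest*` variants differ only in the comparison operator and the field compared.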
**fleetd**

The scheduler in `fleetd` needs to handle strategies. Maybe it would be nice to use a feature that already exists. Imagine `fleetd` could ask some service which `MachineID` a unit would best be scheduled to. If `fleetd` detects a `SchedulerStrategy` in a given unit file, it asks a `resource-ruler` for the machine ID that best fulfils the given strategy's requirements. That would add just one call to an external resource to fetch a machine ID during the actual scheduling. To let `fleetd` know where to ask for a machine ID, the endpoint could be configured at startup. Further, it would be necessary to ensure that each `fleetd` either is or is not configured with a `resource-ruler`. There might also be different policies on whether to enforce consistent use of `resource-ruler`s across the cluster, i.e. whether to throw an error and refuse startup or not.
This could be placed in a config file:

```ini
resource_ruler_endpoint=http://127.0.0.1:7007
resource_ruler_consistency=true
```
The ruler could be just another service running alongside `fleetd` and `etcd`. It could be configured to collect, at configurable intervals, whatever data is necessary to schedule units, and to store that data in `etcd`. Think of an HTTP server providing different routes with plugged-in rulers for different strategies.
Starting the `resource-ruler` could look like this:

```
resource-ruler --strategy-set=all --collector-interval=10 --host=127.0.0.1 --port=7007
```
As far as I can see, this proposal fulfils nearly every requirement listed above. There will surely be parts I have missed, given the complexity of distributed systems. Further, one common disadvantage of additional functionality is increased latency for the given action. In this case the approach should be "good enough", since there is just one additional HTTP call. That HTTP call needs to wait for another HTTP call to `etcd`, then crunch a little data at the speed of light (because we are using Go :rocket:).
So far so good, thanks for CoreOS <3
@zyndiecate I don't think there is much benefit in doing metrics based on cores - when taken in isolation it's not particularly useful. If your cpushares are set correctly then it's the number of cores divided by the number of units on the machine that influences cpu time.
I think being able to chain a series of simple composable schedulers (as described in https://github.com/coreos/fleet/issues/1049) is really going to drive the most value - think a moog synthesizer for scheduling ;). This keeps the core scheduler stable, and allows for community extensions
Thank you @jonboulle for the update.
I would like one clarification on the requirements. Under Behavior Preserving you call out "No changes to the fleet API". I believe this should be clarified to: "No changes to the user-facing & user APIs", i.e. the API used to schedule, retrieve, and view units should stay as it is today to limit the impact on existing users.
I think it makes sense to expand the API to allow better interaction between fleet and the external schedulers. This would be a separate API (separate port like etcd client vs server ports?) used only by the schedulers to register themselves and publish/retrieve scheduler data.
I like the idea of calling out which schedulers to run, and in which order, via a new parameter in X-Fleet, similar to @zyndiecate's example above but with multiple values. This removes the need to specify any sort of priority when registering, and allows different types of units to request specific types of scheduling.
I'd like to propose the following additions to the requirements:
I would also like to (re-) propose that the default batteries included scheduler receive a few upgrades. A scheme similar to #945 would allow for more out of the box uses without modifying the default behavior as it stands today.
> I would like one clarification on the requirements. Under Behavior Preserving you call out "No changes to the fleet API". I believe this should be clarified to: "No changes to the user-facing & user APIs", i.e. the API used to schedule, retrieve, and view units should stay as it is today to limit the impact on existing users.
This is accurate, but it's your follow-on point (a new and expanded scheduler-specific API) that concerns me a little. To quote myself from the end of the OP:
> We do not anticipate the API between fleet and the pluggable schedulers becoming arbitrarily complex (ideally it should be limited to the single request-response described above), and we would still consider fleet to be the “entrypoint” for users deploying applications (c.f. Mesos, where the entrypoint is typically the scheduler). To put it another way, schedulers should plug in behind fleet rather than on top of fleet.
If I understand @zyndiecate's proposal correctly, it wouldn't involve such an increase in complexity as the interface would be limited to a single call.
> I would also like to (re-) propose that the default batteries included scheduler receive a few upgrades. A scheme similar to #945 would allow for more out of the box uses without modifying the default behavior as it stands today.
Could you explain a little more on why you feel #945 should be baked into the core rather than implemented as a chainable scheduler?
I don't see the API becoming arbitrarily complex, however I still maintain that fleet itself should provide some rudimentary persistence. Persistence would consist of single-level key value pairs per machine, collected and sent to fleet from each machine. Having this persistence would simplify the setup and use of the schedulers and increase performance (scheduler could request data at registration time and appropriate data could be sent along on a scheduler request). I think this is worth the slight complexity that it would add to fleet itself.
The following two endpoints would be added to fleet.
Each scheduler would implement just one endpoint.
The fleet unit itself specifies the order of the schedulers to run. The list of machines output by one scheduler is fed into the next in the chain, even when only one machine remains, to ensure the machine meets all qualifications. If at any point a scheduler returns zero machines, the unit cannot be scheduled.
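The chaining rule above can be sketched as follows; `Scheduler`, `runChain`, and the example filters are hypothetical names for illustration, not actual fleet interfaces:

```go
package main

import (
	"errors"
	"fmt"
)

// Scheduler is the single endpoint each chained scheduler would expose,
// modelled here as a function: take a candidate machine list and return a
// (possibly filtered and reordered) list.
type Scheduler func(candidates []string) []string

// runChain feeds the machine list through each scheduler in the order the
// unit file specifies. Every scheduler runs even when only one machine
// remains, so that machine must still meet all qualifications; if any
// scheduler returns zero machines, the unit cannot be scheduled.
func runChain(machines []string, chain []Scheduler) (string, error) {
	for _, s := range chain {
		machines = s(machines)
		if len(machines) == 0 {
			return "", errors.New("unit cannot be scheduled: no machines left")
		}
	}
	return machines[0], nil
}

func main() {
	// Hypothetical chain: drop machines without SSDs, then keep the least
	// loaded survivor.
	ssd := map[string]bool{"m1": true, "m3": true}
	load := map[string]int{"m1": 7, "m3": 2}

	hasSSD := func(ms []string) []string {
		var out []string
		for _, m := range ms {
			if ssd[m] {
				out = append(out, m)
			}
		}
		return out
	}
	leastLoaded := func(ms []string) []string {
		if len(ms) == 0 {
			return nil
		}
		best := ms[0]
		for _, m := range ms[1:] {
			if load[m] < load[best] {
				best = m
			}
		}
		return []string{best}
	}

	target, err := runChain([]string{"m1", "m2", "m3"}, []Scheduler{hasSSD, leastLoaded})
	fmt.Println(target, err) // m3 <nil>
}
```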
@jonboulle As for #945: I think the default scheduler as it exists today just isn't enough. While I agree that the reservation system could (and probably should) be moved to a separate chained scheduler, the job multiplier feels like an easy, low-complexity addition with a lot of benefit.
Allowing fleet to load-balance asymmetric workloads would open it up to a lot more applications out of the box.
While I think this RFC is pretty cool, it does not really look like it's getting traction. Is there any way we could have something simpler that just deals with over-provisioning of memory?
Right now I have a problem: we have 10 machines running CoreOS and about 12 services running in multiple versions. Four services fit on one machine, and they are all about the same size. This means we have the capacity to run 40 service copies, far more than the roughly 24 we need. Even so, if a machine or two fails or restarts, a cascading failure occurs as fleet piles too many services onto one machine.
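For illustration, a toy placement loop with a hard per-machine limit shows why capping units per machine would prevent the pile-up: with the numbers above (24 copies, at most 4 per machine), even losing two of the ten machines leaves 8 × 4 = 32 slots, still enough for every copy. All names here are hypothetical:

```go
package main

import "fmt"

// reschedule greedily distributes units across machines subject to a hard
// per-machine limit, the kind of capacity-aware constraint the comment asks
// for. It returns the per-machine counts and whether every unit fit.
func reschedule(units, machines, limit int) ([]int, bool) {
	counts := make([]int, machines)
	for i := 0; i < units; i++ {
		placed := false
		for j := range counts {
			if counts[j] < limit {
				counts[j]++
				placed = true
				break
			}
		}
		if !placed {
			return counts, false
		}
	}
	return counts, true
}

func main() {
	// 24 service copies, 10 machines, at most 4 per machine: fits easily.
	_, ok := reschedule(24, 10, 4)
	fmt.Println(ok) // true
	// Two machines fail: 8 remain, 8*4 = 32 >= 24, still schedulable.
	// Without the limit, fleet could stack far more than 4 on one machine.
	_, ok = reschedule(24, 8, 4)
	fmt.Println(ok) // true
}
```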
I was hoping to work on this next, but am currently stalled out on #1077. I am hoping now that etcd2.0 has shipped in alpha that @jonboulle and @bcwaldon can give me a bit of direction on both issues.
I see that there was an attempt to bring machine resources into play, but it was deleted in ab275c1d510a72d5ff221c18490efcf0f08f8d01 (although the `resource` dir is still there). I wonder why you decided to do that.
@jonboulle, it seems you forgot to list the #943 requirement.
/cc @htr
fleet’s current scheduling engine is rudimentary, and there have been various proposals (1 2 3 4) to either enhance the complexity of the scheduling within fleet, or provide a means for users to extend it without needing to run a custom branch of fleet.
This issue aims to capture the design requirements and restrictions for a solution to these requests, such that it can be implemented in a way that a) keeps with fleet’s architecture and design goals, and b) does not impact existing users of fleet.
(Bear in mind that this is a work in progress proposal, not a final set of hard requirements; please provide feedback below).
The solution should be:

- (`fleetd`) for fleet itself)
- `X-Fleet` options should be sufficient

Since comparisons will inevitably arise, we will take this opportunity to draw a distinction between what we’re aiming for and the Mesos framework model. We do not anticipate the API between fleet and the pluggable schedulers becoming arbitrarily complex (ideally it should be limited to the single request-response described above), and we would still consider fleet to be the “entrypoint” for users deploying applications (c.f. Mesos, where the entrypoint is typically the scheduler). To put it another way, schedulers should plug in behind fleet rather than on top of fleet.