coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

Efficient resource utilisation, re-balancing mandate for Engine #922

Closed · yaronr closed this 9 years ago

yaronr commented 9 years ago

I'm looking for a way to improve Engine scheduling, so that its decisions can be based on machine metrics (available CPU, memory) rather than 'number of running units'. In the longer term, I would love to see this grow into the ability to plug in different scheduling heuristics (first globally, but maybe later per unit).

As a first step, an MVP if you will: if each machine could report a 'score' on itself, the engine could just pick the machine with the highest score (see the sketch at the end of this comment). This would be a huge improvement over today's scheduling.

Even longer term, I'm looking to have a re-balancing feature, whereby the scheduler would have a mandate to re-assign units to other machines. This is probably a separate development thread, but it could build on the one above.

So, to give just one example of the payoff: even a partial implementation of this approach would give huge advantages, such as being able to efficiently use machines of different capacities within the same cluster.

This is related to #555; I think #555 is a required step toward what I have outlined above.
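
A minimal Go sketch of the score idea; all names here are hypothetical, not fleet's actual API. Each machine self-reports one fitness number and the engine simply takes the maximum:

```go
// Hypothetical sketch of score-based selection; none of these types exist
// in fleet. Each machine reports a single fitness score (e.g. derived from
// free CPU and memory) and the engine picks the highest-scoring machine.
package main

import "fmt"

type MachineScore struct {
	MachineID string
	Score     float64 // higher means more capacity to spare
}

// pickMachine returns the ID of the highest-scoring machine, or "" if none.
func pickMachine(scores []MachineScore) string {
	bestID, bestScore := "", -1.0
	for _, s := range scores {
		if s.Score > bestScore {
			bestID, bestScore = s.MachineID, s.Score
		}
	}
	return bestID
}

func main() {
	scores := []MachineScore{
		{MachineID: "machine-a", Score: 0.2}, // heavily loaded
		{MachineID: "machine-b", Score: 0.9}, // mostly idle
	}
	fmt.Println(pickMachine(scores)) // machine-b
}
```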

PierreKircher commented 9 years ago

ref #747

UPDATE 9/24:

- remove note about unfair scheduling, as the offering/bidding mechanism is gone
- remove note about supporting memory-based scheduling

There are two major aspects of scheduling for fleet to focus on: resource scheduling and dependency scheduling.

As far as resource scheduling goes, fleet is not going to have a full-featured scheduler. We have no plans to support any resource-related parameters past the leveling of the number of units scheduled to a particular machine.

gucki commented 9 years ago

IMO, just counting the number of units is simply not enough when you have quite differently sized hosts and/or containers.

What about being able to specify a weight for containers and a weight limit for hosts (a rough sketch follows below)?

What about implementing a simple hook/callback which would allow users to write their own scheduler?
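
To make the weighting idea concrete, here is a minimal sketch; the types and numbers are invented for illustration, not anything fleet implements:

```go
// Invented types illustrating weight-based placement: units declare a
// weight, hosts declare a weight limit, and a host is eligible only while
// the weights already scheduled on it leave room for the new unit.
package main

import "fmt"

type Host struct {
	ID          string
	WeightLimit int // total weight this host may carry
	Scheduled   int // sum of weights of units already placed here
}

// eligible filters hosts that can still take a unit of weight w.
func eligible(hosts []Host, w int) []Host {
	var out []Host
	for _, h := range hosts {
		if h.Scheduled+w <= h.WeightLimit {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hosts := []Host{
		{ID: "small-box", WeightLimit: 10, Scheduled: 8},
		{ID: "big-box", WeightLimit: 100, Scheduled: 40},
	}
	// A unit of weight 5 only fits on big-box.
	fmt.Println(eligible(hosts, 5))
}
```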

yaronr commented 9 years ago

@gucki I like the last one: implementing a simple hook/callback. In fact, that's what I was alluding to with 'plug-in'.

@PierreKircher, although it's perfectly OK for the official CoreOS team not to want to develop this capability, I still think it should be opened up so the community and individual developers can add their own. The default behaviour could remain 'number of units'.

gucki commented 9 years ago

@PierreKircher @bcwaldon I could work on a patch/PR for the weighting stuff. Would you be willing to merge that? I don't know the code well, but I assume it's only a minor code change, and still a big win over the current limitation. The plugin/hook/callback would be the best solution, and I'd also be happy to contribute here if it'd be merged.

PierreKircher commented 9 years ago

I'm not from the CoreOS team, but if you have a weight extension to fleet I'd be interested as well.

There are multiple roads to Rome.

Hard scheduling:

For example, Docker allows limiting a container with memory/CPU shares, and cAdvisor would tell you how many resources are actually used; counting the limits against the max available would be an option.

Hard scheduling works via metadata: if you tag your node and reflect that in the X-Fleet conditions of the unit file, all it takes is a stop and start of that given unit to get it moved (example below).
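
For reference, that combination looks roughly like this in a unit file; the unit name and metadata values are just examples, and the node would be started with matching fleet metadata (e.g. via cloud-config):

```ini
# myapp.service (example): Docker enforces the per-container resource
# limits, and the X-Fleet section restricts scheduling to tagged nodes.
[Unit]
Description=My App

[Service]
ExecStart=/usr/bin/docker run --memory=512m --cpu-shares=512 myapp

[X-Fleet]
# Only schedule on machines whose fleet metadata includes role=worker.
MachineMetadata=role=worker
```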

At least that's a start, I guess.

gucki commented 9 years ago

@PierreKircher That'd work for the first/manual start only. But what about an automatic (re)start in case of a node failure? The container could not be moved to another live node, as it's still bound to the dead node, the one we chose when starting it manually. Please correct me if I'm wrong.

PierreKircher commented 9 years ago

Well, my humble approach to that would be to have a global control container which does nothing except watch etcd values (fleet units) plus TTLs with IPs.

If something changes, I'd replace the old unit file and resubmit a new one from that control instance.

It's probably not the smartest or most efficient way to handle it, but it's a workaround.
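
A rough sketch of that watch loop, using the etcd v2 Go client; the key prefix and the reaction logic are illustrative assumptions, and the actual destroy/resubmit of the unit file (e.g. by shelling out to fleetctl) is left out:

```go
// Sketch of a "control container" that watches fleet's state in etcd and
// reacts to changes. The prefix and the response to each event are
// assumptions for illustration only.
package main

import (
	"context"
	"log"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{
		Endpoints: []string{"http://127.0.0.1:2379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	kapi := client.NewKeysAPI(c)

	// Watch fleet's keyspace recursively; each change (unit added, TTL key
	// expired, etc.) yields a Response describing the affected key.
	w := kapi.Watcher("/_coreos.com/fleet", &client.WatcherOptions{Recursive: true})
	for {
		resp, err := w.Next(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("%s on %s", resp.Action, resp.Node.Key)
		// ... decide here whether to replace and resubmit a unit file ...
	}
}
```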

yaronr commented 9 years ago

Gentlemen, I'm pleased, but not surprised, that my request started such a lively discussion. It goes to show that enabling this through some 'plug' (strategy) mechanism would be worthwhile in terms of future returns, via individual strategies for scheduling and metadata.

Small change, big payoff. #555 is in the same vein as @PierreKircher's suggestion, I think, with regard to using metadata.

dbason commented 9 years ago

FWIW, we are currently looking at forking to provide the ability to weight units and hosts, and also to provide memory constraints. The approach we are looking at would be similar to how Mesos does its basic resource scheduling. We're going to keep this in separate modules and then just use them to provide a filtered list of available machines to the fleet scheduler internally. We're hoping this approach will make it easy enough to keep the fork in sync with the main project.

bcwaldon commented 9 years ago

First, a bit of an apology: We haven't done a great job at providing an overall direction of the fleet scheduler. We have been trying to be extremely careful in only adding features that we know we can support in a sustainable way, and without making the scheduler cumbersome or too specific for simpler use-cases. Keeping the featureset small makes it easier to ship quality software, but it has become clear that an answer to complex scheduling requirements (i.e. weights, resources, etc) must be part of the fleet MVP.

I'm still not quite convinced that adding this logic into the core fleet scheduler makes the most sense, but I do want to figure out how to enable complex scheduling to ensure that fleet solves real problems for people and isn't just useful for demos and toy deployments.

Given the interest in growing the fleet scheduler in many different directions, it seems like the most logical thing to do is to make the scheduler "pluggable" through the HTTP API. This would give everyone the flexibility to develop the schedulers that we need, but also prevent anyone from having to maintain a fork, build their own images, etc. The other major piece is how we expose the information needed to inform scheduling decisions, but this could be gathered and published rather easily by the fleet agents using the /machines HTTP endpoint.
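
As a sketch of the consumer side, an external scheduler could start by listing machines through the existing API; the TCP address below is an assumption, since fleet's API is typically served on a socket that you must expose yourself:

```go
// Sketch of an external scheduler pulling machine state from fleet's HTTP
// API (GET /fleet/v1/machines). The metrics a custom scheduler would need
// (load, free memory) are not in this payload today; publishing them, e.g.
// via machine metadata, is the open question discussed above.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type machinesPage struct {
	Machines []struct {
		ID        string            `json:"id"`
		PrimaryIP string            `json:"primaryIP"`
		Metadata  map[string]string `json:"metadata"`
	} `json:"machines"`
}

func main() {
	// Assumes the fleet API has been exposed on this TCP address.
	resp, err := http.Get("http://127.0.0.1:49153/fleet/v1/machines")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var page machinesPage
	if err := json.NewDecoder(resp.Body).Decode(&page); err != nil {
		log.Fatal(err)
	}
	for _, m := range page.Machines {
		fmt.Println(m.ID, m.PrimaryIP, m.Metadata)
	}
}
```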

This is a high-level summary and there's obviously still a lot to talk about here. Is this a direction that sounds like it would fit your needs? I also want to be clear that we're definitely open to accepting patches to help move things along.

dbason commented 9 years ago

I think you guys are doing a great job creating a lightweight scheduler; having a pluggable scheduler would be fantastic. I do think that, as @gucki mentioned, rescheduling units in the event of a node failure will be a tricky use case. Either the plugin would need to poll regularly, or it would need a way to subscribe to a notification from fleet that a unit needs to be rescheduled.

bcwaldon commented 9 years ago

@dbason Nobody said this would be easy :) @jonboulle and I will spend some time planning this out on Friday and get a more fleshed-out plan together to talk through.

yaronr commented 9 years ago

@bcwaldon Thanks for picking this up. I obviously share your thoughts on the implications of this issue for the potential usefulness of CoreOS and fleet in real-world situations. At this point it seems we all agree on the 'what' and 'why', which is great. I'm looking forward to hearing more and providing my feedback if required.

At this point I would like to add that having the ability to configure scheduling as part of cloud-init is probably a requirement as well. It should also be taken into consideration that node metadata is very important for scheduling decisions, so some mechanism (or at least placeholders) for it needs to be part of the design.
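
For what it's worth, node metadata is already configurable via cloud-config today, so scheduler configuration could plausibly live alongside it, e.g.:

```yaml
#cloud-config
coreos:
  fleet:
    # Existing fleet option: tag this node for metadata-based scheduling.
    metadata: "role=worker,disk=ssd"
```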

cheribral commented 9 years ago

This is how we have been toying with the idea: https://github.com/cheribral/fleet/tree/resource-schedule. A unit will only be picked up by something with available resources. It would definitely be nice if the whole mechanism were pluggable somehow. I'm not sure how the scheduler would get all the information it needs though without touching the insides of things.

Also, it would be nice to have a way to know that something couldn't be scheduled, rather than having it end up sitting in registry limbo. I see a handful of returned strings with nice reasons for scheduling failures, but they don't seem to be used. That is a separate issue, though.

epipho commented 9 years ago

What if the schedulers were chainable? The config file / cloud-config would specify a list of scheduler endpoints to be called in order.

The input to each scheduler would be the unit definition and list of eligible machines in priority order from the previous scheduler. The output would be the filtered result of the machine list, again in priority order.

The two existing schedulers would remain as the endcaps for the whole pipeline.

First the existing dependency scheduler runs, all active machines are passed to it. The dependency scheduler outputs the list of machines that meet all the criteria.

Next the list of pluggable schedulers is run, further reducing the number of eligible machines. Examples would be eliminating machines that have a high cpu load, machines that do not have enough free memory, or machines that are locked for coreos update.

Lastly the default fleet scheduler runs, placing the unit on the appropriate host using the default rules if more than one machine made it to the end.

If at any point a scheduler emits 0 machines the process is ended and the unit is not scheduled. Depending on a setting in the job or on fleet itself it could be sent back through the pipeline at a later time or could simply error out and fail to start.

This would allow very simple schedulers to be created and composed into complex behavior.
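
A minimal sketch of that pipeline in Go; all types here are illustrative stand-ins, not fleet's. Each stage filters a priority-ordered machine list, and the chain aborts as soon as a stage returns nothing:

```go
// Chainable schedulers as pure filter functions: take the unit plus an
// ordered machine list, return a (possibly smaller) ordered list.
package main

import "fmt"

type Machine struct{ ID string }
type Unit struct{ Name string }

// Scheduler is one stage in the pipeline.
type Scheduler func(u Unit, in []Machine) []Machine

// runChain applies each stage in order, stopping if any stage emits 0
// machines (the unit cannot be scheduled right now).
func runChain(u Unit, machines []Machine, chain []Scheduler) []Machine {
	for _, s := range chain {
		machines = s(u, machines)
		if len(machines) == 0 {
			return nil
		}
	}
	return machines
}

func main() {
	// Stand-in stage: pretend the highest-priority machine is overloaded.
	dropFirst := func(u Unit, in []Machine) []Machine { return in[1:] }
	out := runChain(Unit{Name: "myapp.service"},
		[]Machine{{"a"}, {"b"}, {"c"}},
		[]Scheduler{dropFirst})
	fmt.Println(out) // [{b} {c}]
}
```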

epipho commented 9 years ago

If the default handler were to take the priority order of the machines into account, #943 could be implemented very easily as a pluggable scheduler.

You could even go as far as making each dependency check a separate scheduler, which would let you easily generate the data for debugging unit startup, as in #912.

gucki commented 9 years ago

I like the idea of chainable schedulers. The API could be a simple HTTP JSON request, I suppose.
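
For example, the request body for one stage in the chain might look something like this (the field names are invented for illustration); the response would be the same machines array, filtered and reordered:

```json
{
  "unit": {
    "name": "myapp.service",
    "desiredState": "launched"
  },
  "machines": [
    { "id": "machine-b", "metadata": { "role": "worker" } },
    { "id": "machine-a", "metadata": { "role": "worker" } }
  ]
}
```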

dbason commented 9 years ago

Also +1 for chainable schedulers.

As @cheribral mentioned, we would need some feedback from fleet when a unit is unable to be scheduled, though.

robhaswell commented 9 years ago

+1 chainable schedulers. Being able to write a scheduler concerned with only one core competency, and to leave other considerations to other schedulers, would be a great capability.

Dunedan commented 9 years ago

@epipho I really like your idea of leveraging the default CoreOS schedulers within a chain of schedulers, and of course the idea of chainable schedulers in general.

I'm not so sure about the idea of specifying scheduler endpoints using cloud-init, because that'd mean they have to run outside the cluster (at least while bootstrapping it), adding external dependencies. I would rather see something without such external dependencies.

yaronr commented 9 years ago

Guys, I just wanted to check in and see where this is headed. Any thoughts regarding implementation priority or expected delivery time?

jonboulle commented 9 years ago

Hey guys, really great to see this proposal and the discussion! I wanted to give a little feedback/input from @bcwaldon and me (as the current maintainers of fleet) and share where we’d like to see this go.

As we’ve expressed elsewhere, we definitely want to provide some means of enabling complex scheduling requirements with fleet, but without baking them into the core itself, so we can keep it minimal and stable. Essentially, some modular means of allowing people to hook in their own extensible schedulers. The two of us have struggled to come up with a good general solution to this problem that we’ve both been happy with.

To be totally transparent and clear: recently we have not been able to spend the time on fleet that we would like. As a small startup we have had to spend our limited resources on various projects (rocket, for example, and the forthcoming release of etcd), and fleet has unfortunately been a casualty of this.

The upside of this is that we are keen to get more people onboard taking an active and leading role in fleet development. To that end, we are looking to the community to help come up with a design and implementation for this solution, and we would like to have more individuals from outside the company formally join the fleet MAINTAINERS.

The amount of thought and effort you guys are putting into this really demonstrates how seriously you are taking fleet and we would love to work with you to craft the best solution possible.

We have put together another issue describing our requirements for such a system and we would love to hear if anyone is interested in stepping up and taking the lead on this.

yaronr commented 9 years ago

@jonboulle You guys are doing a great job, and I am sure you are overwhelmed with work. But it's important to be able to distance yourself from the day-to-day hum of bugs, features, and change requests, and look at this at a more strategic level. Smart, pluggable scheduling would be a quantum leap for CoreOS, and the required effort seems minimal. To put it bluntly, as a user of CoreOS, smart (useful) scheduling is much more important than a better etcd. I think it's fine to let the community contribute here, but waiting for the community could be a business mistake.

I'm available for a phone call if you like, I could elaborate more.

bcwaldon commented 9 years ago

@yaronr

The primary function of fleet right now is to provide a cluster-wide abstraction of systemd powerful enough to manage today's higher-level scheduling frameworks (Kubernetes, Mesos, Deis, etc). These frameworks are relatively simple to manage right now, so we haven't had to expand fleet's scheduling capabilities any further. This is not to say that we are actively against adding these capabilities (#1055), just that we can no longer prioritize our time to work on it.

To touch briefly on the critical importance of etcd, if you'll pardon the hyperbole: without solid and stable consensus, one cannot build meaningful distributed systems. A huge number of fleet-related support requests we receive (not just in the fleet repo [0], but also in others like coreos/bugs, coreos/coreos-cloudinit, and coreos/coreos-vagrant) boil down to underlying configuration or stability issues with etcd. Making the etcd experience seamless is paramount for fleet to work well everywhere. And that's not to mention the many other users of etcd outside of fleet, including some large external projects (Kubernetes, Cloud Foundry, etc.).

We're more than happy to have your input on #1055, as we definitely want to make sure the solution we go with works for you.

[0] https://github.com/coreos/fleet/issues?q=is%3Aissue+label%3Aetcd

yaronr commented 9 years ago

Hi @bcwaldon

Thank you again for not killing the initiative.

I know that what I'm suggesting leads towards a lot more flexibility and usability in fleet, and that there are supposedly platforms like Mesos and Kube that claim to do scheduling very well, but:

1. I have been unsuccessful running Mesos on CoreOS, and I am not aware of an easy or sensible way of doing it. (That doesn't mean it isn't possible, just that I haven't found a way. If it is possible, please give me some pointers.)
2. Scheduling in Mesos and Kube is very rudimentary, (arguably!) not much better than what fleet currently does.
3. If fleet had the capabilities I'm discussing, we would have a truly clustered OS where the added value of Kube or Mesos would be marginal (and not worth the added complexity), and easily bridgeable.

Of course this is a matter of opinion, and I fully understand and accept the CoreOS point of view.

BTW, can you please help me understand the reasons behind developing a new key/value store as part of CoreOS, instead of using one of the existing implementations? I couldn't find any discussion on the subject. This is just out of curiosity.

bcwaldon commented 9 years ago

I believe that the implementation of these application-oriented scheduling heuristics belong in one of these higher-order systems. There already exist large teams of incredibly smart people motivated to build the right solution, and I have faith that a viable solution will exist very soon if it does not already.

This has clearly left the realm of fleet at this point, so I'll keep it simple: there were no consistent/distributed key-value datastores that could be easily deployed and maintained in an ephemeral environment.