Netflix / Fenzo

Extensible Scheduler for Mesos Frameworks

How does Fenzo scale? #47

Closed huntc closed 9 years ago

huntc commented 9 years ago

Apologies if this isn't the place to ask questions - I couldn't find a mailing list.

My understanding is that a TaskScheduler is quite stateful and should be a singleton within a cluster of JVM instances. Is this view incorrect? If so then what are the considerations around scaling?

Thanks!

spodila commented 9 years ago

You are right that a TaskScheduler is stateful and that there should be a single instance per cluster. The size of the state Fenzo holds is proportional to the number of VMs (a.k.a. agents/hosts) and the number of tasks assigned. Other state, such as that related to autoscaling and groups, is too small to be a concern. Although I have no specific data to quote, I used this test program to create 10,000 VMs (agents/hosts) with 16 cores each, filling the 160K cores with 45K tasks (a mix of 1-, 8-, and 12-CPU tasks). I observed a resident set size of about 750 MB. While this isn't meant as a reference for sizing memory at a given scale, the quick hack shows a way to test your expected scale and measure both the anticipated memory footprint and the performance to expect. Fenzo also makes it easy to test new plugins for constraints and fitness calculators; the LeaseProvider and TaskRequestProvider classes in the test package are useful for this, instead of requiring actual agent hosts.
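A scale experiment along these lines can be sketched without Fenzo's test helpers at all. The following self-contained Java program is only an illustration of the approach: the `Vm`/`Task` classes and the first-fit placement loop are invented stand-ins, not Fenzo's actual LeaseProvider/TaskRequestProvider API or its scheduling algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative scale harness: build many fake agents and tasks, run a
// scheduling pass, then read approximate heap usage. Not Fenzo's API.
public class ScaleSketch {
    static final class Vm { final int cores; int used; Vm(int c) { cores = c; } }
    static final class Task { final int cpus; Task(int c) { cpus = c; } }

    // Greedy first-fit placement, a rough proxy for one scheduling pass.
    static int assign(List<Vm> vms, List<Task> tasks) {
        int placed = 0;
        for (Task t : tasks) {
            for (Vm vm : vms) {
                if (vm.cores - vm.used >= t.cpus) {
                    vm.used += t.cpus;
                    placed++;
                    break;
                }
            }
        }
        return placed;
    }

    public static void main(String[] args) {
        List<Vm> vms = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) vms.add(new Vm(16)); // 160K cores total
        List<Task> tasks = new ArrayList<>();
        int[] sizes = {1, 8, 12}; // mix of 1-, 8-, and 12-CPU tasks
        for (int i = 0; i < 45_000; i++) tasks.add(new Task(sizes[i % 3]));

        int placed = assign(vms, tasks);
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        System.out.println("placed=" + placed + " approxHeapMb=" + usedMb);
    }
}
```

Swapping the fake `assign` for real TaskScheduler calls (fed by the test package's lease and task providers) would give numbers representative of Fenzo itself.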

huntc commented 9 years ago

Thanks for the reply @spodila.

What are your thoughts toward resiliency? For example, if your process containing the TaskScheduler dies then what action do you take?

Thanks again for the dialogue.

(closing as you've addressed my primary question)

spodila commented 9 years ago

Upon start of the process containing the TaskScheduler, we initialize Fenzo with the entire state by calling TaskScheduler.getTaskAssigner().call(...) method for each task that is already known to be running. Specifically, since we run multiple instances of our framework with ZooKeeper based leader election, we perform this initialization upon being elected as the leader.
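The replay-on-leadership pattern described above can be modeled in a few lines. This is a self-contained sketch under stated assumptions: `MiniScheduler`, `RunningTask`, and `onElectedLeader` are invented names, and the `BiConsumer` stands in for the real `TaskScheduler.getTaskAssigner().call(...)` invocation per known-running task.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Illustrative model of rebuilding scheduler state after leader election.
public class LeaderReplay {
    static final class MiniScheduler {
        final Map<String, String> taskToHost = new HashMap<>();
        // Analogue of getTaskAssigner(): marks a task as running on a host.
        BiConsumer<String, String> getTaskAssigner() {
            return (taskId, hostname) -> taskToHost.put(taskId, hostname);
        }
    }

    static final class RunningTask {
        final String taskId, hostname;
        RunningTask(String t, String h) { taskId = t; hostname = h; }
    }

    // Invoked once ZooKeeper-style leader election picks this instance:
    // replay every durably-known running task into the fresh scheduler.
    static MiniScheduler onElectedLeader(List<RunningTask> knownRunning) {
        MiniScheduler scheduler = new MiniScheduler();
        BiConsumer<String, String> assigner = scheduler.getTaskAssigner();
        for (RunningTask t : knownRunning) {
            assigner.accept(t.taskId, t.hostname);
        }
        return scheduler; // now safe to begin normal scheduling iterations
    }

    public static void main(String[] args) {
        MiniScheduler s = onElectedLeader(List.of(
                new RunningTask("task-1", "agent-a"),
                new RunningTask("task-2", "agent-b")));
        System.out.println(s.taskToHost.size()); // prints 2
    }
}
```

The key design point is that the replay happens strictly before the new leader starts accepting offers, so Fenzo's internal view of cluster capacity is complete when the first real scheduling iteration runs.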

This does bring up a concern about startup latency with a large number of running tasks. However, we haven't reached the point where that is the next big problem to solve. If it does concern you, I'd love to hear your thoughts and/or exchange ideas on solving it.

huntc commented 9 years ago

Perhaps a pluggable mechanism that (a) requests state from other scheduler instances, returning a CompletionStage (akin to Scala's Future), and (b) provides a callback so that the scheduler can be notified of new state asynchronously.

This would then permit multiple schedulers to work together. We could back both the request and the callback with CRDTs, for example.

I think you should also consider reentrancy in your API to support such a mechanism, and concurrency in general.
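The two hooks sketched in (a) and (b) could look something like the following. This is purely hypothetical, not part of Fenzo; the interface and class names are invented, and the in-memory peer merely stands in for whatever CRDT-backed replication layer would sit behind it.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical plug-in points for sharing task->host state across schedulers.
public class SharedState {
    /** (a) Pull: request current assignments from peer scheduler instances. */
    interface StateProvider {
        CompletionStage<Map<String, String>> requestState();
    }

    /** (b) Push: register a callback fired when peers publish new state. */
    interface StateListener {
        void onStateUpdate(Consumer<Map<String, String>> callback);
    }

    // Trivial single-process implementation of both hooks; a real one
    // might back the map with a replicated CRDT across instances.
    static final class InMemoryPeer implements StateProvider, StateListener {
        private final Map<String, String> state = new ConcurrentHashMap<>();
        private volatile Consumer<Map<String, String>> callback = m -> {};

        @Override public CompletionStage<Map<String, String>> requestState() {
            return CompletableFuture.completedFuture(Map.copyOf(state));
        }
        @Override public void onStateUpdate(Consumer<Map<String, String>> cb) {
            callback = cb;
        }
        void publish(String taskId, String host) {
            state.put(taskId, host);
            callback.accept(Map.copyOf(state)); // async notification to listener
        }
    }

    public static void main(String[] args) throws Exception {
        InMemoryPeer peer = new InMemoryPeer();
        Map<String, String> seen = new ConcurrentHashMap<>();
        peer.onStateUpdate(seen::putAll);        // (b) callback registration
        peer.publish("task-1", "agent-a");
        Map<String, String> snapshot =
                peer.requestState().toCompletableFuture().get(); // (a) pull
        System.out.println(snapshot + " " + seen);
    }
}
```

With hooks like these, a newly started (or newly elected) scheduler could pull a snapshot via `requestState()` instead of replaying every task serially, and stay current through the callback thereafter.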