What problem does the paper solve? Is it important?
Problem:
Huge scale: 100k machines and an average of 40k tasks/sec (61k at peak)
Other constraints
User transparency
Stability and robustness
Backward compatibility with existing cluster components and scheduling policies
Importance:
Scheduling plays a central role in operating a cluster.
Although the work is designed for Alibaba's Fuxi cluster, its primary target is handling a very large cluster size, which other companies will run into sooner or later.
How does it solve the problem?
Solution: adapting and improving the Omega architecture
In the Omega architecture, a master manages the requests from all the schedulers, and each scheduler has its own scheduling policy, developed through long-term practice.
With every request, the scheduler fetches the latest state from the master, where the state is the availability of each resource.
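The fetch-then-request loop above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Master`, `Scheduler`, `first_fit`, and `most_free` are hypothetical, the state is simplified to free slots per machine, and the conflict check on commit is an assumption based on Omega's optimistic shared-state design.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical type: a policy picks a machine from a state snapshot, or None.
Policy = Callable[[Dict[str, int], int], Optional[str]]


@dataclass
class Master:
    """Holds the authoritative availability of each resource (free slots per machine)."""
    free_slots: Dict[str, int]

    def fetch_state(self) -> Dict[str, int]:
        # Hand the scheduler a copy of the latest state.
        return dict(self.free_slots)

    def commit(self, machine: str, slots: int) -> bool:
        # Accept the request only if the slots are still free; otherwise
        # another scheduler claimed them first (assumed conflict handling).
        if self.free_slots.get(machine, 0) >= slots:
            self.free_slots[machine] -= slots
            return True
        return False


@dataclass
class Scheduler:
    """A scheduler with its own policy, sharing the master with other schedulers."""
    master: Master
    policy: Policy

    def schedule(self, slots: int) -> Optional[str]:
        state = self.master.fetch_state()      # fetch the latest state
        machine = self.policy(state, slots)    # decide with this scheduler's policy
        if machine is None:
            return None
        return machine if self.master.commit(machine, slots) else None


def first_fit(state: Dict[str, int], slots: int) -> Optional[str]:
    # Pick the first machine (by name) with enough free slots.
    for machine in sorted(state):
        if state[machine] >= slots:
            return machine
    return None


def most_free(state: Dict[str, int], slots: int) -> Optional[str]:
    # Pick the machine with the most free slots, if it fits.
    candidates = [m for m in state if state[m] >= slots]
    return max(candidates, key=state.get) if candidates else None


master = Master({"m1": 2, "m2": 4})
a = Scheduler(master, first_fit)
b = Scheduler(master, most_free)
placements = [a.schedule(2), b.schedule(3), a.schedule(2)]
```

Here the two schedulers apply different policies against the same master, and the third request fails because no machine has two free slots left. In this sketch every request copies the whole state, which is cheap for two machines but hints at why the cost matters at the 100k-machine scale noted above.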
How does this work relate to other research?
Mesos: limited visibility due to its two-level architecture
Apollo, Mercury, Yaq-d: not as general as Omega and thus cannot be easily adopted in the cluster
Sparrow, Tarcil, Hawk: not general, and rely on job duration estimation (which is still far from accurate)
YARN, Mesos, Borg, Kubernetes: do not focus on low latency and scalability
What could be improved?
How to scale further, since clusters may grow even larger in the future
Others
The scheduler has been deployed in a real cluster, and the paper includes some mathematical analysis, which makes the work look more solid.
Anything interesting?
The note has been verified by one of the paper authors, Zhi Liu. Thanks! You deserve the best paper award!
Ok, well, he is actually one of my colleagues at Singularity Data. So proud of you!
https://www.usenix.org/system/files/atc21-feng-yihui.pdf