Open gyliu513 opened 8 years ago
Thanks so much for sharing this! Revocable resources have definitely come up in prior conversations. A sticking point w/ respect to the current implementation in Mesos is this:
NOTE: If any resource used by a task or executor is revocable, the whole container is treated as a revocable container and can therefore be killed or throttled by the QoS Controller.
For k8sm "the whole container" means the custom executor container that hosts the kubelet-executor and kube-proxy processes (as well as the k8s-instantiated Docker containers/procs if you're running w/ cgroup reparenting). This means that even if just one pod is using revocable resources and a QoS controller decides that it wants those particular resources back, the controller will end up killing all k8sm-related procs on the slave (read: all pods die because one pod's resources needed to be revoked).
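To make that failure mode concrete, here is a minimal Go sketch of the rule as I read it. The `Resource` and `Pod` types and both function names are hypothetical simplifications for illustration; they are not the Mesos or k8sm APIs.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for a Mesos resource.
type Resource struct {
	Name      string
	Amount    float64
	Revocable bool
}

// Pod is a hypothetical pod hosted by the single shared k8sm executor container.
type Pod struct {
	Name      string
	Resources []Resource
}

// containerIsRevocable applies the quoted rule: if any resource used inside
// the container is revocable, the whole container counts as revocable.
func containerIsRevocable(pods []Pod) bool {
	for _, p := range pods {
		for _, r := range p.Resources {
			if r.Revocable {
				return true
			}
		}
	}
	return false
}

// podsKilledOnRevocation returns the pods that are lost if the QoS Controller
// kills the executor container. Because all pods share that container, the
// answer is either none of them or all of them.
func podsKilledOnRevocation(pods []Pod) []string {
	if !containerIsRevocable(pods) {
		return nil
	}
	names := make([]string, 0, len(pods))
	for _, p := range pods {
		names = append(names, p.Name)
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "web", Resources: []Resource{{Name: "cpus", Amount: 1}}},
		{Name: "batch", Resources: []Resource{{Name: "cpus", Amount: 0.5, Revocable: true}}},
	}
	fmt.Println(podsKilledOnRevocation(pods)) // prints "[web batch]"
}
```

Only `batch` uses revocable resources here, yet `web` is lost too, since both run under the same executor container.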
A related topic that's surfaced in Mesos-land is nested containerization support, so that custom executors may spawn child containers that are independently isolated via mesos containerization. This appears, at least at the surface, to have interesting implications for revocable resources.
Did you have any work-arounds in mind for dealing with the current QoS-killed-all-my-pods scenario?
Thanks @jdef, just appending some of my thoughts here: 1) The current revocable resources are a kind of "scavenge resources": the QoS Controller kills the executors that use revocable resources, and when an executor is terminated, all of its tasks are killed with it. This "scavenge resources" model does not seem to fit the k8sm use case well; we may need to enhance the Mesos QoS Controller so that it kills only the related tasks rather than whole executors. 2) The Mesos community is planning to add more kinds of revocable resources, such as allocation slack (MESOS-1607) and quota slack (MESOS-4392). Those revocable resources will trigger task/executor eviction from the allocator, and Mesos will then kill the task/executor based on some eviction policies.
For now, only the "scavenge resources" kind of revocable resource is supported. I will raise this issue in the Mesos community to see how we can move this forward. Hope this helps ;-)
I did some investigation a while ago on supporting revocable resources in Kubernetes (filed an issue at https://github.com/kubernetes/kubernetes/issues/19529). Currently, only QoS supports revocable resources, so the behaviour is simple: kill the revocable resources directly. There are several tickets in the Mesos community on the behaviour of revocable resources (MESOS-4303, MESOS-1607 and MESOS-4392).
I'd suggest referencing k8sm's case on those tickets, and holding this work (Kubernetes revocable resources) until the behaviour of revocable resources is finalised in Mesos.
What do you mean by "kill the revocable resources directly for current revocable resources"?
I mean QoS's current behaviour: there is no grace period when killing an executor/container.
Mesos is now making many enhancements, especially to the allocator, to improve resource utilisation, and revocable resources are designed for such cases. If multiple frameworks, including Kubernetes, are running on top of Mesos, the revocable resources from one framework can be used by another framework to improve resource utilisation.
I built a prototype and ran some tests here: https://github.com/jay-lau/jay-work/blob/master/k8s/mesos/revocable.diff
The idea is simple and straightforward: 1) Add a flag to enable revocable resources. 2) Add new metadata to the Pod YAML so that a Pod can specify that it wants to use revocable resources. 3) Update procurement to add some checks for revocable resources when revocable is enabled for the Pod. 4) Update the Task Info so that the task uses revocable resources before it is launched, when revocable is enabled for the Pod.
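The four steps could be sketched roughly as follows. This is a self-contained illustration, not the code from the diff: the annotation key `example.com/revocable`, the `revocableEnabled` parameter standing in for the flag, and the `Resource`, `Pod` and `TaskInfo` types are all hypothetical. It also assumes that an opted-in pod is procured only from revocable resources and every other pod only from non-revocable ones, which is one possible policy among several.

```go
package main

import (
	"errors"
	"fmt"
)

// revocableAnnotation is a hypothetical annotation key (step 2).
const revocableAnnotation = "example.com/revocable"

// Resource is a simplified stand-in for an offered Mesos resource.
type Resource struct {
	Name      string
	Amount    float64
	Revocable bool
}

// Pod carries only the fields this sketch needs.
type Pod struct {
	Name        string
	Annotations map[string]string
	CPU         float64
}

// TaskInfo is a simplified stand-in for the task description sent to Mesos.
type TaskInfo struct {
	PodName   string
	Resources []Resource
}

var errInsufficient = errors.New("offer cannot satisfy pod")

// wantsRevocable reports whether the pod opted in through its metadata (step 2).
func wantsRevocable(p Pod) bool {
	return p.Annotations[revocableAnnotation] == "true"
}

// procureCPU picks CPU from the offer (step 3) and marks the task's resources
// as revocable when the feature is enabled and the pod opted in (step 4).
// revocableEnabled models the flag from step 1.
func procureCPU(p Pod, offer []Resource, revocableEnabled bool) (TaskInfo, error) {
	useRevocable := revocableEnabled && wantsRevocable(p)
	for _, r := range offer {
		if r.Name != "cpus" || r.Revocable != useRevocable || r.Amount < p.CPU {
			continue
		}
		return TaskInfo{
			PodName:   p.Name,
			Resources: []Resource{{Name: "cpus", Amount: p.CPU, Revocable: useRevocable}},
		}, nil
	}
	return TaskInfo{}, errInsufficient
}

func main() {
	offer := []Resource{
		{Name: "cpus", Amount: 4},
		{Name: "cpus", Amount: 2, Revocable: true},
	}
	batch := Pod{
		Name:        "batch",
		Annotations: map[string]string{revocableAnnotation: "true"},
		CPU:         1,
	}
	task, err := procureCPU(batch, offer, true)
	if err != nil {
		fmt.Println("procurement failed:", err)
		return
	}
	fmt.Printf("%s gets %.1f cpus, revocable=%v\n",
		task.PodName, task.Resources[0].Amount, task.Resources[0].Revocable)
}
```

With the flag off, the same pod falls back to non-revocable CPU, so enabling the feature stays an explicit, per-cluster decision.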