d2iq-archive / kubernetes-mesos

A Kubernetes Framework for Apache Mesos

loss of data in k8s persistent volumes when used in k8sm setup #798

Open ravilr opened 8 years ago

ravilr commented 8 years ago

@jdef The k8s kubelet sets up volume mounts in a directory configured by the --root-dir flag and bind-mounts them onto Docker containers. The kubelet also runs Kubelet.cleanupOrphanedVolumes() in its sync loop to clean up/unmount any volume mounts left on the kubelet host by killed/finished pods.
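For context, here is a minimal sketch (Linux-only, not the actual kubelet code; the root-dir value, layout helper, and active-pod bookkeeping are assumptions) of what that cleanup conceptually does: walk the per-pod volume directories under --root-dir and unmount any that no longer belong to a running pod.

```go
// Minimal sketch, not the actual kubelet implementation: scan
// <root-dir>/pods/<pod-uid>/volumes/<plugin>/<name> and unmount anything
// that no longer belongs to a live pod. Linux-only.
package main

import (
	"fmt"
	"path/filepath"
	"syscall"
)

func cleanupOrphanedVolumes(rootDir string, activePodUIDs map[string]bool) error {
	dirs, err := filepath.Glob(filepath.Join(rootDir, "pods", "*", "volumes", "*", "*"))
	if err != nil {
		return err
	}
	for _, d := range dirs {
		// .../pods/<pod-uid>/volumes/<plugin>/<name> -> <pod-uid>
		podUID := filepath.Base(filepath.Dir(filepath.Dir(filepath.Dir(d))))
		if activePodUIDs[podUID] {
			continue // volume still belongs to a running pod
		}
		fmt.Println("unmounting orphaned volume:", d)
		if err := syscall.Unmount(d, 0); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// Hypothetical values for illustration only.
	_ = cleanupOrphanedVolumes("/var/lib/kubelet", map[string]bool{"live-pod-uid": true})
}
```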

In the k8sm case, the kubelet's RootDirectory is set to the Mesos executor's sandbox dir. This is overridable via the scheduler's --kubelet-root-dir flag, but that doesn't work because the executor uses the same dir to set up its static-pod-config dir, and the executor fails to come up with an error if it finds an already existing static-pods dir.

The issue we are seeing with the kubelet executor using the sandbox dir itself as the kubelet root-dir: whenever the executor ID (or slave ID) changes due to an executor restart, slave restart, or framework upgrade, the kubelet never gets a chance to properly clean up the orphaned volume mounts on the host. In the case of persistent volumes, the kubelet's pod volume dir is still pointing at a mounted filesystem. Then the Mesos slave's gc_delay setting kicks in and cleans up the old executors' sandbox dirs, which rm's the persistent volume dirs. The end result: all data backed by the persistent volumes is gone.
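To make the failure concrete, here is a hypothetical layout (all IDs and the work_dir prefix below are made up) showing how the NFS mount ends up nested inside the sandbox that the agent later garbage-collects:

```go
// Hypothetical layout when the kubelet's RootDirectory is the executor
// sandbox. Once the executor is gone and gc_delay expires, the agent
// recursively removes the old sandbox, including the still-mounted NFS
// volume directory nested inside it.
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	sandbox := filepath.Join(
		"/var/lib/mesos/slaves", "example-slave-id",
		"frameworks", "example-framework-id",
		"executors", "example-executor-id",
		"runs", "example-container-id",
	)
	// root-dir == sandbox, so the persistent volume mount lands inside the sandbox:
	pvMount := filepath.Join(sandbox, "pods", "example-pod-uid",
		"volumes", "kubernetes.io~nfs", "example-pv")
	fmt.Println(pvMount) // still an active NFS mount when the sandbox GC deletes this tree
}
```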

I think the static-pods dir should use the Mesos sandbox dir instead of kubelet.RootDirectory; then one could set --kubelet-root-dir to a static path on the slave host. There is still no guarantee that a slave gets assigned a kubelet executor task again, which means the kubelet volume dirs might be left mounted forever, but at least they won't be deleted inadvertently by the mesos-slave GC.
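With that change, one could then point the scheduler's existing --kubelet-root-dir flag at a stable host path outside any sandbox; the value below is only an example:

--kubelet-root-dir=/var/lib/kubelet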

We are experiencing this in our k8sm cluster, which uses NFS-backed persistent volumes.

jdef commented 8 years ago

First thoughts about reasonable defaults (because executors really shouldn't write anything outside of their container):

rootDir={sandbox}/root
staticPods={sandbox}/static

And then if you want to override rootDir to point to some location on the host, outside of the sandbox, you could do that. Although I'm not convinced that's a great idea, I can certainly sympathize with the data loss! Running an executor this way (with rootDir outside the sandbox) is prone to mount resource leaks, as you've pointed out, among other problems: there may be old pod directories that are never cleaned up (and those may also contain mounts). How would we ever, responsibly, GC these? On executor startup (which might not happen for a while, depending on offers)?
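One possible answer, sketched below under the assumption that a GC pass runs over old root directories (helper names and paths are made up): check whether each leftover pod volume directory is still a mount point and unmount it before removing anything, so the removal never reaches the data behind the mount.

```go
// Hypothetical GC helper, sketch only: detect whether a leftover directory is
// still a mount point (by comparing device IDs with its parent; this misses
// same-filesystem bind mounts) and unmount it before removing, so removal
// never reaches the data behind an NFS mount. Linux-only.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

func isMountPoint(dir string) (bool, error) {
	var self, parent syscall.Stat_t
	if err := syscall.Stat(dir, &self); err != nil {
		return false, err
	}
	if err := syscall.Stat(filepath.Dir(dir), &parent); err != nil {
		return false, err
	}
	return self.Dev != parent.Dev, nil
}

func gcVolumeDir(dir string) error {
	mounted, err := isMountPoint(dir)
	if err != nil {
		return err
	}
	if mounted {
		// Unmount first: after this, RemoveAll only touches the empty mount point.
		if err := syscall.Unmount(dir, 0); err != nil {
			return err
		}
	}
	return os.RemoveAll(dir)
}

func main() {
	// Hypothetical leftover path for illustration.
	fmt.Println(gcVolumeDir("/var/lib/kubelet/pods/old-pod-uid/volumes/kubernetes.io~nfs/example-pv"))
}
```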

We're actually thinking through a related problem right now with respect to https://issues.apache.org/jira/browse/MESOS-5013: what's the best way to GC external volume mount points in a way that's compatible with Mesos slave recovery?

A better solution might come in the form of a custom k8sm runtime implementation for kubelet that allows the kubelet to properly contain the pod containers it launches: the kubelet could run in its own mount namespace, and pods would be realized in containers that inherit the requisite root-dir volume mounts from the kubelet's mountns. This is non-trivial.
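A very rough sketch of just the mount-namespace part of that idea (this is not an existing k8sm runtime; the setup below is an assumption): the kubelet-like process moves itself into a private mount namespace so the volume mounts it creates never propagate to the host and disappear with the process.

```go
// Rough sketch of the mount-namespace idea only (not an existing k8sm
// runtime). Requires CAP_SYS_ADMIN; Linux-only.
package main

import (
	"log"
	"runtime"
	"syscall"
)

func main() {
	// Mount namespaces are per OS thread, so pin this goroutine to its thread first.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Move into a new, private mount namespace.
	if err := syscall.Unshare(syscall.CLONE_NEWNS); err != nil {
		log.Fatal(err)
	}
	// Mark all mounts private so nothing mounted from here on propagates back
	// to the host mount namespace.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		log.Fatal(err)
	}
	log.Println("now in a private mount namespace; pod volume mounts stay contained")
}
```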

Another solution might be to write a custom Mesos isolator module that adds GC for volume mounts created within a kubelet-executor container. This is also non-trivial.
