appscode / k8s-addons

Kubernetes Addons by AppsCode
Apache License 2.0

Proposal: Implement stateful TPR Flock #14

Open tamalsaha opened 7 years ago

tamalsaha commented 7 years ago

Birds of a feather flock together

When we run stateful apps (apps that store data on disk) like GlusterFS or various databases, we face a choice of which Kubernetes object to use for provisioning them. Here are the requirements:

This can't be achieved in cloud providers that do not have native support for persistent storage, or for which Kubernetes has no volume controller (e.g., DigitalOcean, Linode, etc.).

Here is my proposal for meeting the above requirements in a cloud-provider-agnostic way.

StatefulSet: If the underlying cloud provider has native support for cloud disks and that support is built into Kubernetes (aws/gce/azure), then we can use StatefulSet. We can provision disks manually and bind them to claims, and we may also be able to provision them via dynamic provisioning. Moreover, StatefulSets allow using the pod name as a stable network ID. Users can also use pod placement options to ensure that pods are distributed across nodes. This allows for HA.
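Roughly, such a StatefulSet might look like the sketch below. The names, image, and storage class are illustrative, and the anti-affinity block is just one way to spread pods across nodes (it needs a Kubernetes version where pod anti-affinity is available):

```yaml
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: demo-db
spec:
  serviceName: demo-db          # headless Service that gives pods stable DNS names
  replicas: 3
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      # Spread the pods across nodes for HA.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: demo-db
            topologyKey: kubernetes.io/hostname
      containers:
      - name: db
        image: example/db:latest        # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/db
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard        # cloud disk class (aws/gce/azure)
      resources:
        requests:
          storage: 10Gi
```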

DaemonSet: Cloud providers that do not support built-in storage and/or have no native support in Kubernetes (e.g., DigitalOcean, Linode) can't use StatefulSets to run stateful apps. Stateful apps running in these clusters must use hostPath to store data or risk losing it when pods restart. StatefulSet can't dynamically provision hostPath-bound PVCs. In these cases, we could use a DaemonSet, with `hostPath` or `emptyDir` volumes. If DaemonSets run on the pod network, no stable ID is possible. If DaemonSets run on the host network, they might use the node IP. Node names are generally not routable, and node IPs are not stable either, since most of the time they are allocated via DHCP. Also, for cloud providers like DigitalOcean, the host network is shared and not safe to use without authentication.
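A minimal sketch of the DaemonSet variant, assuming illustrative names and a hostPath directory:

```yaml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: demo-db
spec:
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      containers:
      - name: db
        image: example/db:latest        # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/db
      volumes:
      - name: data
        hostPath:
          path: /data/demo-db           # data lives on the node, survives pod restarts
```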

Luckily, we can achieve something similar to StatefulSet in such providers. The underlying process is based on how named headless services work as described here: https://kubernetes.io/docs/admin/dns/ .

With these types of providers, we have to run N ReplicaSets with replicas=1. We can use a fixed hostPath. We choose N nodes and index them from 0..N-1. We apply a nodeSelector to these ReplicaSets to ensure that the ReplicaSet with index i always runs on the node with index i. Since they are on separate nodes, they can safely use the same host path. For a network ID, we set both hostname and subdomain in the PodTemplate of these ReplicaSets. This gives the pods a DNS name the same way StatefulSet pods get one. Since these pods use the pod network, it should be safe to run applications without authentication. Now we have N pods with stable names, running on different nodes, using hostPath. Voila!
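A rough sketch for index i=0 (repeated for each index), assuming we mark nodes with a hypothetical index label and use a headless Service for DNS; all names and the label key are illustrative:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: demo-db                 # headless; pods resolve as <hostname>.demo-db.<ns>.svc
spec:
  clusterIP: None
  selector:
    app: demo-db
  ports:
  - port: 3306
---
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: demo-db-0
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: demo-db
        index: "0"
    spec:
      hostname: demo-db-0       # stable DNS name, like a StatefulSet pod
      subdomain: demo-db        # must match the headless Service name
      nodeSelector:
        flock.appscode.com/index: "0"   # hypothetical node label pinning index 0 to node 0
      containers:
      - name: db
        image: example/db:latest        # placeholder image
        volumeMounts:
        - name: data
          mountPath: /var/lib/db
      volumes:
      - name: data
        hostPath:
          path: /data/demo-db   # same path on every node is safe: one pod per node
```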

To simplify the full process, we can create a new TPR called Flock. We can implement GlusterFS or KubeDB on top of this TPR. The Flock controller will be in charge of translating it into the appropriate Kubernetes object based on flags set on the controller.
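A hypothetical sketch of what registering and using the Flock TPR could look like; the group, version, and spec fields below are made up to illustrate the idea, not a finalized schema:

```yaml
apiVersion: extensions/v1beta1
kind: ThirdPartyResource
metadata:
  name: flock.appscode.com      # kind "Flock" in the "appscode.com" group
description: "Run a stateful app as a StatefulSet or as pinned single-replica ReplicaSets"
versions:
- name: v1beta1
---
apiVersion: appscode.com/v1beta1
kind: Flock
metadata:
  name: demo-db
spec:
  replicas: 3                   # must not exceed the node count in hostPath mode
  storage:
    hostPath: /data/demo-db     # used when the cluster has no cloud disk support
  # template: pod template, as in the ReplicaSet sketch above
```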

sadlil commented 7 years ago

we have to run N ReplicaSets with replicas=1. We can use a fixed hostPath. We choose N nodes and index them from 0..N-1. We apply a nodeSelector to these ReplicaSets to ensure that the ReplicaSet with index i always runs on the node with index i.

What if Flock.Spec.Replica > node count? We can't use a fixed hostPath if multiple pods run on the same node.

mirshahriar commented 7 years ago

@sadlil, we need to make sure that Flock.Spec.Replica is not greater than the total node count.

tamalsaha commented 7 years ago

Filed this proposal in Kube: https://github.com/kubernetes/community/issues/424 . At worst, they'll think I am crazy.

tamalsaha commented 7 years ago

Based on my conversation on Slack, it might be possible to implement a dynamic hostPath PV provisioner, and the StatefulSet can use that.

https://github.com/kubernetes-incubator/external-storage/tree/master/docs/demo/hostpath-provisioner
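Assuming that provisioner (or something like it) is running in the cluster, a StorageClass pointing at it might look like this (the provisioner name is illustrative); the StatefulSet's volumeClaimTemplates would then just set `storageClassName: hostpath`:

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: hostpath
provisioner: example.com/hostpath   # illustrative; must match the name the provisioner registers
```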

tamal [7:09 AM]
@deads2k on a separate note, we are exploring the idea of creating a new TPR to address the limitation of StatefulSet that it can't use hostpath. Here is the proposal: https://github.com/kubernetes/community/issues/424 . I would be glad if you can read it and give some feedback. (edited)

deads2k [7:10 AM] 
Why can't a statefulset use a hostpath in combination with an SA and a PSP?

tamal [7:11 AM] 
What is PSP?

deads2k [7:11 AM] 
@tamal podsecuritypolicy which is designed to control access to things like hostpath

tamal [7:13 AM] 
Ok. I will read that. If # of pods > 1, can stateful set guarantee that hostpath will not overlap between pods?

liggitt [7:13 AM] 
scheduling can be configured to spread across nodes on statefulset selection, right? (edited)

tamal [7:14 AM] 
But how do I guarantee that the same pod goes to the same node?

[7:14]  
Say, pod-0 always goes to node-X

tamal [7:19 AM] 
@deads2k  , I don't think PSP can do what I need.

claytonc [8:13 AM] 
@tamal that use case sounds a bit like a hostpath dynamic provisioner

tamal [8:14 AM] 
Yes. But I could not think of a way to do that as StatefulSets work today

[8:15]  
We need a way to store the pod index -> node mapping. Passing the node selector in StatefulSet. (edited)

tamal [8:34 AM] 
@claytonc  , how do you think we can write a hostpath dynamic provisioner ?

claytonc [8:39 AM] 
@tamal set the volume class on your stateful set to have a specific name

[8:40]  
then write a loop that creates a hostpath PV and binds it to PVCs asking for that storage class when created

[8:40]  
and ensure that loop creates unique values for hostpath PV

tamal [8:41 AM] 
But how do I guarantee that the same pod goes to the same node?

claytonc [8:52 AM] 
set a unique volume label on the PV

[8:52]  
that corresponds to the hostname of the node the pod goes to

tamal [9:01 AM] 
@claytonc, i see how PV - PVC - Pod can be connected using the unique label that corresponds to hostname

[9:02]  
But I am still missing how do I make sure scheduler always picks the same nodes when pod restarts

claytonc [9:36 AM] 
scheduler picks nodes for recreated pods that match the volume’s label

tamal [10:05 AM] 
Thanks @claytonc  . I need to read these details.  Dynamic PVC provisioners are compiled in Kube?

[10:05]  
or are they separate binaries?

claytonc [10:05 AM] 
they can be run anywhere - the simplest pattern might just be a bash script for loop

[10:05]  
there’s work to make provisioners easier

[10:06]  
to script

[10:06]  
not sure how far that has gotten

tamal [10:07 AM] 
Do you mind pointing me to the current ones? I can pattern it around the existing ones.

mrick [10:18 AM] 
@tamal check out the out of tree dynamic provisioners https://github.com/kubernetes-incubator/external-storage/tree/master/docs/demo/hostpath-provisioner
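For reference, the kind of PV such a provisioning loop might create could look roughly like this; the path, label value, and class name are illustrative, and the label simply records which node the path lives on, as claytonc suggests:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-node-1-0001              # unique per PV, so host paths never overlap
  labels:
    kubernetes.io/hostname: node-1  # records the node this path lives on
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: hostpath        # the class the StatefulSet's PVCs ask for
  hostPath:
    path: /data/pv-node-1-0001      # unique directory created by the loop
```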
tamalsaha commented 7 years ago

The example just runs the provisioner on one node. We have to run it as a DaemonSet on all nodes. Also, we need to know in which directory to store data, since we have to mount it inside the provisioner Docker image.

Also, we need to enable PSP: https://kubernetes.io/docs/user-guide/pod-security-policy/#controlling-volumes , so that pods can use hostPath volumes via the dynamic provisioner.
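A minimal sketch of such a PodSecurityPolicy that allows hostPath volumes (the other required fields are left permissive here just for illustration):

```yaml
apiVersion: extensions/v1beta1
kind: PodSecurityPolicy
metadata:
  name: allow-hostpath
spec:
  volumes:
  - hostPath                 # let pods mount hostPath volumes
  - persistentVolumeClaim    # and PVCs bound to hostPath PVs
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
```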