elastic/cloud-on-k8s (Elastic Cloud on Kubernetes)

Explore local persistent volumes feasibility #108

Closed: sebgl closed this issue 5 years ago

sebgl commented 5 years ago

Local persistent volume?

Concepts

Local persistent volume

See the official doc: https://kubernetes.io/docs/concepts/storage/volumes/#local. The idea behind local persistent volumes is to create PersistentVolume resources that map to local storage on a specific node (a disk, a partition or a directory), together with node affinity so that pods using them get scheduled onto that node.

The problem here is that we need to create these PersistentVolume resources ourselves, but we don't know the pods' storage requirements in advance. That's why some volume types have a "dynamic" provisioner: a controller automatically creates a PV to match a given PVC. There is no dynamic provisioner for local persistent volumes yet.

Local persistent volumes are marked as beta in Kubernetes v1.10.
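To make the static flavour concrete, here is a minimal sketch of a local PersistentVolume and its StorageClass; the names, size, path and node hostname are placeholders, not values used by this project:

```bash
# Illustrative only: a statically declared local PV pinned to one node.
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning for local volumes
volumeBindingMode: WaitForFirstConsumer     # bind only once a pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/vol1                   # pre-existing mount point on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]
EOF
```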

CSI

Stands for Container Storage Interface (similar to CNI: Container Networking Interface). A project to standardize the way vendors implement storage plugins for container orchestrators, Kubernetes included.

Spec: https://github.com/container-storage-interface/spec
K8S doc: https://kubernetes-csi.github.io/docs/
A short read on how it works: https://medium.com/google-cloud/understanding-the-container-storage-interface-csi-ddbeb966a3b
A more complete read on how to implement a CSI plugin: https://arslan.io/2018/06/21/how-to-write-a-container-storage-interface-csi-plugin/
CSI community sync agenda: https://docs.google.com/document/d/1-oiNg5V_GtS_JBAEViVBhZ3BYVFlbSz70hreyaD7c5Y/edit#heading=h.h3flg2md1zg

Components:

K8S-internal (not vendor specific, maintained by the K8S team): the glue between Kubernetes and any CSI driver, i.e. sidecar containers such as the external-provisioner, external-attacher and driver-registrar, plus the kubelet-side plumbing.

CSI driver - vendor specific (all of these should implement the gRPC CSI standard interface), composed of 3 components: the Identity, Controller and Node gRPC services.

FlexVolumes

FlexVolumes can be seen as the older, less clean predecessor of CSI. It also allows vendors to write their own storage plugins. The plugin driver needs to be installed at a specific path on each node (/usr/libexec/kubernetes/kubelet-plugins/volume/exec/). It is basically an executable that needs to support a few subcommands (init, attach, detach, mount, unmount, etc.).
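For a rough idea of the interface, here is a minimal sketch of such a driver executable; the "example~lvm" vendor directory and the (empty) volume handling are placeholders, not an actual driver:

```bash
#!/usr/bin/env bash
# Illustrative FlexVolume driver skeleton, installed as e.g.
# /usr/libexec/kubernetes/kubelet-plugins/volume/exec/example~lvm/lvm
# The kubelet calls it with a subcommand and JSON-encoded options.
op=$1

case "$op" in
  init)
    # Local volumes need no separate attach/detach step.
    echo '{"status": "Success", "capabilities": {"attach": false}}'
    ;;
  mount)
    target_dir=$2   # mount path chosen by the kubelet
    json_opts=$3    # options taken from the volume spec
    # ... locate or create the backing storage, then mount it on $target_dir ...
    mkdir -p "$target_dir" \
      && echo '{"status": "Success"}' \
      || echo '{"status": "Failure", "message": "mount failed"}'
    ;;
  unmount)
    target_dir=$2
    umount "$target_dir" \
      && echo '{"status": "Success"}' \
      || echo '{"status": "Failure", "message": "unmount failed"}'
    ;;
  *)
    echo '{"status": "Not supported"}'
    ;;
esac
```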

Our need

We need dynamically provisioned local persistent volumes, which means a controller should take care of mapping an existing PVC to a new PV of the expected size. We also expect the size to act as a quota: once it is reached, the user should not be able to write to disk anymore. This is a strong requirement: simply putting ext4 behind our persistent volumes, for instance, would probably not guarantee it.

Interesting resources

1. kubernetes-incubator local-volume static provisioner

Links: https://github.com/kubernetes-incubator/external-storage/tree/master/local-volume, https://github.com/kubernetes-incubator/external-storage/tree/master/local-volume/provisioner

A static local volume provisioner, running as a daemonset on all nodes of the cluster. It monitors mount points on the system and maps each of them to the creation of a PV of the corresponding size. Mount points are discovered in the configured discovery dir (e.g. /mnt/disks). To work with directory-based volumes instead of device-based volumes, we can simply symlink the directories we want into the discovery dir (see the sketch below). It does not handle any quota, but the backing FS could (e.g. XFS or LVM). The code is open source, quite small and simple to understand.
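A minimal sketch of the directory-based approach, with illustrative paths:

```bash
# Expose a plain directory to the static provisioner by linking it into
# the discovery dir (default /mnt/disks); paths are illustrative.
mkdir -p /data/vol1
ln -s /data/vol1 /mnt/disks/vol1
# The provisioner daemonset discovers /mnt/disks/vol1 and creates a matching PV.
```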

Dynamic provisioner WIP

Dynamic provisioning seems to be WIP according to this issue. There is a PR open for design proposal and a PR open for implementation. Based on the design doc:

Based on this comment, the dynamic CSI provisioner is still at the level of "internal discussions".

lichuqiang seems to be pretty involved in that. Interestingly, he created a Github repo for a CSI driver which is mostly based on mesosphere's csilvm.

Overall, this looks very promising and close to what we need. Patience is required :)

2. Mesosphere csilvm

Link: https://github.com/mesosphere/csilvm

A CSI for LVM2. It lives as a single csilvm binary that implements both Node and Controller plugins, to be installed on every node. The names of the volume group (VG) and the physical volumes (PVs) it consists of are passed to the plugin at launch time as command-line parameters.

It is originally intended to work on Mesos, not on Kubernetes. But the CSI standard is supposed to work for both.

The code is quite clean and easy to understand :)

This issue contains some interesting comments (from July) on how the project does not exactly comply with the expected k8s interface.

I could not find any reference of someone using it as a daemonset on a k8s cluster.

3. wavezhang/k8s-csi-lvm

Link: https://github.com/wavezhang/k8s-csi-lvm

Seems a bit less clean than csilvm, but it explicitly targets Kubernetes. Not much doc and only a few commits, but the code looks quite good. Based on the code and examples, it can be deployed as a DaemonSet (along with the required Kubernetes CSI components and apiserver configuration). It relies on having lvmd installed on the host, with an LVM volume group pre-created. See this bash script, which is supposed to be run on each node.

4. akrog/ember-csi

Link: https://github.com/akrog/ember-csi

5. scality/metalk8s LVM usage

Link: https://github.com/scality/metalk8s

6. Dynamic provisioner FlexVolume implementation

Link: https://github.com/monostream/k8s-localflex-provisioner

Potential solutions for dynamic provisioning

Best solution seems to be:

Potential solutions for quota-aware filesystems

Best solution seems to rely on LVM + XFS:

XFS

XFS is an I/O optimized file system (compared to e.g. ext4). It supports per-directory quotas through project quotas (xfs_quota command), allowing us to create a directory per user and associate a quota with it. This is the solution we use on Elastic Cloud.
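A minimal sketch of a per-directory project quota; the project id, name, path and limit are illustrative, and the filesystem is assumed to be mounted with the prjquota option:

```bash
# Assumes /data is an XFS filesystem mounted with -o prjquota.
mkdir -p /data/user1

# Declare project 42 rooted at /data/user1, then cap it at 5GB.
echo "42:/data/user1" >> /etc/projects
echo "user1:42"       >> /etc/projid
xfs_quota -x -c 'project -s user1' /data
xfs_quota -x -c 'limit -p bhard=5g user1' /data
# Writes into /data/user1 beyond 5GB now fail (disk quota exceeded).
```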

LVM


LVM allows gathering multiple disks or partitions (initialized as physical volumes with the pvcreate command) into a logical Volume Group (VG) (vgcreate command), where the physical disks are abstracted away. Within this volume group, multiple Logical Volumes (LV) can be created with a chosen size (lvcreate command) and formatted with any FS we want (mkfs command). A logical volume may span multiple physical disks.
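Putting LVM and XFS together, a sketch of one fixed-size volume per user; device names, VG/LV names and sizes are illustrative:

```bash
pvcreate /dev/sdb /dev/sdc            # initialize the physical volumes
vgcreate vg_data /dev/sdb /dev/sdc    # group them into one volume group
lvcreate -n user1 -L 5G vg_data       # one logical volume per claim, sized as the quota
mkfs.xfs /dev/vg_data/user1           # format it with the FS of our choice
mkdir -p /mnt/user1
mount /dev/vg_data/user1 /mnt/user1   # the 5G size is now a hard capacity limit
```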

LVM thin provisioning allows creating a thin pool, on which we can allocate multiple thin volumes of a given size. That size appears as the volume size, but the underlying disk space is not actually reserved for the volume. For instance, we could have a 10GB thin pool with 3x5GB thin volumes: each volume would see 5GB, and everything is fine as long as the underlying 10GB is not fully occupied. This allows us to overcommit on disk space.
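A sketch of that 10GB pool / 3x5GB thin volumes example (names are illustrative, reusing the vg_data volume group from above):

```bash
lvcreate -L 10G --thinpool thinpool vg_data        # the pool backs all thin volumes
lvcreate -V 5G --thin -n user1 vg_data/thinpool    # each thin volume advertises 5G...
lvcreate -V 5G --thin -n user2 vg_data/thinpool
lvcreate -V 5G --thin -n user3 vg_data/thinpool
# ...but blocks are only taken from the pool as data is written (overcommit).
```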

The "quota" in LVM would simply be the size of the created logical volume.

Volume groups and logical volumes can be resized without unmounting them; however, the logical volume's FS also needs to support that. Ext4 supports online grow (shrinking requires unmounting), and XFS supports online grow but cannot be shrunk at all.
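For example, growing a mounted LVM + XFS volume in place (names are illustrative):

```bash
lvextend -L +5G /dev/vg_data/user1   # grow the logical volume by 5GB
xfs_growfs /mnt/user1                # grow the XFS filesystem while it is mounted
```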

I/O can be limited per logical volume through Linux cgroups (see https://serverfault.com/questions/563129/i-o-priority-per-lvm-volume-cgroups).
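A rough sketch of what that could look like with cgroup v1 blkio throttling; the cgroup name, device and limit are illustrative:

```bash
lv_dev=$(readlink -f /dev/vg_data/user1)             # resolve the underlying dm-* device
maj_min=$(lsblk -dno MAJ:MIN "$lv_dev" | tr -d ' ')  # e.g. "253:3"
mkdir -p /sys/fs/cgroup/blkio/user1
echo "$maj_min 10485760" > /sys/fs/cgroup/blkio/user1/blkio.throttle.read_bps_device  # 10MB/s
# Processes added to this cgroup (via its cgroup.procs file) are throttled on that LV.
```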

sebgl commented 5 years ago

Discussed this on Zoom with @nkvoll and @pebrc:

CSI does not work that well for us, since it does not really map a persistent volume to a host. It works well for the case where the volume can actually be reached from any node. Otherwise, the controller plugin needs to take each individual node's capacity into consideration and adapt affinity settings accordingly.

Work seems to be ongoing (cf. the design proposal) to more closely map local volumes to a particular node, support dynamic provisioning, and take remaining capacity into account. It is probably not ready for at least another year, but it would be the appropriate long-term solution.

Meanwhile, a FlexVolume implementation seems to be the easier workaround for a short-term solution. The design is not as clean as CSI, but it is simpler to deploy and implement.

The short-term decision is to implement a FlexVolume plugin that handles LVM volume provisioning. Running it as a daemonset can also enable garbage collection of local volumes that are no longer referenced. A rough sketch of the idea follows.
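This is only a sketch of what the mount path of such a plugin could do on top of LVM; the volume group name, option keys and jq-based parsing are assumptions, not the actual implementation:

```bash
# Hypothetical mount handler of an LVM-backed FlexVolume driver.
# Called by the kubelet as: <driver> mount <target dir> <json options>
flex_mount() {
  target_dir=$1
  json_opts=$2
  # Assumed option keys; a real driver would define its own.
  lv_name=$(echo "$json_opts" | jq -r '.volumeName')
  lv_size=$(echo "$json_opts" | jq -r '.size')       # e.g. "5G", acts as the quota

  # Create the logical volume on first use, format it with XFS, then mount it.
  if ! lvs "vg_data/$lv_name" >/dev/null 2>&1; then
    lvcreate -n "$lv_name" -L "$lv_size" vg_data
    mkfs.xfs "/dev/vg_data/$lv_name"
  fi
  mkdir -p "$target_dir"
  mount "/dev/vg_data/$lv_name" "$target_dir"
  echo '{"status": "Success"}'
}

# Garbage collection on unmount could later do:
#   umount <target dir> && lvremove -y vg_data/<lv name>
```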

It's harder to take storage capacity into account though. A workaround that was also adopted on Cloud is to only consider RAM usage for pod allocation, while assuming a global multiplier from RAM to storage space. The goal is that the cluster operator runs out of RAM before running out of disk on a given node. This avoids hacky workarounds around persistent volume definitions vs. actual volume size that could still be implemented :-)
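For illustration (the numbers are hypothetical, not the actual Cloud settings): with a RAM-to-storage multiplier of 20, a pod requesting 4GB of RAM would implicitly budget 80GB of local disk, so a node sized with 64GB of RAM should come with at least 1.28TB of local storage to stay RAM-bound.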

sebgl commented 5 years ago

I'm closing this since we implemented our own dynamic local-volume provisioner based on a flex driver.