sebgl commented 5 years ago

Local persistent volume?

Concepts

Local persistent volume

See the official doc: https://kubernetes.io/docs/concepts/storage/volumes/#local The idea behind persistent volumes is to create:

a persistent volume resource, bound to a specific node with a selector, that has a particular capacity
a persistent volume claim to use that resource, that a pod spec will reference as volume

The problem here is we need to create these "PersistentVolume", but we don't know in advance what are going to be the pod's storage requirements. That's why some persistent volumes have "dynamic" provisioner, in the sense that the PV will be created automatically by a controller to match a given PVC. There is no dynamic provisioner yet for local persistent volumes.

It is marked as beta in Kubernetes v1.10.

CSI

Stands for Container Storage Interface (similar to CNI: Container Networking Interface). A project to standardize the way vendors implement their k8s storage API.

Spec: https://github.com/container-storage-interface/spec K8S doc: https://kubernetes-csi.github.io/docs/ A short read on how it works: https://medium.com/google-cloud/understanding-the-container-storage-interface-csi-ddbeb966a3b A more complete read on how to implement a CSI: https://arslan.io/2018/06/21/how-to-write-a-container-storage-interface-csi-plugin/ CSI community sync agenda: https://docs.google.com/document/d/1-oiNg5V_GtS_JBAEViVBhZ3BYVFlbSz70hreyaD7c5Y/edit#heading=h.h3flg2md1zg

Components:

K8S-internal (not vendor specific, maintained by K8S team):

driver-registrar: Sidecar container that registers the CSI driver with kubelet, and 2) adds the drivers custom NodeId to a label on the Kubernetes Node API Object.
external-provisioner: Sidecar container that watches Kubernetes PersistentVolumeClaim objects and triggers CreateVolume/DeleteVolume against a driver endpoint.
external-attacher: Sidecar container that watches Kubernetes VolumeAttachment objects and triggers ControllerPublish/Unpublish against a driver endpoint

CSI driver - vendor specific (all these should implement the gRPC CSI standard interface), composed of 3 components:

identity: gRPC server identifying the plugin service, making sure it’s healthy, and returning basic information about the plugin itself. Can be implemented as part of the Node or Controller plugins.
Node plugin: deployed on each node, gRPC server responsible for mounting volumes. Usually a daemonset.
controller: usually deployed only once. gRPC server responsible for controlling and managing all volumes in the cluster.

FlexVolumes

FlexVolumes can be seen as the old, unclean version of CSI. It also allows vendors to write their own storage plugins. The plugin driver needs to be installed to a specific path on each node (/usr/libexec/kubernetes/kubelet-plugins/volume/exec/). It's basically a binary exec file that needs to support a few subcommands (init, attach, detach, mount, unmount etc.).

Our need

We need dynamically provisioned local persistent volumes. Which means a controller should take care of mapping existing PVC to a new PV of the expected size. Also, we expect the size to act as a quota: when reached, the user should not be able to write on disk anymore. This is a strong requirement: for instance using ext4 behind our persistent volumes would probably not guarantee this.

Interesting resources

1. kubernetes-incubator local-volume static provisioner

Links: https://github.com/kubernetes-incubator/external-storage/tree/master/local-volume, https://github.com/kubernetes-incubator/external-storage/tree/master/local-volume/provisioner

A static local volume provisioner, running as a daemon set on all nodes of the cluster. It monitors mount points on the system, and maps it to the creation of a PV of the corresponding size. Mount points are discovered in the configured discovery dir (eg. /mnt/disks). To work with directory-based volumes instead of device-based volumes, we can simply symlink dirs to the directory we want into the discovery dir. It does not handle any quota; but the backing FS could (eg. XFS or LVM). Code is open source, quite small and simple to understand.

Dynamic provisioner WIP

Dynamic provisioning seems to be WIP according to this issue. There is a PR open for design proposal and a PR open for implementation. Based on the design doc:

LVM seems to be the target filesystem to support first
The driver running on all nodes should be a CSI provisioner

Based on this comment, the dynamic CSI provisioner is still at the level of "internal discussions".

lichuqiang seems to be pretty involved in that. Interestingly, he created a Github repo for a CSI driver which is mostly based on mesosphere's csilvm.

Overall, this looks very promising and close to what we need. Patience is required :)

2. Mesosphere csilvm

Link: https://github.com/mesosphere/csilvm

A CSI for LVM2. It lives as a single csilvm binary that implements both Node and Controller plugins, to be installed on every node. The names of the volume group (VG) and the physical volumes (PVs) it consists of are passed to the plugin at launch time as command-line parameters.

It is originally intended to work on Mesos, not on Kubernetes. But the CSI standard is supposed to work for both.

The code is quite clean and easy to understand :)

This issue contains some interesting comments (from july) on how the project does not exactly comply with the expected k8s interface.

I could not find any reference of someone using it as a daemonset on a k8s cluster.

3. wavezhang/k8s-csi-lvm

Link: https://github.com/wavezhang/k8s-csi-lvm

Seems a bit less clean than csilvm, but explicitely targets Kubernetes. Not much doc, and only a few commits, but the code looks quite good. Based on the code and examples, can be deployed as a DaemonSet (along with the required kubernetes CSI stuff and apiserver configuration). It relies on having lvmd installed on he host, with a LVM volume group pre-created. See this bash script, supposed to be run on each node.

Potential solutions for dynamic provisioning

Best solution seems to be:

A CSI implementation for dynamic local storage
- Runs as a daemon set on all nodes (or a subset of labeled nodes)
- With access to the host disks (devices or partitions)
- Probably needs to be privileged (depending on the FS choice)
- Dynamically provision a PV based on pods PVC
It needs to "discover" local disks
- Can we assume it can use whatever local disk is attached to the node? I think not, since they might be used already for other pods
- Most solutions (including the official static local volume provisioner) seem to rely on a "discovery directory", where all disks or partitions should be mounted (or symlinked) to be discovered. Eg. /mnt/disks
- Another option could be to label the node resource with paths to the disks that can be used as local volumes. This does not seem to be very standard though.

Potential solutions for quota-aware filesystems

Best solution seems to rely on LVM + XFS:

Create one volume group from the available physical disks or partitions
Create one logical volume per pod, with the requested size: the volume size acts as the quota
Use thin volume if we need to overcommit on disk space
Then we need to choose over XFS or ext4 filesystems for the logical volumes:
- XFS is supposed to have better performance, but cannot be shrinked without being unmounted first
- ext4 does support hot grow/shrink

XFS

XFS is an I/O optimized file system (compared to eg. ext4). It supports quotas per directory (xfs_quota command), allowing us to create a directory per user, and associate a quota to it. This is the solution we use on Elastic Cloud.

LVM

LVM allows to gather multiple disks or partitions to form a logical Volume Group (VG) (vgcreate command), where physical disks are abstracted. In this volume group, multiple Logical Volumes (LV) can be created with the chosen size (pvcreate command), and formatted with any FS we want (mkfs command). A logical volume may span over multiple physical disks.

LVM thin provisioning allows to create a thin pool on which we can allocate multiple thin volumes with a given size. That size will appear as the volume size, but the occupied underlying disk space will not be reserved for the volume. For instance, we could have a 10GB thin pool with 3x5GB thin volumes. Each volume would see 5GB, and everything is fine as long as the entire underlying 10GB is not fully occupied. It allows us to overcommit on disk space.

The "quota" in LVM would simply be the size of the created logical volume.

Volume group and logical volumes can be resized without unmounting them. However the logical volumes FS also needs to support that. Ext4 does support hot resize (grow and shrink), but XFS does only support hot grow (not shrink).

I/Os can be limited through Linux cgroups per logical volume (see https://serverfault.com/questions/563129/i-o-priority-per-lvm-volume-cgroups).

sebgl commented 5 years ago

Discussing it on zoom with @nkvoll and @pebrc:

CSI does not work that well for us since it does not really map a persistent volume to a host. It works well for the case of "volume can actually be reached from any node". Otherwise, the controller plugin needs to take each individual node capacity into consideration and make sure to adapt affinity settings.

Works seems to be ongoing (cf. the design proposal) to more closely map local volumes attached to a particular node, dynamic provisioning, and remaining capacity considerations. Probably not ready for at least another year though; but that would be the appropriate "long-term" solution.

Meanwhile, a FlexVolume implementation seems to be the easier workaround for a short-term solution. The design is not as great as CSI, but simpler to deploy and implement.

The decision here for the short-term is to implement a FlexVolume plugin that handles LVM volumes provisioning. Having it running as a daemonset can also enable garbage collection on local volumes which are not referenced anymore.

It's harder to consider storage capacity though. A workaround that was also adopted on Cloud is to only consider RAM usage for pods allocation, while considering there is a global FS multiplier that we can apply from RAM to storage space. The goal here is that the cluster operator would run out of RAM before running out of disk on a given node. This is to avoid hacky workarounds around persistent volume definitions vs. actual volume size that could still be implemented :-)

sebgl commented 5 years ago

I'm closing this since we implemented our own dynamic local-volume provisioner based on a flex driver.

elastic / cloud-on-k8s

Explore local persistent volumes feasibility #108

Local persistent volume?

Concepts