digitalocean / csi-digitalocean

A Container Storage Interface (CSI) Driver for DigitalOcean Block Storage
Apache License 2.0

Support for NVMe volumes #384

Open artem-zinnatullin opened 3 years ago

artem-zinnatullin commented 3 years ago

Hi!

We're looking for an automated way to provision PersistentVolumeClaims against locally mounted NVMe drives on DigitalOcean https://www.digitalocean.com/blog/introducing-storage-optimized-droplets-with-nvme-ssds/

We've tried the local StorageClass (https://kubernetes.io/docs/concepts/storage/storage-classes/#local). It does work, but unlike DO Block Storage in Kubernetes it is not automated at all.
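
For reference, the manual setup described above looks roughly like this: a no-provisioner StorageClass plus one hand-written PersistentVolume per disk per node (names, capacity, and paths below are illustrative):

```yaml
# StorageClass with no dynamic provisioner: every PV must be created by hand.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
# One PersistentVolume per local disk, pinned to its node via nodeAffinity.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-1
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/nvme/pv1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1
```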

We're looking into CSI implementations like https://github.com/minio/direct-csi, but the major blocker there is that it only works with additional (non-root) disks, while DigitalOcean Premium Droplets use the NVMe drive as the root (/) filesystem.

The question is: can you consider adding support for DigitalOcean NVMe drives to csi-digitalocean please? :)

Thanks!

adamwg commented 3 years ago

Hello,

We are considering adding support for dynamic provisioning of local storage volumes in DOKS, however it likely will not be implemented in this CSI driver.

The significant caveat to using node-local NVMe/SSD storage is that it is indeed node-local - we can't detach it from one node and attach it to another. This means it's really only useful for ephemeral purposes, since we expect nodes to be replaced in the course of normal cluster operations (e.g., due to health or for upgrade).

If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Thanks!

cc @bikram20

artem-zinnatullin commented 3 years ago

We are considering adding support for dynamic provisioning of local storage volumes in DOKS

That's great news!

however it likely will not be implemented in this CSI driver.

Interesting, how would it be exposed and mounted then?

The significant caveat to using node-local NVMe/SSD storage is that it is indeed node-local - we can't detach it from one node and attach it to another. This means it's really only useful for ephemeral purposes, since we expect nodes to be replaced in the course of normal cluster operations (e.g., due to health or for upgrade).

We do understand this caveat, and there are cases where it's fine: we want to run a distributed database and a distributed object store on NVMe storage, and due to performance requirements we specifically want the NVMe drives that DigitalOcean offers. In our case the applications are distributed, so a node shutting down (say, for an upgrade) is fine, since other nodes act as replicas. We achieve this with anti-affinity rules in the app deployments, so pods of these apps don't land on nodes that already run one.
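
The spread described above is typically expressed with pod anti-affinity; a sketch of the relevant pod-template fragment (the `app: my-db` label is illustrative):

```yaml
# Keep database replicas off nodes that already run one.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-db
        topologyKey: kubernetes.io/hostname
```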

If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Let's continue publicly in this issue; there are very few public discussions on this topic, so I'd like to use this thread as an opportunity to add more information about using local NVMe drives with Kubernetes to the internet :)

adamwg commented 3 years ago

We are considering adding support for dynamic provisioning of local storage volumes in DOKS

That's great news!

however it likely will not be implemented in this CSI driver.

Interesting, how it'd be exposed and mounted then?

We would add an additional StorageClass with a separate provisioner, potentially leveraging an existing project like the direct-csi driver you linked. There's nothing DO-specific about node-local storage, so no need to add it to the DO CSI driver.
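
Such a StorageClass might look something like the following; the name and provisioner are purely illustrative (this is not an announced DO product), with the provisioner string standing in for whatever local-volume driver ends up being used:

```yaml
# Hypothetical local-NVMe StorageClass with a separate provisioner.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: do-local-nvme
provisioner: local.csi.example.com  # placeholder; e.g. a direct-csi-style driver
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```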

artem-zinnatullin commented 3 years ago

Sounds good!

artem-zinnatullin commented 3 years ago

Submitted a related issue on partitioning NVMe drives for DOKS nodes: https://github.com/digitalocean/DOKS/issues/27. Basically, we can't repartition the NVMe drive right now.

kainz commented 3 years ago

This sort of provisioning is also useful for running your own database workloads on nodes when you need local NVMe performance. Yes, the storage is ephemeral, but that is something database management tools like Zalando's operator or Stolon can take into account, especially when combined with things like PodDisruptionBudgets.
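
A PodDisruptionBudget of the kind mentioned above might look like this (the name and labels are illustrative): it tells Kubernetes never to voluntarily evict below two database replicas at a time, e.g. during node drains:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-db-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-db
```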

You can implement solutions for that need today by running self-managed k8s clusters alongside a managed one, but the administration workload multiplies accordingly. Managed DOKS (as of 1.20, at least) is almost there, with the ability to run so1.5* plan node pools. If you offered a way for a node pool to upgrade in-place, an operator needing a local datastore could run entirely in managed DOKS.

In my particular use case, I have clients who need to run PostgreSQL services with custom extensions and replication patterns, which disqualifies most managed SQL offerings as well; hence my interest in closing the feature gaps in managing ephemeral storage on cloud instances/droplets.

kallisti5 commented 2 years ago

Hm. Vultr has offered NVMe by default in their managed Kubernetes solution for a while. That's a big differentiator, at no additional cost.

bikram20 commented 2 years ago

@kallisti5 What kind of workloads are you looking to run on NVMe local storage? Would you be okay with ephemeral nodes? Nodes are recycled during release upgrade.

kallisti5 commented 2 years ago

@bikram20 Overall I'm trying to find a cost-effective way to leverage the standard DO instance sizes.

Running a reliable ReadWriteMany storage model is pretty difficult on DigitalOcean. My solution was Longhorn (https://longhorn.io), since it maintains and grooms RWX replicas across the Kubernetes nodes directly, using the large amount of otherwise-wasted space on each node-pool Droplet (the 4 vCPU / 8 GiB nodes have over 100 GiB that goes unused for most people using DO's CSI), which saves costs. It also automatically backs up data to S3.

NVMe though would probably be the minimum requirement to maintain replicas within a reasonable timeframe.

DO really needs a managed storage solution that can do RWX, like Gluster or NFS.

The workload itself is 300 GiB+ of software packages for Haiku (https://haiku-os.org) plus some other infrastructure.

AlbinoDrought commented 1 month ago

For others that are interested, a potential workaround is to mount file containers. Here's an example (original source):

File Container YAML:

```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: xfs-disk-setup
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: xfs-disk-setup
  namespace: xfs-disk-setup
  labels:
    app: xfs-disk-setup
spec:
  selector:
    matchLabels:
      app: xfs-disk-setup
  template:
    metadata:
      labels:
        app: xfs-disk-setup
    spec:
      tolerations:
        - operator: Exists
      containers:
        - name: xfs-disk-setup
          image: docker.io/scylladb/local-csi-driver:latest
          imagePullPolicy: IfNotPresent
          command:
            - "/bin/bash"
            - "-euExo"
            - "pipefail"
            - "-O"
            - "inherit_errexit"
            - "-c"
            - |
              img_path="/host/var/persistent-volumes/persistent-volume.img"
              img_dir=$( dirname "${img_path}" )
              mount_path="/host/mnt/persistent-volumes"

              mkdir -p "${img_dir}"
              if [[ ! -f "${img_path}" ]]; then
                dd if=/dev/zero of="${img_path}" bs=1024 count=0 seek=10485760
              fi

              FS=$(blkid -o value -s TYPE "${img_path}" || true)
              if [[ "${FS}" != "xfs" ]]; then
                mkfs --type=xfs "${img_path}"
              fi

              mkdir -p "${mount_path}"
              remount_opt=""
              if mountpoint "${mount_path}"; then
                remount_opt="remount,"
              fi
              mount -t xfs -o "${remount_opt}prjquota" "${img_path}" "${mount_path}"

              sleep infinity
          securityContext:
            privileged: true
          volumeMounts:
            - name: hostfs
              mountPath: /host
              mountPropagation: Bidirectional
      volumes:
        - name: hostfs
          hostPath:
            path: /
```

You can then use https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner or any other local volume "provisioner" like normal.
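
With sig-storage-local-static-provisioner, the glue is a ConfigMap mapping a StorageClass to the host directory the DaemonSet above mounts into. A sketch, assuming a StorageClass named `local-storage` and the `/mnt/persistent-volumes` path from the DaemonSet (namespace and names are examples):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:
      hostDir: /mnt/persistent-volumes
      mountDir: /mnt/persistent-volumes
```

The provisioner then discovers each mount under `hostDir` and creates a local PersistentVolume for it automatically.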

The above DaemonSet creates a sparse file by default. To instead reserve the specified amount of space up front, use a syntax like `dd if=/dev/zero of="${img_path}" bs=1M count=${img_size_mb}` instead of `dd if=/dev/zero of="${img_path}" bs=1024 count=0 seek=10485760`.
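
The difference is that a sparse file has a large apparent size but almost no blocks allocated until data is written, while a preallocated file reserves the space immediately. A quick sketch on any Linux box (paths are illustrative):

```shell
img=/tmp/pv-demo.img

# Sparse: seek past 10 GiB without writing anything. Apparent size is
# 10 GiB, but almost no disk blocks are actually allocated.
dd if=/dev/zero of="$img" bs=1024 count=0 seek=10485760 2>/dev/null
stat -c 'apparent: %s bytes, allocated: %b blocks' "$img"

# Preallocated: actually write 10 GiB of zeros (slow, reserves real space):
#   dd if=/dev/zero of="$img" bs=1M count=10240

rm -f "$img"
```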

I benchmarked this on an s-2vcpu-4gb-120gb-intel Droplet with a non-sparse file container mounted. Here's the fio config:

```
[read]
direct=1
bs=8k
size=1G
time_based=1
runtime=240
ioengine=libaio
iodepth=32
end_fsync=1
log_avg_msec=1000
directory=/data
rw=read
write_bw_log=read
write_lat_log=read
write_iops_log=read
```

and here's the results:

| Storage Class | IOPS | BW |
| --- | --- | --- |
| Local File Container | (see attached graph) | (see attached graph) |
| DigitalOcean Block Storage | (see attached graph) | (see attached graph) |

The block storage benchmarks match what is currently listed on the Limits page (7500 IOPS * 8k blocksize = 60MB/s).
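
That ceiling is just the advertised IOPS multiplied by the benchmark's block size:

```shell
# Max sequential throughput at the IOPS cap, for an 8 KB block size.
iops=7500
block_kb=8
echo "$(( iops * block_kb )) KB/s"   # 60000 KB/s = 60 MB/s
```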


If you're able to share, I'd be interested to hear more about your use-case for local storage. We can connect over email if you'd rather discuss privately.

Not OP, but I'm interested in this for use with CloudNative-PG as an alternative to Managed Databases (we have different RPO requirements).
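
For that setup, a CloudNativePG cluster just needs to point its storage section at the local StorageClass. A minimal sketch, assuming a StorageClass named `local-storage` backed by the file containers above (names and sizes are illustrative):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-local
spec:
  instances: 3          # replicas survive single-node loss
  storage:
    storageClass: local-storage
    size: 100Gi
```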

For what it's worth, here's our rudimentary pgbench results on CloudNative-PG using the above local file container vs managed database:

CloudNative-PG

Init:

```
done in 356.09 s (drop tables 0.00 s, create tables 0.02 s, client-side generate 212.51 s, vacuum 10.92 s, primary keys 132.64 s).
```

Select (8 clients / 8 threads):

```
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type:
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 152322
number of failed transactions: 0 (0.000%)
latency average = 1.571 ms
initial connection time = 90.086 ms
tps = 5092.207290 (without initial connection time)
Stream closed EOF for default/pgbench-run3ro-snc5r (pgbench)
```

RW (64 clients / 64 threads):

```
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type:
scaling factor: 1000
query mode: simple
number of clients: 64
number of threads: 64
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 37106
number of failed transactions: 0 (0.000%)
latency average = 50.980 ms
initial connection time = 602.344 ms
tps = 1255.393173 (without initial connection time)
Stream closed EOF for default/pgbench-run64x64-sd9hb (pgbench)
```

Managed (1x s-4gb-2vcpu)

Init:

```
done in 295.82 s (drop tables 0.00 s, create tables 0.00 s, client-side generate 187.63 s, vacuum 1.59 s, primary keys 106.59 s).
```

Select (8 clients / 8 threads):

```
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type:
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 8
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 132639
number of failed transactions: 0 (0.000%)
latency average = 1.805 ms
initial connection time = 85.801 ms
tps = 4432.812902 (without initial connection time)
Stream closed EOF for default/pgbench-run3ro-cloud-6n5ss (pgbench)
```

RW (64 clients / 64 threads):

```
pgbench (16.3 (Debian 16.3-1.pgdg110+1))
starting vacuum...end.
transaction type:
scaling factor: 1000
query mode: simple
number of clients: 64
number of threads: 64
maximum number of tries: 1
duration: 30 s
number of transactions actually processed: 25884
number of failed transactions: 0 (0.000%)
latency average = 72.894 ms
initial connection time = 691.792 ms
tps = 877.986444 (without initial connection time)
Stream closed EOF for default/pgbench-run64x64-cloud-v679b (pgbench)
```