inviscid opened this issue 2 years ago
Thank you for the suggestion. I'm taking a look at your proposal.
This is a very interesting idea. If we implement it, pods with local storage can migrate freely, and the cluster (for example PostgreSQL or Greenplum) can get back to normal with data reconstruction.
However, there are a few obstacles that need to be addressed first:

1. Carina needs to support ephemeral storage provisioning, so that the bcache device shares its lifecycle with the pod. When the pod is deleted, the bcache device is deleted too, flushing the cached data to the persistent storage.
2. We need a way to tell Carina which persistent storage to use as the cold layer when building the bcache device.
3. The pod needs to specify at least two storage classes: one for Carina and one for the persistent storage provisioner (a rough sketch of such a pod spec follows this list). When the kubelet prepares those two volumes while setting up the container, it might run into a deadlock.
4. Since the persistent storage is also mounted into the pod, the application inside must never write to that device directly. And honestly, I am not sure whether even a pure mount operation could cause data corruption.
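To make the first three points concrete, here is a minimal sketch of what such a pod could look like, assuming Carina supported generic ephemeral volumes for the cache tier. The StorageClass name, the PVC name, and the `carina.storage.io/backing-pvc` annotation are all assumptions for illustration, not existing Carina features:

```yaml
# Sketch only. Assumes Carina supported generic ephemeral volumes and a
# hypothetical annotation that names the cold-layer PVC.
apiVersion: v1
kind: Pod
metadata:
  name: pg-0
  annotations:
    carina.storage.io/backing-pvc: pg-data-cold   # hypothetical, not implemented
spec:
  containers:
  - name: postgres
    image: postgres:15
    volumeMounts:
    - name: cache-tier                 # filesystem on top of the bcache device
      mountPath: /var/lib/postgresql/data
    volumeDevices:
    - name: cold-tier                  # attached so Carina can reach it on this node,
      devicePath: /dev/cold            # but the application must never write to it directly
  volumes:
  - name: cache-tier
    ephemeral:                         # deleted with the pod, so the cache is flushed/discarded
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: csi-carina-cache   # assumed Carina StorageClass name
          resources:
            requests:
              storage: 50Gi
  - name: cold-tier
    persistentVolumeClaim:
      claimName: pg-data-cold          # network-attached PV that follows the pod
```

Whether the kubelet can prepare both volumes without the ordering/deadlock issue mentioned in point 3 is exactly the open question.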
@inviscid any thoughts?
We also need to take care of power failures: we may not get a chance to flush the hot data.
Is your feature request related to a problem? / Why is this needed
The ability for Carina to provide tiered storage using bcache is very powerful, especially in the context of database operations. However, it currently requires data to reside at the node level rather than leveraging a combination of persistent storage at the pod level and ephemeral NVMe/SSD storage at the node level. This makes it very difficult to move pods to new nodes.
Describe the solution you'd like in detail
Would it be possible to construct the bcache volume within a pod, so that it uses local node ephemeral NVMe/SSD disks for the cache but a PV exposed at the pod level for the backing store? This way, the persistent part of the bcache can move easily with the pod, and the cache portion would be discarded and rebuilt once the pod has been rescheduled to a new node.
For example, in a GCP environment we can create a node with a local 375 GB NVMe drive. As pods are scheduled to the node, a portion of the 375 GB drive is allocated to the pod as a cache device (raw block device), alongside a PV (raw block device) attached from the GCP persistent disk service. When the pod is initialized, the bcache device is created pod-locally using the two attached block devices.
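For illustration, the storage wiring on GCP might look roughly like the following. It assumes the GCP PD CSI driver (`pd.csi.storage.gke.io`) for the persistent layer and some local-volume StorageClass (name assumed here) carving the cache out of the node's NVMe drive; both PVCs use `volumeMode: Block` so the pod receives two raw block devices to assemble into bcache:

```yaml
# Persistent (cold) layer: a GCP PD volume that follows the pod across nodes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-ssd-block
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-data-cold
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block                    # raw block device, no filesystem from the provisioner
  storageClassName: pd-ssd-block
  resources:
    requests:
      storage: 200Gi
---
# Cache (hot) layer: a slice of the node-local 375 GB NVMe drive.
# The StorageClass name is an assumption; it stands for whatever local-LVM
# class (Carina or similar) exposes the NVMe disk as block devices.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pg-cache-hot
spec:
  accessModes: ["ReadWriteOnce"]
  volumeMode: Block
  storageClassName: csi-carina-nvme    # assumed name
  resources:
    requests:
      storage: 50Gi
```

The pod would then list both claims under `volumeDevices` so it sees two raw block devices, one node-local and one that follows the pod; how the bcache device itself gets assembled is the open design question (sketched very roughly under the alternatives below).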
The benefit of this is that the data is no longer node-bound and pods can be rescheduled easily to new nodes, with their persistent data following them. It would also enable resizing individual PVs without worrying about how much disk space is attached at the node level.
Describe alternatives you've considered
- Just sticking with standard network-attached PVs. This is not optimal for database operations, since having local disk can significantly boost read/write performance.
- Trying a homegrown version of this local bcache concept using TopoLVM (https://github.com/topolvm/topolvm) and network-attached PVs; a rough sketch of that assembly follows this list.
- Using ZFS ARC, but that also requires setting up our own storage layer rather than leveraging GCP, AWS, or Azure managed storage.
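For the homegrown route, the assembly step could be approximated with a privileged init container that runs `make-bcache` over the two raw block devices (reusing the claim names from the sketch above, with TopoLVM or similar providing the node-local claim). This is heavily simplified, assumes the bcache kernel module is loaded on the node, and skips device registration, idempotency, and cleanup; it is only meant to illustrate the idea:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pg-homegrown
spec:
  initContainers:
  - name: assemble-bcache
    image: debian:bookworm              # any image with (or able to install) bcache-tools
    securityContext:
      privileged: true                  # required to write bcache superblocks on host devices
    command: ["/bin/sh", "-c"]
    args:
    - |
      apt-get update -qq && apt-get install -y -qq bcache-tools
      # -B = backing (persistent) device, -C = cache (NVMe) device.
      # Re-running against already-formatted devices will fail, hence '|| true';
      # a real implementation needs proper idempotency checks here.
      make-bcache -B /dev/cold -C /dev/hot || true
      # Registering the devices so /dev/bcacheN appears is normally handled by
      # udev on the host and is omitted from this sketch.
    volumeDevices:
    - name: cold-tier
      devicePath: /dev/cold
    - name: hot-tier
      devicePath: /dev/hot
  containers:
  - name: postgres
    image: postgres:15
    # The application would use the assembled /dev/bcacheN device (for example via
    # a filesystem created on it); that plumbing is omitted here.
  volumes:
  - name: cold-tier
    persistentVolumeClaim:
      claimName: pg-data-cold           # network-attached PVC (see sketch above)
  - name: hot-tier
    persistentVolumeClaim:
      claimName: pg-cache-hot           # node-local NVMe PVC (TopoLVM, Carina, or similar)
```

Even in this form it shows why native support in Carina would be preferable: the lifecycle, flushing, and failure handling all end up hand-rolled otherwise.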
Additional context
This would have immediate use for Postgres and Greenplum running in Kubernetes. The churn of rebuilding large data drives can be significant for clusters with frequent node terminations (spot instances).