A repo-tutorial to learn how to install and use Rook Ceph on a Kubernetes cluster.
Read the associated blog post on Medium!
Ceph is an open-source, software-defined storage solution that lets you store data as objects (through the S3-compatible Ceph Object Gateway), blocks (Ceph RBD) or files (CephFS).
Rook is a cloud-native storage orchestrator for Kubernetes while Ceph is a distributed storage system. Rook Ceph is the integration of Rook and Ceph, providing an easy-to-manage storage solution for Kubernetes.
What we will be doing:
Ceph gives you fault-tolerant, replicated storage for your Kubernetes pods (replicated volumes).
Additionally, it avoids putting too much disk pressure on a single storage source, and it saves money by lowering the maintenance cost of your storage servers, since the storage is managed by Kubernetes (updates across the cluster, horizontal scaling over several smaller servers).
With Ceph, you have a continuous scaling path, forward, forever.
Keep in mind that, most of the time, a natively distributed system such as an Elasticsearch cluster has each of its nodes using an individual PVC; Elasticsearch then natively manages data replication across its nodes. Ceph can still be useful there to avoid losing the underlying data used by those nodes, as a second layer of protection. Ceph is most useful for non-natively-distributed systems such as a simple PostgreSQL database or backup data.
Create a 3-node Kubernetes cluster
We recommend Scaleway Kapsule to easily instantiate a Kubernetes cluster with 3 nodes and attach unformatted volumes to them.
Once the Kubernetes cluster has started, we will create an attached volume (disk) for each node:
The new disk should appear under /dev/sdb on each node.
All raw disks will be used for our Ceph cluster.
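You can verify from each node that the extra disk is present and still raw (no filesystem); the device name may differ if the node already has several disks:
lsblk -f
# sdb should be listed with an empty FSTYPE column: Ceph OSDs require raw,
# unformatted devices with no partitions or filesystem on them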
Clone the Rook repo
git clone --single-branch --branch release-1.11 https://github.com/rook/rook.git
Deploy Rook resources
kubectl create -f ./rook/deploy/examples/crds.yaml
kubectl create -f ./rook/deploy/examples/common.yaml
kubectl create -f ./rook/deploy/examples/operator.yaml
All components will be instantiated in the rook-ceph namespace.
Check the status of the deployed operator and wait for it to be running:
kubectl get po -n rook-ceph
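If you prefer not to poll manually, you can block until the operator pod is ready; this assumes the operator pod carries the app=rook-ceph-operator label set by operator.yaml:
kubectl -n rook-ceph wait pod -l app=rook-ceph-operator --for=condition=Ready --timeout=300s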
Create the Ceph cluster
kubectl create -f ./rook/deploy/examples/cluster.yaml -n rook-ceph
:information_source: You can configure the exact devices and nodes used to create the Ceph cluster from line 235 of the cluster.yaml file.
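As a quick way to review that region before (or after) applying the file, print it and look for the storage section; the keys to check are roughly useAllNodes, useAllDevices and the optional deviceFilter:
sed -n '230,260p' ./rook/deploy/examples/cluster.yaml
# look for something like:
#   storage:
#     useAllNodes: true      # or list nodes explicitly under "nodes:"
#     useAllDevices: true    # or restrict OSDs with a deviceFilter such as "sdb"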
Wait several minutes until the health is HEALTH_OK:
kubectl get cephcluster -n rook-ceph
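You can also watch the resource or read the health straight from its status; this assumes the CephCluster created by cluster.yaml is named rook-ceph and reports its health under .status.ceph.health:
kubectl get cephcluster -n rook-ceph -w
kubectl get cephcluster rook-ceph -n rook-ceph -o jsonpath='{.status.ceph.health}'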
Deploy the toolbox and check the cluster status
The Ceph Toolbox can be used to perform actions on the Ceph cluster through its CLI.
kubectl create -f ./rook/deploy/examples/toolbox.yaml -n rook-ceph
Enter the toolbox pod:
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
Then check Ceph's status. Each host's state should be exists,up:
ceph osd status
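A few other toolbox commands are handy to get an overall view of the cluster:
ceph status        # overall health, MONs, MGRs and OSDs
ceph osd tree      # OSD-to-host topology
ceph df            # raw and per-pool capacity usage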
Create the RBD storage class
RBD stands for RADOS Block Device; it gives you a storage class to provision volumes in your Kubernetes cluster. RBD only supports ReadWriteOnce (RWO) volumes. See step 7 for ReadWriteMany capabilities.
:information_source: RBD's storage class name is rook-ceph-block
kubectl create -f ./rook/deploy/examples/csi/rbd/storageclass.yaml -n rook-ceph
To check that a volume correctly binds to the rook-ceph-block storage class:
kubectl create -f ./rook/deploy/examples/csi/rbd/pvc.yaml -n rook-ceph
kubectl get pvc rbd-pvc -n rook-ceph # status should be "BOUND"
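The pvc.yaml applied above is essentially a one-resource manifest. For your own workloads (in whatever namespace they run in), a claim against this storage class can look like the following minimal sketch; the claim name and size are placeholders:
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-rbd-claim              # placeholder name
spec:
  accessModes:
    - ReadWriteOnce               # RBD volumes are RWO only
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-ceph-block
EOF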
Create the CephFS storage class
CephFS acts like a replicated NFS server. It is what will allow us to create volumes in ReadWriteMany (RWX) mode.
:information_source: CephFS' storage class name is rook-cephfs
kubectl create -f ./rook/deploy/examples/filesystem.yaml -n rook-ceph
kubectl create -f ./rook/deploy/examples/csi/cephfs/storageclass.yaml -n rook-ceph
To check that a volume correctly binds to the rook-cephfs storage class:
kubectl create -f ./rook/deploy/examples/csi/cephfs/pvc.yaml -n rook-ceph
kubectl get pvc cephfs-pvc -n rook-ceph # status should be "BOUND"
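Likewise, a ReadWriteMany claim for your own workloads can look like this minimal sketch (name and size are placeholders):
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-cephfs-claim           # placeholder name
spec:
  accessModes:
    - ReadWriteMany               # RWX, served by CephFS
  resources:
    requests:
      storage: 1Gi
  storageClassName: rook-cephfs
EOF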
Deploy Ceph's dashboard
kubectl create -f ./rook/deploy/examples/dashboard-external-https.yaml -n rook-ceph
Forward access to the dashboard:
kubectl port-forward service/rook-ceph-mgr-dashboard -n rook-ceph 8443:8443
Connect with the username admin and the following password:
kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode
Let's now deploy psitransfer!
Deploy the file sharing app
kubectl create -f ./psitransfer-deployment-rwx.yaml
See on which node it is deployed:
kubectl get pods -o wide -l app=psitransfer
Retrieve the IP of this node (through the Scaleway interface) and check that the app is running at http://nodeip:30080.
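For reference, an RWX-backed app like this one boils down to three objects: a PersistentVolumeClaim using the rook-cephfs storage class, a Deployment mounting it, and a NodePort Service. The sketch below is illustrative only and not meant to be applied on top of the manifest above; the image, container port 3000 and mount path /data are assumptions based on the upstream psitransfer image, so refer to psitransfer-deployment-rwx.yaml in this repo for the exact definition.
# Illustrative sketch only; do not apply alongside psitransfer-deployment-rwx.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: psitransfer-data
spec:
  accessModes: ["ReadWriteMany"]      # RWX, backed by CephFS
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-cephfs
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: psitransfer-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: psitransfer
  template:
    metadata:
      labels:
        app: psitransfer
    spec:
      containers:
        - name: psitransfer
          image: psitrax/psitransfer   # upstream image (assumption)
          ports:
            - containerPort: 3000      # default app port (assumption)
          volumeMounts:
            - name: data
              mountPath: /data         # where uploads are stored (assumption)
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: psitransfer-data
---
apiVersion: v1
kind: Service
metadata:
  name: psitransfer
spec:
  type: NodePort
  selector:
    app: psitransfer
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30080                  # matches the URL used above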
Let's upload some files
Download the 5MB, 10MB and 20MB files from the xcal1.vodafone.co.uk website.
Upload them to our file transfer app. Click the link that appears on screen.
You should now see the three files you imported. Click on the download link and keep it in your browser tab; we'll use it later.
After uploading around 400MB of files, we can check that data replication is consistent across disks. The 3 disks are written to simultaneously while we upload files. In the following screenshot, usage is 1% for each disk: although I uploaded everything to the same host, replication works as expected, with data persisted equally across the 3 disks (OSDs). Disk 2 shows a lot of "read" activity as the 2 other disks synchronize data from it.
This is how Ceph's dashboard should look:
We're going to stop the node hosting the web app to make sure data was replicated on the other nodes.
See on which node the app is deployed:
kubectl get pods -o wide -l app=psitransfer
Power off the node from the Scaleway console.
This simulates a power failure on a node. It should become NotReady after several minutes:
$> kubectl get node
NAME                                             STATUS     ROLES    AGE    VERSION
scw-ceph-test-clustr-default-5f02f221c3814b47a   Ready      <none>   3d1h   v1.26.2
scw-ceph-test-clustr-default-8929ba466e404a00a   Ready      <none>   3d1h   v1.26.2
scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8   NotReady   <none>   3d1h   v1.26.2
Node 3 is now shown as unavailable on our Ceph dashboard, which should look like this:
Reschedule our pod
The node hosting our pod is now unavailable. However, the pod still appears to be running:
$> kubectl get pods -o wide -l app=psitransfer
NAME                                      READY   STATUS    RESTARTS   AGE   IP            NODE
psitransfer-deployment-8448887c9d-mt6wm   1/1     Running   0          19h   100.64.1.19   scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8
Delete it so it gets rescheduled on another node:
kubectl delete pod psitransfer-deployment-8448887c9d-mt6wm
Check the status of the newly created pod. Your app should be available again at the link you kept earlier.
:information_source: To avoid having to manually delete the pod for it to be rescheduled when a node becomes NotReady, scale your app to at least 3 replicas by default.
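Since the volume is ReadWriteMany, several replicas can mount it at the same time; scaling is a one-liner (the deployment name is taken from the pod name shown above):
kubectl scale deployment psitransfer-deployment --replicas=3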
You can now restart the previously powered-off node.
If your applications need better performance and require block storage with RWO access mode, use the rook-ceph-block (RBD) storage class. On the other hand, if your applications need a shared file system with RWX (CephFS) access mode and POSIX compliance, use the rook-cephfs storage class.
If you choose RBD and try to reschedule a pod while its original node is offline, as we did with CephFS, you will get an error from the PVC stating: "Volume is already exclusively attached to one node and can't be attached to another". In that case, you just need to wait for the volume to be re-attached (it took about 6 minutes for the cluster to automatically re-attach it to my pod, allowing it to start).
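While waiting, you can watch the attachment being released and re-created with standard Kubernetes commands; the pod name placeholder below is the one shown by kubectl get pods:
kubectl get volumeattachment                      # one entry per attached CSI volume and node
kubectl describe pod <new-psitransfer-rwo-pod>    # events show the attach/detach progress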
You can try this behavior by following this procedure:
Run the RBD deployment example:
kubectl create -f ./psitransfer-deployment-rwo.yaml
Shut down the instance on which the pod is scheduled:
kubectl get pods -o wide -l app=psitransfer-rwo
Cordon the node to make sure nothing will be rescheduled on it:
kubectl cordon scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8
Delete the pod so it gets rescheduled (and wait ~6 minutes for the PVC to bind):
kubectl delete pod --grace-period=0 --force psitransfer-deployment-8448887c9d-mt6wm
You should still be able to access the data previously uploaded to the app.
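Once you have powered the instance back on from the Scaleway console, remember to uncordon it so it can receive workloads again:
kubectl uncordon scw-ceph-test-clustr-default-94ef39ea5b1f4b3e8
kubectl get nodes   # the node should go back to Ready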