ceph / ceph-csi

CSI driver for Ceph
Apache License 2.0

rbd node service: flatten image when it has references to parent #1543

Closed: pkalever closed this issue 3 years ago

pkalever commented 4 years ago

Describe the bug

Currently, as part of the node service, we add an rbd flatten task for every newly created PVC. Ideally, we should add a flatten task only for snapshot/cloned PVCs that actually reference a parent.


[0] pkalever 😎 rbd✨ kubectl describe pods csi-rbd-demo-pod
[...]
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               10m                  default-scheduler        Successfully assigned default/csi-rbd-demo-pod to minikube
  Normal   SuccessfulAttachVolume  10m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-cec90cb9-8273-4eb1-b2fa-32d3206a1f7d"
  Warning  FailedMount             113s (x12 over 10m)  kubelet, minikube        MountVolume.MountDevice failed for volume "pvc-cec90cb9-8273-4eb1-b2fa-32d3206a1f7d" : rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten rbd-pool/csi-vol-1dae0d96-0238-11eb-93fe-0242ac110004 --id admin --keyfile=***stripped*** -m 192.168.121.136]
  Warning  FailedMount             93s (x4 over 8m21s)  kubelet, minikube        Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc default-token-2xr4n]: timed out waiting for the condition
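
For reference, whether an image actually needs a flatten task can be told from its parent reference: a plain image provisioned for a new PVC has no parent, while a snapshot/clone still points at one. A minimal shell sketch of that check (the first image name is taken from the rbd output in the environment details below; the cloned image name is a placeholder):

# a freshly provisioned image prints no "parent:" line, so no flatten task is needed
rbd info rbd-pool/csi-vol-e15f6413-023f-11eb-93fe-0242ac110004 | grep parent || echo "no parent, skip flatten"

# only an image that still references a parent would need the background task the node service schedules
ceph rbd task add flatten rbd-pool/<cloned-image>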

Environment details

[root@minikube /]# cephcsi --version
Cephcsi Version: canary
Git Commit: fd4328cd5333f4275be52c604c30801fc612fa75
Go Version: go1.15
Compiler: gc
Platform: linux/amd64
Kernel: 4.19.114
[root@minikube /]# 

[root@ceph-node1 ~]# rados lspools
rbd-pool
cephfs-datapool
cephfs-metapool
[root@ceph-node1 ~]# rbd ls -l rbd-pool
NAME                                         SIZE  PARENT FMT PROT LOCK 
csi-vol-e15f6413-023f-11eb-93fe-0242ac110004 1 GiB          2           
image1                                       1 GiB          2           
[root@ceph-node1 ~]# rbd info rbd-pool/csi-vol-e15f6413-023f-11eb-93fe-0242ac110004
rbd image 'csi-vol-e15f6413-023f-11eb-93fe-0242ac110004':
        size 1 GiB in 256 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 10d75105603c
        block_name_prefix: rbd_data.10d75105603c
        format: 2
        features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
        op_features: 
        flags: 
        create_timestamp: Tue Sep 29 10:38:15 2020
        access_timestamp: Tue Sep 29 10:38:15 2020
        modify_timestamp: Tue Sep 29 10:38:15 2020
[root@ceph-node1 ~]# 

Steps to reproduce

[0] pkalever 😎 rbd✨ git diff 
index d4305b58e..2bab278f0 100644
--- a/examples/rbd/storageclass.yaml
+++ b/examples/rbd/storageclass.yaml
[...]
    # (optional) RBD image features, CSI creates image with image-format 2
    # CSI RBD currently supports only `layering` feature.
-   imageFeatures: layering
+   # imageFeatures: layering

[0] pkalever 😎 rbd✨ kubectl create -f storageclass.yaml
storageclass.storage.k8s.io/csi-rbd-sc created

[0] pkalever 😎 rbd✨ kubectl create -f pvc.yaml
persistentvolumeclaim/rbd-pvc created

[0] pkalever 😎 rbd✨ kubectl get pvc
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
rbd-pvc   Bound    pvc-cec90cb9-8273-4eb1-b2fa-32d3206a1f7d   1Gi        RWO            csi-rbd-sc     5s

[0] pkalever 😎 rbd✨ kubectl create -f pod.yaml 
pod/csi-rbd-demo-pod created                     

[0] pkalever 😎 rbd✨ kubectl get pods
NAME                                         READY   STATUS              RESTARTS   AGE
csi-rbd-demo-pod                             0/1     ContainerCreating   0          10m
csi-rbdplugin-hbpq5                          3/3     Running             0          17m
csi-rbdplugin-provisioner-75485f85db-5frvk   6/6     Running             0          17m
csi-rbdplugin-provisioner-75485f85db-sfkl4   6/6     Running             0          17m
csi-rbdplugin-provisioner-75485f85db-zwsfp   6/6     Running             0          17m
vault-867cf4b4d4-qqdt2                       1/1     Running             0          17m
vault-init-job-96d9q                         0/1     Completed           0          17m

[0] pkalever 😎 rbd✨ kubectl describe pods csi-rbd-demo-pod
[...]
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               10m                  default-scheduler        Successfully assigned default/csi-rbd-demo-pod to minikube
  Normal   SuccessfulAttachVolume  10m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-cec90cb9-8273-4eb1-b2fa-32d3206a1f7d"
  Warning  FailedMount             113s (x12 over 10m)  kubelet, minikube        MountVolume.MountDevice failed for volume "pvc-cec90cb9-8273-4eb1-b2fa-32d3206a1f7d" : rpc error: code = Internal desc = an error (exit status 2) occurred while running ceph args: [rbd task add flatten rbd-pool/csi-vol-1dae0d96-0238-11eb-93fe-0242ac110004 --id admin --keyfile=***stripped*** -m 192.168.121.136]
  Warning  FailedMount             93s (x4 over 8m21s)  kubelet, minikube        Unable to attach or mount volumes: unmounted volumes=[mypvc], unattached volumes=[mypvc default-token-2xr4n]: timed out waiting for the condition

Actual results

A flatten task is added for a newly created PVC.

Expected behavior

No flatten task should be added for a newly created PVC.

Madhu-1 commented 3 years ago

@pkalever I tried to reproduce this on Ceph Octopus but was not able to.

i.imagename:csi-vol-aee99548-1809-11eb-a07a-826f7defc52c csi.volname:pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1])
I1027 04:05:43.332769       1 rbd_journal.go:435] ID: 17 Req-ID: pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1 generated Volume ID (0001-0009-rook-ceph-0000000000000002-aee99548-1809-11eb-a07a-826f7defc52c) and image name (csi-vol-aee99548-1809-11eb-a07a-826f7defc52c) for request name (pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1)
I1027 04:05:43.332861       1 rbd_util.go:200] ID: 17 Req-ID: pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1 rbd: create replicapool/csi-vol-aee99548-1809-11eb-a07a-826f7defc52c size 1024M (features: []) using mon 10.107.158.84:6789
I1027 04:05:43.353568       1 controllerserver.go:465] ID: 17 Req-ID: pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1 created volume pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1 backed by image csi-vol-aee99548-1809-11eb-a07a-826f7defc52c
I1027 04:05:43.375052       1 omap.go:136] ID: 17 Req-ID: pvc-6893d3e8-eee6-4856-bef0-d2bcea21e4c1 set omap keys (pool="replicapool", namespace="", name="csi.volume.aee99548-1809-11eb-a07a-826f7defc52c"): map[csi.imageid:113e70f8d035])
sh-4.4# rbd info csi-vol-aee99548-1809-11eb-a07a-826f7defc52c --pool=replicapool
rbd image 'csi-vol-aee99548-1809-11eb-a07a-826f7defc52c':
    size 1 GiB in 256 objects
    order 22 (4 MiB objects)
    snapshot_count: 0
    id: 113e70f8d035
    block_name_prefix: rbd_data.113e70f8d035
    format: 2
    features: layering
    op_features: 
    flags: 
    create_timestamp: Tue Oct 27 04:05:43 2020
    access_timestamp: Tue Oct 27 04:05:43 2020
    modify_timestamp: Tue Oct 27 04:05:43 2020
sh-4.4# ceph version
ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)

This is with a cephcsi canary image. Let me know if you are still able to reproduce it; I would like to check a few things.

cjheppell commented 3 years ago

Is there any update on this?

I discussed seeing flattening happening in #1800 but it was mentioned that none of the operations there would cause flattening.

As it stands right now, if I create a PVC, then snapshot that PVC, then create a clone of the snapshot and try to mount it I'm getting this error from a kubectl describe of a pod trying to use that cloned PVC:

Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               73s                default-scheduler        Successfully assigned rook/busybox-sleep to minikube
  Normal   SuccessfulAttachVolume  74s                attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d9e9bde4-ec56-4f6b-8d0c-2928b66df5d7"
  Warning  FailedMount             35s (x6 over 58s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-d9e9bde4-ec56-4f6b-8d0c-2928b66df5d7" : rpc error: code = Internal desc = flatten in progress: flatten is in progress for image csi-vol-b3f39645-709a-11eb-8f85-0242ac110010

It is eventually able to enter a running state, but only because the flatten operation finally completes.

I don't want any flattening to occur, as it defeats the point of me using cloning altogether 😞
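
For context, the workflow described above (PVC, then snapshot, then clone-from-snapshot) corresponds roughly to manifests like the following sketch; the snapshot class, storage class, sizes and object names are illustrative, and the snapshot API version depends on the cluster:

cat <<EOF | kubectl create -f -
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: rbd-pvc-snapshot
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: rbd-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc-restore
spec:
  storageClassName: csi-rbd-sc
  dataSource:
    name: rbd-pvc-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
EOF
# mounting rbd-pvc-restore in a pod is where the flatten (and the error above) shows up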

I did see this in the output of my csi-rbdplugin logs though:

E0216 20:11:48.159792    6417 util.go:232] kernel 4.19.157 does not support required features
E0216 20:11:48.753455    6417 utils.go:136] ID: 274 Req-ID: 0001-0005-rook-0000000000000002-2e26c0b6-7093-11eb-be63-0242ac110010 GRPC error: rpc error: code = Internal desc = flatten in progress: flatten is in progress for image csi-vol-2e26c0b6-7093-11eb-be63-0242ac110010

Is this flattening happening because I'm running a kernel that doesn't support deep flatten? 🤔
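
For reference, the background flatten that ceph-csi schedules (as in the "rbd task add flatten" error earlier in this issue) goes through the ceph mgr task queue, so it can be watched with cluster admin credentials:

ceph rbd task list
# in-progress flatten operations for cloned images show up here until they
# complete, after which the image can be mapped on the node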

cjheppell commented 3 years ago

Ah, it seems this comment suggests you must have kernel 5.1+ to avoid a full flatten: https://github.com/ceph/ceph-csi/pull/693#issuecomment-640067191

Presumably this is the problem, as minikube is using 4.19?
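
A quick way to confirm the kernel the CSI nodeplugin actually sees on a minikube node (assuming a standard minikube VM):

minikube ssh -- uname -r
# a 4.19.x result here matches the "kernel 4.19.157 does not support required features" log line above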

Madhu-1 commented 3 years ago

@cjheppell Kernels older than 5.1 do not support mapping rbd images that have the deep-flatten image feature, so we need to flatten the image first and then map it on the node.
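
A quick way to check whether a given image carries the deep-flatten feature that triggers this path (pool and image names are placeholders):

rbd info <pool>/<csi-vol-image> | grep features
# clones created by ceph-csi v3.x carry deep-flatten, which kernels older than 5.1
# cannot map until the image has been flattened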

cjheppell commented 3 years ago

Was this a change between v2.1.x and v3?

As described in #1800, when I performed the same actions on v2.1.2 I didn't see this flattening behaviour.

Madhu-1 commented 3 years ago

Yes, this is a change in v3.x as we reworked the rbd snapshot and clone implementation.

cjheppell commented 3 years ago

Presumably that's what the "Snapshot Alpha is no longer supported" in the v3.0.0 release notes is referring to? https://github.com/ceph/ceph-csi/releases/tag/v3.0.0

I must admit, this is very surprising and completely unexpected behaviour as a user.

It seems that unless I'm on a kernel 5.1+ then cloning from snapshots is fundamentally not performing the copy-on-write behaviour that Ceph claims to offer. Even more so, that's very hidden from me, as from glancing at the behaviour in Kubernetes it appears that cloning is working. But it's only when I mount the clone that the flatten is revealed to me.

If that snapshot contains hundreds of gigabytes of data, then that operation is likely to take a very long time.

Even more so, the only way I was able to determine that I needed a 5.1+ kernel was by digging through issues and pull request comments.

Could this perhaps be documented more clearly somewhere? It would've saved me an awful lot of time from digging through the lines of code and various pull requests associated with this behaviour.

Madhu-1 commented 3 years ago

Presumably that's what the "Snapshot Alpha is no longer supported" in the v3.0.0 release notes is referring to? https://github.com/ceph/ceph-csi/releases/tag/v3.0.0

I must admit, this is very surprising and completely unexpected behaviour as a user.

It seems that unless I'm on a kernel 5.1+ then cloning from snapshots is fundamentally not performing the copy-on-write behaviour that Ceph claims to offer. Even more so, that's very hidden from me, as from glancing at the behaviour in Kubernetes it appears that cloning is working. But it's only when I mount the clone that the flatten is revealed to me.

In Kubernetes, snapshots and PVCs are independent objects. The new design in v3.x+ handles that: an rbd clone is created when a user requests a Kubernetes snapshot.

If that snapshot contains hundreds of gigabytes of data, then that operation is likely to take a very long time.

Even more so, the only way I was able to determine that I needed a 5.1+ kernel was by digging through issues and pull request comments.

Yes. Because the clones are created with the deep-flatten feature, if the kernel version is less than 5.1 the nodeplugin tries to flatten the image and then maps it. You also have the option to flatten the image during the snapshot create operation itself; rbdsoftmaxclonedepth needs to be set to 1 for that.
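
As a sketch of that second option: the value comes from a command-line flag on the cephcsi rbd driver, so it would be set on the csi-rbdplugin container of the provisioner deployment (the deployment name below matches the pods listed earlier in this issue; exact manifests may differ):

# check whether the flag is already present, then add "--rbdsoftmaxclonedepth=1"
# to the csi-rbdplugin container args and re-apply the deployment
kubectl get deployment csi-rbdplugin-provisioner -o yaml | grep -i clonedepth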

Could this perhaps be documented more clearly somewhere? It would've saved me an awful lot of time from digging through the lines of code and various pull requests associated with this behaviour.

Yes, we will update the documentation with the minimum required kernel version to support snapshot and clone.

cjheppell commented 3 years ago

In Kubernetes, snapshots and PVCs are independent objects. The new design in v3.x+ handles that: an rbd clone is created when a user requests a Kubernetes snapshot.

Quite right, but given I'm using a Ceph driver to fulfil Kubernetes' snapshot/clone operations, I'd still expect the behaviour to match Ceph's own documented snapshot/clone semantics. It appears this is true for kernels 5.1+ on v3.x.x, and it was true for kernels <5.1 on v2.1.x releases, but it is no longer the case for kernels <5.1 on v3.x.x releases.

My point is that as a user, one of the important features Ceph offers is unavailable to me unless some prerequisites are met, and those prerequisites aren't clear.

Perhaps this behaviour could also be opt-in? I'm aware that Kubernetes presents snapshots and PVCs as independent, but if I consciously acknowledge that the hidden relationship is present, could we avoid the need to flatten for kernels <5.1 on v3.x.x releases?

Yes, we will update the documentation with the minimum required kernel version to support snapshot and clone.

Many thanks. That will be very helpful.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.