RamenDR / ramen


e2e flake: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover fails #1557

Open · ShyamsundarR opened this issue 1 month ago

ShyamsundarR commented 1 month ago

This fails very regularly, so it is not really a flake, but I am sticking to the keyword flake.

The failure log looks like this:

--- FAIL: TestSuites/Exhaustive/Deploy-cephfs (0.00s)
  --- FAIL: TestSuites/Exhaustive/Deploy-cephfs/Subscr (852.06s)
    --- PASS: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Deploy (35.59s)
    --- PASS: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Enable (210.49s)
    --- FAIL: TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover (605.99s)
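
For reference, nested names like these come from Go testing subtests. A hypothetical sketch of how such names arise (the real e2e suite layout may differ; testDeploy, testEnable, and testFailover are stand-in names):

    package e2e_test

    import "testing"

    // Stand-ins for the real e2e steps (hypothetical names).
    func testDeploy(t *testing.T)   {}
    func testEnable(t *testing.T)   {}
    func testFailover(t *testing.T) {}

    func TestSuites(t *testing.T) {
        t.Run("Exhaustive", func(t *testing.T) {
            t.Run("Deploy-cephfs", func(t *testing.T) {
                t.Run("Subscr", func(t *testing.T) {
                    t.Run("Deploy", testDeploy)
                    t.Run("Enable", testEnable)
                    t.Run("Failover", testFailover) // the step failing above
                })
            })
        })
    }

With that structure, just the failing step can be rerun with go test -run 'TestSuites/Exhaustive/Deploy-cephfs/Subscr/Failover'.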

The failure seems to stem from the failover cluster not reporting DataReady, which is required before the workload can be deployed to the cluster. This usually shows up in the logs as follows:

2024-09-12T13:14:56.851Z        INFO    controllers.VolumeReplicationGroup      volsync/vshandler.go:385        PVC is not in use by ready pod, not creating RS yet ...   {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "rsSpec": {"protectedPVC":{"namespace":"subscr-deploy-cephfs-busybox","name":"busybox-pvc","protectedByVolSync":true,"storageID":{"id":""},"replicationID":{"id":""},"storageClassName":"rook-cephfs","annotations":{"apps.open-cluster-management.io/hosting-subscription":"subscr-deploy-cephfs-busybox/subscr-deploy-cephfs-busybox","apps.open-cluster-management.io/reconcile-option":"merge"},"labels":{"app":"subscr-deploy-cephfs-busybox","app.kubernetes.io/part-of":"subscr-deploy-cephfs-busybox","appname":"busybox"},"accessModes":["ReadWriteMany"],"resources":{"requests":{"storage":"1Gi"}}}}, "runFinalSync": false}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:121   Not all ReplicationSources completed setup. We'll retry...        {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}

2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:480   VolSync PVsRestored is not complete yet for PVC busybox-pvc       {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:480   VolSync ReplicationSourceSetup is not complete yet for PVC busybox-pvc    {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/vrg_volsync.go:418   VolSync RS hasn't been setup yet for PVC busybox-pvc      {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/volumereplicationgroup_controller.go:1354    Marking VRG ready with replicating reason {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary", "reason": "Unused"}
2024-09-12T13:14:56.852Z        INFO    controllers.VolumeReplicationGroup.vrginstance  controller/volumereplicationgroup_controller.go:1320    DataReady {"VolumeReplicationGroup": {"name":"subscr-deploy-cephfs-busybox","namespace":"subscr-deploy-cephfs-busybox"}, "rid": "4745ce1e-01a4-41c8-b93b-70966f46932e", "State": "primary", "subconditions": [{"type":"DataReady","status":"False","observedGeneration":3,"lastTransitionTime":null,"reason":"Progressing","message":"Not all VolSync PVCs are ready"},{"type":"DataReady","status":"True","observedGeneration":3,"lastTransitionTime":null,"reason":"Unused","message":"No PVCs are protected using VolumeReplication scheme"}]}

The RS not being set up until there is a pod is fine, but that state should not get a vote in DataReady. We need to root-cause this in the code to understand the behavior and correct it.
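
If the root cause is in how the two same-named DataReady subconditions in the log above get merged, the fix is likely to make any False subcondition win. A minimal sketch of such a merge (hypothetical, not Ramen's actual code, assuming the k8s.io/apimachinery condition types):

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // mergeSubconditions collapses all subconditions of the given type into
    // one condition: any False subcondition wins, so a Progressing/False
    // cannot be masked by an Unused/True placeholder.
    func mergeSubconditions(condType string, subs []metav1.Condition) metav1.Condition {
        merged := metav1.Condition{Type: condType, Status: metav1.ConditionTrue, Reason: "Unused"}
        for _, c := range subs {
            if c.Type != condType {
                continue
            }
            if c.Status == metav1.ConditionFalse {
                return c // worst status wins; keep its reason and message
            }
            if merged.Reason == "Unused" && c.Reason != "Unused" {
                merged = c // prefer a meaningful True over the Unused placeholder
            }
        }
        return merged
    }

    func main() {
        // The two DataReady subconditions from the log above.
        subs := []metav1.Condition{
            {Type: "DataReady", Status: metav1.ConditionFalse, Reason: "Progressing",
                Message: "Not all VolSync PVCs are ready"},
            {Type: "DataReady", Status: metav1.ConditionTrue, Reason: "Unused",
                Message: "No PVCs are protected using VolumeReplication scheme"},
        }
        fmt.Printf("merged: %+v\n", mergeSubconditions("DataReady", subs))
    }

Applied to the subconditions in the log above, this would report DataReady as False/Progressing instead of letting the Unused True mask it.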

Instances:

nirs commented 1 week ago

In this instance: https://github.com/RamenDR/ramen/actions/runs/11558180380

We have a ceph core dump:

% ls -lh gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
-rw-------@ 1 nsoffer  staff   1.5G Oct 28 18:33 gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
% cat gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/ceph-mgr.a.log
...
2024-10-28T16:33:13.231+0000 7ffb02ebc640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedObject()' thread 7ffb02ebc640 time 2024-10-28T16:33:13.224069+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)

 ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7ffb445d2d86]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
 3: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
 4: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
 5: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
 6: ceph-mgr(+0x19da74) [0x559b6ab47a74]
 7: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
 8: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
 9: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]

2024-10-28T16:33:13.235+0000 7ffb02ebc640 -1 *** Caught signal (Aborted) **
 in thread 7ffb02ebc640 thread_name:OpHistorySvc

 ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
 1: /lib64/libc.so.6(+0x3e730) [0x7ffb43f65730]
 2: /lib64/libc.so.6(+0x8ba6c) [0x7ffb43fb2a6c]
 3: raise()
 4: abort()
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7ffb445d2de0]
 6: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
 7: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
 8: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
 9: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
 10: ceph-mgr(+0x19da74) [0x559b6ab47a74]
 11: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
 12: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
 13: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
nirs commented 1 week ago

Another case where we had a ceph core dump: https://github.com/RamenDR/ramen/actions/runs/11603228347