In this instance: https://github.com/RamenDR/ramen/actions/runs/11558180380
We have a ceph core dump:
```
% ls -lh gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
-rw-------@ 1 nsoffer staff 1.5G Oct 28 18:33 gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/core.12
```
```
% cat gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2/ceph-mgr.a.log
...
2024-10-28T16:33:13.231+0000 7ffb02ebc640 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: In function 'virtual ceph::common::RefCountedObject::~RefCountedObject()' thread 7ffb02ebc640 time 2024-10-28T16:33:13.224069+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/19.2.0/rpm/el9/BUILD/ceph-19.2.0/src/common/RefCountedObj.cc: 14: FAILED ceph_assert(nref == 0)
ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x12e) [0x7ffb445d2d86]
2: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
3: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
4: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
5: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
6: ceph-mgr(+0x19da74) [0x559b6ab47a74]
7: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
8: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
9: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]
2024-10-28T16:33:13.235+0000 7ffb02ebc640 -1 *** Caught signal (Aborted) **
in thread 7ffb02ebc640 thread_name:OpHistorySvc
ceph version 19.2.0 (16063ff2022298c9300e49a547a16ffda59baf13) squid (stable)
1: /lib64/libc.so.6(+0x3e730) [0x7ffb43f65730]
2: /lib64/libc.so.6(+0x8ba6c) [0x7ffb43fb2a6c]
3: raise()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x188) [0x7ffb445d2de0]
6: /usr/lib64/ceph/libceph-common.so.2(+0x182f44) [0x7ffb445d2f44]
7: /usr/lib64/ceph/libceph-common.so.2(+0x279259) [0x7ffb446c9259]
8: /usr/lib64/ceph/libceph-common.so.2(+0x3e69cb) [0x7ffb448369cb]
9: (ceph::common::RefCountedObject::put() const+0x1b2) [0x7ffb446cb182]
10: ceph-mgr(+0x19da74) [0x559b6ab47a74]
11: (OpHistoryServiceThread::entry()+0x124) [0x559b6abafa84]
12: /lib64/libc.so.6(+0x89d22) [0x7ffb43fb0d22]
13: /lib64/libc.so.6(+0x10ed40) [0x7ffb44035d40]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
```
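The failing assert, `ceph_assert(nref == 0)` in `~RefCountedObject()`, means the object was destructed while its reference count was not zero, which typically indicates a reference-counting race or imbalance. A minimal sketch of one way to inspect the core, assuming a container image matching the 19.2.0 build is available (the image tag is an assumption, not taken from the run):

```sh
# Mount the gathered logs into a container running the same ceph build that
# produced the dump (the tag must match 19.2.0 exactly for symbols to line up).
podman run -it --rm \
    -v "$PWD/gather.11558180380-1/rdr-dr2/addons/rook/logs/rdr-dr2:/dump:z" \
    quay.io/ceph/ceph:v19.2.0 bash

# Inside the container: install gdb (ceph debuginfo packages, if reachable,
# add full source-level symbols) and dump backtraces for all threads.
dnf install -y gdb
gdb --batch -ex 'bt' -ex 'thread apply all bt' /usr/bin/ceph-mgr /dump/core.12
```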
Another case where we had a ceph core dump: https://github.com/RamenDR/ramen/actions/runs/11603228347
This fails very regularly, so it is not really a flake, but we are sticking with the flake keyword.
The failure log looks like this:
The failure seems to stem from the failover cluster not reporting DataReady, which is required before the workload can be deployed to that cluster. This usually shows up in the logs as follows:
The RS not being set up until there is a pod is fine, but it should not get a vote in DataReady. We need to root-cause this in the code to understand the behavior and correct it; a sketch of how to inspect this on a live run follows.
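To help with root-causing the next occurrence, a minimal sketch of checking the vote on the failover cluster, assuming the VolumeReplicationGroup (`vrg`) sits in the application namespace; the angle-bracket names are placeholders, not values from these runs:

```sh
# Inspect the VRG conditions on the failover cluster; DataReady staying false
# is the symptom described above.
kubectl get vrg -n <app-namespace> <vrg-name> -o json | jq '.status.conditions'

# Check whether the VolSync ReplicationSource (RS) exists yet and what it
# reports; per the note above, an RS that is not set up yet should not get
# a vote in DataReady.
kubectl get replicationsource -n <app-namespace> -o yaml
```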
Instances: