Intermittently, rbd-mirror cannot connect to the remote peer, and we
time out waiting for daemon health after 600 seconds.
When this happens, rbd mirror pool status reports ERROR:
$ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

DAEMONS
service 4361:
  instance_id: 4408
  client_id: a
  hostname: dr2
  version: 18.2.2
  leader: true
  health: ERROR
  callouts: unable to connect to remote cluster
In the rbd-mirror log we can see:
8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin
debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0 0 rbd::mirror::PoolReplayer: 0x5589c90dc000
init_rados: reverting global config option override: mon_host:
[v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->
unable to get monitor info from DNS SRV with service name: ceph-mon
debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp
debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot
identify monitors to contact
After restarting the rbd-mirror daemon, it works normally.
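The restart workaround could look roughly like the sketch below. The helper name restart_rbd_mirror is hypothetical, and the namespace (rook-ceph) and deployment name (rook-ceph-rbd-mirror-a) are assumptions based on Rook defaults, so verify both in your environment before using it.

```shell
# Restart the rbd-mirror daemon in the given cluster and wait for the rollout.
# Assumptions: Rook's default namespace "rook-ceph" and deployment name
# "rook-ceph-rbd-mirror-a"; neither is confirmed by this issue.
restart_rbd_mirror() {
    local context="$1"
    kubectl --context "$context" -n rook-ceph \
        rollout restart deployment/rook-ceph-rbd-mirror-a
    kubectl --context "$context" -n rook-ceph \
        rollout status deployment/rook-ceph-rbd-mirror-a --timeout=120s
}

# Usage: restart_rbd_mirror dr2
```

This only works around the failure; it does not explain why the daemon could not resolve the remote monitors in the first place.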
Improve logging
We need to configure rbd-mirror logging the way ODF does: https://github.com/red-hat-storage/ocs-operator/blob/4a0325d824a409e84fac21ffbf0a1338971d1a70/controllers/storagecluster/cephrbdmirrors.go#L142-L232
Hopefully this can be done with the rook-ceph-override ConfigMap: https://rook.io/docs/rook/latest-release/Storage-Configuration/Advanced/ceph-configuration/#example
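A minimal sketch of what such an override might look like follows. The helper name rbd_mirror_debug_override is hypothetical. The namespace (rook-ceph), the config section name (client.rbd-mirror.a, derived from client_id a above), and the debug levels are all assumptions for illustration; the actual values should be taken from the ODF code linked above.

```shell
# Print a rook-ceph-override ConfigMap that raises rbd-mirror log levels.
# Assumptions: namespace "rook-ceph", daemon name "client.rbd-mirror.a",
# and illustrative debug levels; align them with the ODF settings.
rbd_mirror_debug_override() {
    cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-override
  namespace: rook-ceph
data:
  config: |
    [client.rbd-mirror.a]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
EOF
}

# Apply on the cluster running the failing daemon:
#   rbd_mirror_debug_override | kubectl --context dr2 apply -f -
```

The rbd-mirror daemon most likely needs a restart before it picks up the new settings, since the override is read from the Ceph config file at daemon startup.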
Tasks