RamenDR / ramen

drenv rbd-mirror addon: Timeout waiting for mirroring health #1332

Closed nirs closed 4 months ago

nirs commented 5 months ago

Randomly, rbd-mirror cannot connect to the remote peer, and we time out after 600 seconds waiting for daemon health.

When this happens, we see ERROR status in rbd mirror pool status:

$ kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose
health: ERROR
daemon health: ERROR
image health: OK
images: 0 total

DAEMONS
service 4361:
  instance_id: 4408
  client_id: a
  hostname: dr2
  version: 18.2.2
  leader: true
  health: ERROR
  callouts: unable to connect to remote cluster
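To check for this state from a script, the plain-text output above can be filtered with awk. This is only a minimal sketch: `daemon_health` and `daemon_callouts` are hypothetical helper names, and they assume the output layout shown above.

```shell
# Print the value of the top-level "daemon health:" line from
# `rbd mirror pool status` plain-text output read on stdin.
daemon_health() {
    awk '/^daemon health:/ { sub(/^daemon health: */, ""); print; exit }'
}

# Print the callouts reported by the daemons, one per line.
# Assumes each daemon reports them on an indented "callouts:" line.
daemon_callouts() {
    awk '/^ *callouts:/ { sub(/^ *callouts: */, ""); print }'
}
```

For example, piping the command above into `daemon_health` should print `ERROR` for the failing cluster:

kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool --verbose | daemon_health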

In the rbd-mirror log we can see:

8287-356f-4f81-87dc-51bb05942553.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin

debug 2024-04-07T05:18:11.585+0000 7fc86d4808c0  0 rbd::mirror::PoolReplayer: 0x5589c90dc000 init_rados: reverting global config option override: mon_host: [v2:192.168.122.98:3300,v1:192.168.122.98:6789] ->

unable to get monitor info from DNS SRV with service name: ceph-mon

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 failed for service _ceph-mon._tcp

debug 2024-04-07T05:18:11.602+0000 7fc86d4808c0 -1 monclient: get_monmap_and_config cannot identify monitors to contact

After restarting the daemon it works normally.
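The wait-then-restart flow could be sketched roughly like this. It is not the actual drenv code: `wait_for_daemon_health` is a hypothetical helper that polls any status command until it prints `daemon health: OK` or the timeout expires, and the deployment name in the usage comment is an assumption about how Rook names the rbd-mirror deployment.

```shell
# Usage: wait_for_daemon_health TIMEOUT_SECONDS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD repeatedly until its output contains "daemon health: OK".
# Returns 0 when healthy, 1 if the timeout expires first.
wait_for_daemon_health() {
    timeout=$1
    interval=$2
    shift 2
    deadline=$(( $(date +%s) + timeout ))
    while :; do
        # Third whitespace-separated field of "daemon health: OK" is the state.
        health=$("$@" 2>/dev/null | awk '/^daemon health:/ { print $3; exit }')
        if [ "$health" = "OK" ]; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "timeout waiting for daemon health (last: ${health:-unknown})" >&2
            return 1
        fi
        sleep "$interval"
    done
}

# Possible usage (deployment name is an assumption, not from this issue):
#
#   wait_for_daemon_health 600 5 \
#       kubectl rook-ceph --context dr2 rbd mirror pool status -p replicapool \
#   || kubectl --context dr2 -n rook-ceph rollout restart deploy/rook-ceph-rbd-mirror-a
```

Restarting on timeout only works around the problem; it does not explain why the daemon fails to find the monitors.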

Improve logging

We need to configure rbd-mirror logging the way ODF does: https://github.com/red-hat-storage/ocs-operator/blob/4a0325d824a409e84fac21ffbf0a1338971d1a70/controllers/storagecluster/cephrbdmirrors.go#L142-L232

ceph config set client.rbd-mirror.a debug_ms 1
ceph config set client.rbd-mirror.a debug_rbd 15
ceph config set client.rbd-mirror.a debug_rbd_mirror 30
ceph config set client.rbd-mirror.a log_file /var/log/ceph/\$cluster-\$name.log
ceph config set client.rbd-mirror-peer debug_ms 1
ceph config set client.rbd-mirror-peer debug_rbd 15
ceph config set client.rbd-mirror-peer debug_rbd_mirror 30
ceph config set client.rbd-mirror-peer log_file /var/log/ceph/\$cluster-\$name.log
ceph config set mgr mgr/rbd_support/log_level debug
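Since the same four options are set for both clients, the commands above could be generated by a small loop. A sketch, where `set_rbd_mirror_debug` is a hypothetical helper that takes the ceph command prefix as its arguments, so it can run against a cluster or just print the commands:

```shell
# Usage: set_rbd_mirror_debug CEPH_CMD [ARGS...]
# Runs "CEPH_CMD ARGS... config set ..." for every option listed above.
set_rbd_mirror_debug() {
    # Single quotes keep $cluster and $name literal, so ceph expands
    # them (same effect as the backslash escapes above).
    log_file='/var/log/ceph/$cluster-$name.log'
    for client in client.rbd-mirror.a client.rbd-mirror-peer; do
        "$@" config set "$client" debug_ms 1
        "$@" config set "$client" debug_rbd 15
        "$@" config set "$client" debug_rbd_mirror 30
        "$@" config set "$client" log_file "$log_file"
    done
    "$@" config set mgr mgr/rbd_support/log_level debug
}
```

`set_rbd_mirror_debug echo ceph` prints the nine commands without running anything. In a drenv cluster, something like `set_rbd_mirror_debug kubectl rook-ceph --context dr2 ceph` should apply them, assuming the kubectl rook-ceph plugin is installed.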

Hopefully this can be applied through Rook's rook-config-override ConfigMap: https://rook.io/docs/rook/latest-release/Storage-Configuration/Advanced/ceph-configuration/#example
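An untested sketch of what the override could look like, assuming Rook's `rook-config-override` ConfigMap in the `rook-ceph` namespace and that the rbd-mirror daemon reads the matching `[client...]` sections of the generated ceph.conf. `rbd_mirror_override_manifest` is a hypothetical helper that only prints the manifest. The mgr `rbd_support` log level is left out, since it is not clear that a mgr module option can be set this way.

```shell
# Print a ConfigMap manifest carrying the rbd-mirror debug options.
# The quoted 'EOF' delimiter keeps $cluster and $name literal.
rbd_mirror_override_manifest() {
    cat <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: rook-ceph
data:
  config: |
    [client.rbd-mirror.a]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_file = /var/log/ceph/$cluster-$name.log
    [client.rbd-mirror-peer]
    debug_ms = 1
    debug_rbd = 15
    debug_rbd_mirror = 30
    log_file = /var/log/ceph/$cluster-$name.log
EOF
}
```

It could be applied with `rbd_mirror_override_manifest | kubectl apply --context dr2 -f -`. According to the Rook documentation, daemons pick up override changes only after they are restarted.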
