Redeployed GW1's NVMe-oF daemon using ceph orch daemon redeploy. I/O stops and does not resume even after the daemon is back up.
[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# ceph orch daemon redeploy nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv
Scheduled to redeploy nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv on host 'ceph-rbd2-mytest-bgzmwr-node4'
[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# ceph orch ps --daemon-type nvmeof
NAME                                                HOST                           PORTS             STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv  ceph-rbd2-mytest-bgzmwr-node4  *:5500,4420,8009  running (48s)  45s ago    2h   52.7M    -        1.0.0    a647a0311a69  d18b3036f3f1
nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node5.gqqfhf  ceph-rbd2-mytest-bgzmwr-node5  *:5500,4420,8009  running (2h)   10m ago    2h   143M     -        1.0.0    a647a0311a69  1878bdf5d0f2
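The status check above can be scripted. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes the plain-text layout shown above, where the PORTS column is populated so that STATUS is the fourth whitespace-separated field.

```shell
# Hypothetical helper: reads `ceph orch ps --daemon-type nvmeof` text on stdin
# and prints "<daemon name> <status>" for every data row.
# Assumption: PORTS is non-empty, so STATUS is the fourth field.
nvmeof_daemon_status() {
    awk 'NR > 1 && NF >= 4 { print $1, $4 }'
}

# Hypothetical helper: exits 0 only if every listed daemon reports "running".
nvmeof_all_running() {
    awk 'NR > 1 && NF >= 4 && $4 != "running" { bad = 1 }
         END { exit (bad ? 1 : 0) }'
}

# Example usage on a cluster node:
#   ceph orch ps --daemon-type nvmeof | nvmeof_daemon_status
#   ceph orch ps --daemon-type nvmeof | nvmeof_all_running && echo "all running"
```

Note that, as this report shows, a "running" status only confirms that the container is back; it does not confirm that I/O has resumed.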
The disks served by GW1 disappear from the nvme list output on the client:
[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# nvme list
Node                  Generic               SN                   Model                                    Namespace  Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- ---------- -------------------------- ---------------- --------
/dev/nvme3n5          /dev/ng3n5            2                    Ceph bdev Controller                     0x5        536.87 GB / 536.87 GB      512 B + 0 B      23.01.1
/dev/nvme3n4          /dev/ng3n4            2                    Ceph bdev Controller                     0x4        536.87 GB / 536.87 GB      512 B + 0 B      23.01.1
/dev/nvme3n3          /dev/ng3n3            2                    Ceph bdev Controller                     0x3        536.87 GB / 536.87 GB      512 B + 0 B      23.01.1
/dev/nvme3n2          /dev/ng3n2            2                    Ceph bdev Controller                     0x2        536.87 GB / 536.87 GB      512 B + 0 B      23.01.1
/dev/nvme3n1          /dev/ng3n1            2                    Ceph bdev Controller                     0x1        536.87 GB / 536.87 GB      512 B + 0 B      23.01.1
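To make it easier to see which controllers still expose namespaces, the listing above can be summarised per controller. This is a rough sketch with a hypothetical function name. It assumes the plain-text nvme list layout shown above, with the device node in the first column and the model string "Ceph bdev Controller".

```shell
# Hypothetical helper: reads `nvme list` text on stdin and prints
# "<controller> <namespace count>" for each Ceph bdev controller.
ceph_ns_per_controller() {
    awk '$1 ~ /^\/dev\/nvme[0-9]+n[0-9]+$/ && /Ceph bdev Controller/ {
             ctrl = $1
             sub(/n[0-9]+$/, "", ctrl)   # e.g. /dev/nvme3n5 -> /dev/nvme3
             count[ctrl]++
         }
         END { for (c in count) print c, count[c] }' | sort
}

# Example usage on the client:
#   nvme list | ceph_ns_per_controller
```

For the listing above this reports a single controller, /dev/nvme3, with five namespaces, which matches the observation that only one gateway's disks remain visible.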
ANA group 2 becomes inaccessible on both GW1 and GW2
GW1
This is also related to the known RM issues. Several issues have already been opened for it, and a fix is expected soon.
Redeploying the NVMe-oF daemon makes the ANA group of the corresponding gateway inaccessible, and the corresponding NVMe disks disappear from the client.
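One way to confirm which disks disappear is to capture nvme list on the client before and after the redeploy and compare the two captures. The sketch below is illustrative only: the function name and the capture file names are hypothetical, and it assumes the device node is the first column of each data row.

```shell
# Hypothetical helper: prints device nodes present in the "before" capture
# but absent from the "after" capture.
#   $1: file holding `nvme list` output taken before the redeploy
#   $2: file holding `nvme list` output taken after the redeploy
missing_namespaces() {
    mn_before=$(mktemp)
    mn_after=$(mktemp)
    awk '$1 ~ /^\/dev\/nvme/ { print $1 }' "$1" | sort > "$mn_before"
    awk '$1 ~ /^\/dev\/nvme/ { print $1 }' "$2" | sort > "$mn_after"
    comm -23 "$mn_before" "$mn_after"   # lines unique to the "before" list
    rm -f "$mn_before" "$mn_after"
}

# Example usage (file names are placeholders):
#   nvme list > before.txt
#   ceph orch daemon redeploy <daemon name>
#   nvme list > after.txt
#   missing_namespaces before.txt after.txt
```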
Before failover
GW Info
ANA group 2 becomes inaccessible on both GW1 and GW2
GW1
GW2
RBD perf I/O stats and namespace io_stats drop to 0 on both GWs
GW2