ceph / ceph-nvmeof

Service to provide Ceph storage over NVMe-oF/TCP protocol
GNU Lesser General Public License v3.0

NVMe initiator disconnects and the nvme disks on the stopped gateway node disappear from the client post failback #515

Closed. manasagowri closed this issue 3 months ago.

manasagowri commented 3 months ago

Steps performed.

  1. Fail over one gateway by running ceph orch daemon stop for the corresponding gateway daemon
  2. Failover completes successfully and gateway 2 takes over the IOs
  3. Start the daemon again using ceph orch daemon start
  4. Failback also completes at the gateway level and the optimized path is restored to gateway 1
  5. However, the NVMe client disconnects and shuts down. After failback the nvme disks disappear from the client (checked with nvme list), IOs stop on the client, and no traffic is visible in either the nvme IO stats or rbd perf image iostat. nvme list returns an error. (The command sequence used is sketched after this list.)
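A minimal sketch of the command sequence used, reusing the daemon name, pool name, and device path that appear in the logs below (adjust for your own setup):

# 1. Fail over: stop the nvmeof daemon for gateway 1
ceph orch daemon stop nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv

# 2. Confirm gateway 2 has taken over and IOs continue
ceph orch ps --daemon-type nvmeof
rbd perf image iostat nvme_test

# 3. Fail back: start the daemon again
ceph orch daemon start nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv

# 4/5. Check path states and namespaces from the client
nvme list-subsys /dev/nvme1n1
nvme list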

Logs

GW details

GW 1 info

[root@ceph-rbd2-mytest-bgzmwr-node4 ~]# podman run quay.io/barakda1/nvmeof-cli:8677ba3 --server-address 10.0.103.241 --server-port 5500 gw info
CLI's version: 1.0.0
Gateway's version: 1.0.0
Gateway's name: client.nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv
Gateway's load balancing group: 2
Gateway's address: 10.0.103.241
Gateway's port: 5500
SPDK version: 23.01.1

GW2 info

[root@ceph-rbd2-mytest-bgzmwr-node5 ~]# podman run quay.io/barakda1/nvmeof-cli:8677ba3 --server-address 10.0.100.215 --server-port 5500 gw info
CLI's version: 1.0.0
Gateway's version: 1.0.0
Gateway's name: client.nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node5.gqqfhf
Gateway's load balancing group: 1
Gateway's address: 10.0.100.215
Gateway's port: 5500
SPDK version: 23.01.1

Disks optimized in GW1

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# nvme list-subsys /dev/nvme1n1
nvme-subsys1 - NQN=nqn.2016-06.io.spdk:cnode1
\
 +- nvme1 tcp traddr=10.0.103.241,trsvcid=4420,src_addr=10.0.100.13 live optimized
 +- nvme2 tcp traddr=10.0.100.215,trsvcid=4420,src_addr=10.0.100.13 live inaccessible

IO stats on GW1

[root@ceph-rbd2-mytest-bgzmwr-node4 ~]# podman run quay.io/barakda1/nvmeof-cli:8677ba3 --server-address 10.0.103.241 --server-port 5500 namespace get_io_stats -n nqn.2016-06.io.spdk:cnode1 --nsid 2
IO statistics for namespace 2 in nqn.2016-06.io.spdk:cnode1, bdev bdev_66011022-9f2b-4c1c-9174-673ce9a401d5:
╒═════════════════════════╤═════════════════╕
│ Stat                    │ Value           │
╞═════════════════════════╪═════════════════╡
│ Tick Rate               │ 2190000000      │
├─────────────────────────┼─────────────────┤
│ Ticks                   │ 335515753386311 │
├─────────────────────────┼─────────────────┤
│ Bytes Read              │ 4666368         │
├─────────────────────────┼─────────────────┤
│ Num Read Ops            │ 494             │
├─────────────────────────┼─────────────────┤
│ Bytes Written           │ 11517952        │
├─────────────────────────┼─────────────────┤
│ Num Write Ops           │ 518             │
├─────────────────────────┼─────────────────┤
│ Bytes Unmapped          │ 536870916096    │
├─────────────────────────┼─────────────────┤
│ Num Unmap Ops           │ 251             │
├─────────────────────────┼─────────────────┤
│ Read Latency Ticks      │ 2043603612      │
├─────────────────────────┼─────────────────┤
│ Max Read Latency Ticks  │ 23825706        │
├─────────────────────────┼─────────────────┤
│ Min Read Latency Ticks  │ 104272          │
├─────────────────────────┼─────────────────┤
│ Write Latency Ticks     │ 1998546166      │
├─────────────────────────┼─────────────────┤
│ Max Write Latency Ticks │ 64096730        │
├─────────────────────────┼─────────────────┤
│ Min Write Latency Ticks │ 143726          │
├─────────────────────────┼─────────────────┤
│ Unmap Latency Ticks     │ 2420620072      │
├─────────────────────────┼─────────────────┤
│ Max Unmap Latency Ticks │ 19421916        │
├─────────────────────────┼─────────────────┤
│ Min Unmap Latency Ticks │ 349844          │
├─────────────────────────┼─────────────────┤
│ Copy Latency Ticks      │ 0               │
├─────────────────────────┼─────────────────┤
│ Max Copy Latency Ticks  │ 0               │
├─────────────────────────┼─────────────────┤
│ Min Copy Latency Ticks  │ 0               │
├─────────────────────────┼─────────────────┤
│ IO Error                │ []              │
╘═════════════════════════╧═════════════════╛

IO stats on GW2

[root@ceph-rbd2-mytest-bgzmwr-node5 ~]# podman run quay.io/barakda1/nvmeof-cli:8677ba3 --server-address 10.0.100.215 --server-port 5500 namespace get_io_stats -n nqn.2016-06.io.spdk:cnode1 --nsid 1
IO statistics for namespace 1 in nqn.2016-06.io.spdk:cnode1, bdev bdev_fef1765d-e570-4149-b218-37c5542c0a7b:
╒═════════════════════════╤═════════════════╕
│ Stat                    │ Value           │
╞═════════════════════════╪═════════════════╡
│ Tick Rate               │ 2190000000      │
├─────────────────────────┼─────────────────┤
│ Ticks                   │ 335390595122035 │
├─────────────────────────┼─────────────────┤
│ Bytes Read              │ 36864           │
├─────────────────────────┼─────────────────┤
│ Num Read Ops            │ 2               │
├─────────────────────────┼─────────────────┤
│ Bytes Written           │ 0               │
├─────────────────────────┼─────────────────┤
│ Num Write Ops           │ 0               │
├─────────────────────────┼─────────────────┤
│ Bytes Unmapped          │ 0               │
├─────────────────────────┼─────────────────┤
│ Num Unmap Ops           │ 0               │
├─────────────────────────┼─────────────────┤
│ Read Latency Ticks      │ 3118744         │
├─────────────────────────┼─────────────────┤
│ Max Read Latency Ticks  │ 1647468         │
├─────────────────────────┼─────────────────┤
│ Min Read Latency Ticks  │ 1471276         │
├─────────────────────────┼─────────────────┤
│ Write Latency Ticks     │ 0               │
├─────────────────────────┼─────────────────┤
│ Max Write Latency Ticks │ 0               │
├─────────────────────────┼─────────────────┤
│ Min Write Latency Ticks │ 0               │
├─────────────────────────┼─────────────────┤
│ Unmap Latency Ticks     │ 0               │
├─────────────────────────┼─────────────────┤
│ Max Unmap Latency Ticks │ 0               │
├─────────────────────────┼─────────────────┤
│ Min Unmap Latency Ticks │ 0               │
├─────────────────────────┼─────────────────┤
│ Copy Latency Ticks      │ 0               │
├─────────────────────────┼─────────────────┤
│ Max Copy Latency Ticks  │ 0               │
├─────────────────────────┼─────────────────┤
│ Min Copy Latency Ticks  │ 0               │
├─────────────────────────┼─────────────────┤
│ IO Error                │ []              │
╘═════════════════════════╧═════════════════╛

RBD perf image iostat output

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# rbd perf image iostat nvme_test
rbd: waiting for initial image stats
NAME       WR    RD    WR_BYTES   RD_BYTES     WR_LAT    RD_LAT 
image_1  88/s   0/s   165 MiB/s      0 B/s   56.32 ms   0.00 ns 
image_3   1/s   0/s    29 KiB/s      0 B/s   69.61 ms   0.00 ns 
image_2   1/s   0/s    29 KiB/s      0 B/s   26.38 ms   0.00 ns 
NAME       WR    RD    WR_BYTES   RD_BYTES     WR_LAT    RD_LAT 
image_1  88/s   0/s   164 MiB/s      0 B/s   56.80 ms   0.00 ns 
image_2   1/s   0/s    29 KiB/s      0 B/s   50.84 ms   0.00 ns 
image_3   1/s   0/s   5.6 KiB/s      0 B/s   28.96 ms   0.00 ns 
image_0   0/s   0/s   3.2 KiB/s      0 B/s   27.17 ms   0.00 ns 
image_4   0/s   0/s   3.2 KiB/s      0 B/s   20.83 ms   0.00 ns 
NAME        WR    RD    WR_BYTES   RD_BYTES     WR_LAT    RD_LAT 
image_1  101/s   0/s   169 MiB/s      0 B/s   56.96 ms   0.00 ns 
image_0    1/s   0/s   3.2 KiB/s      0 B/s   66.98 ms   0.00 ns 
image_3    0/s   0/s   2.4 KiB/s      0 B/s   26.22 ms   0.00 ns 
image_4    0/s   0/s   3.2 KiB/s      0 B/s   16.06 ms   0.00 ns 
image_2    0/s   0/s       0 B/s      0 B/s   12.39 ms   0.00 ns 

rpc listener list on GW1

[root@ceph-rbd2-mytest-bgzmwr-node4 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode1 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.103.241",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 2,
        "ana_state": "optimized"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"
[root@ceph-rbd2-mytest-bgzmwr-node4 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode2 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.103.241",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 2,
        "ana_state": "optimized"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"

rpc listener list on GW2

[root@ceph-rbd2-mytest-bgzmwr-node5 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode1 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.100.215",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "optimized"
      },
      {
        "ana_group": 2,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"
[root@ceph-rbd2-mytest-bgzmwr-node5 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode2 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.100.215",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "optimized"
      },
      {
        "ana_group": 2,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"

Failover was triggered by performing ceph orch daemon stop for node4

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# ceph orch daemon stop nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv
Scheduled to stop nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv on host 'ceph-rbd2-mytest-bgzmwr-node4'

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# ceph orch ps --daemon-type nvmeof
NAME                                                HOST                           PORTS             STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION    IMAGE ID      CONTAINER ID  
nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv  ceph-rbd2-mytest-bgzmwr-node4  *:5500,4420,8009  error             1s ago  20m        -        -  <unknown>  <unknown>     <unknown>     
nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node5.gqqfhf  ceph-rbd2-mytest-bgzmwr-node5  *:5500,4420,8009  running (20m)     9m ago  20m     122M        -  1.0.0      a647a0311a69  1878bdf5d0f2

IOs stop for a few seconds and then restart

NAME       WR    RD    WR_BYTES   RD_BYTES     WR_LAT    RD_LAT 
image_2  55/s   0/s    93 MiB/s      0 B/s   63.12 ms   0.00 ns 
image_4   0/s   0/s   2.4 KiB/s      0 B/s    9.87 ms   0.00 ns 
image_1   0/s   0/s       0 B/s      0 B/s   36.30 ms   0.00 ns 
image_3   0/s   0/s       0 B/s      0 B/s    2.60 ms   0.00 ns 
image_0   0/s   0/s       0 B/s      0 B/s    2.37 ms   0.00 ns 
NAME  WR   RD   WR_BYTES   RD_BYTES   WR_LAT   RD_LAT 
NAME  WR   RD   WR_BYTES   RD_BYTES   WR_LAT   RD_LAT 
NAME       WR    RD    WR_BYTES   RD_BYTES     WR_LAT    RD_LAT 
image_2  18/s   0/s   9.5 MiB/s      0 B/s   37.09 ms   0.00 ns 
image_3   0/s   0/s   3.2 KiB/s      0 B/s   15.92 ms   0.00 ns 
image_0   0/s   0/s   3.2 KiB/s      0 B/s   12.68 ms   0.00 ns 
image_4   0/s   0/s   2.4 KiB/s      0 B/s    4.84 ms   0.00 ns 
image_1   0/s   0/s   3.2 KiB/s      0 B/s    3.03 ms   0.00 ns 

IO stats are now visible on node5 (GW2)

[root@ceph-rbd2-mytest-bgzmwr-node5 ~]# podman run quay.io/barakda1/nvmeof-cli:8677ba3 --server-address 10.0.100.215 --server-port 5500 namespace get_io_stats -n nqn.2016-06.io.spdk:cnode1 --nsid 1
IO statistics for namespace 1 in nqn.2016-06.io.spdk:cnode1, bdev bdev_fef1765d-e570-4149-b218-37c5542c0a7b:
╒═════════════════════════╤═════════════════╕
│ Stat                    │ Value           │
╞═════════════════════════╪═════════════════╡
│ Tick Rate               │ 2190000000      │
├─────────────────────────┼─────────────────┤
│ Ticks                   │ 335664960402915 │
├─────────────────────────┼─────────────────┤
│ Bytes Read              │ 36864           │
├─────────────────────────┼─────────────────┤
│ Num Read Ops            │ 2               │
├─────────────────────────┼─────────────────┤
│ Bytes Written           │ 32768           │
├─────────────────────────┼─────────────────┤
│ Num Write Ops           │ 7               │
├─────────────────────────┼─────────────────┤
│ Bytes Unmapped          │ 0               │
├─────────────────────────┼─────────────────┤
│ Num Unmap Ops           │ 0               │
├─────────────────────────┼─────────────────┤
│ Read Latency Ticks      │ 3118744         │
├─────────────────────────┼─────────────────┤
│ Max Read Latency Ticks  │ 1647468         │
├─────────────────────────┼─────────────────┤
│ Min Read Latency Ticks  │ 1471276         │
├─────────────────────────┼─────────────────┤
│ Write Latency Ticks     │ 2671793062      │
├─────────────────────────┼─────────────────┤
│ Max Write Latency Ticks │ 890747594       │
├─────────────────────────┼─────────────────┤
│ Min Write Latency Ticks │ 311752          │
├─────────────────────────┼─────────────────┤
│ Unmap Latency Ticks     │ 0               │
├─────────────────────────┼─────────────────┤
│ Max Unmap Latency Ticks │ 0               │
├─────────────────────────┼─────────────────┤
│ Min Unmap Latency Ticks │ 0               │
├─────────────────────────┼─────────────────┤
│ Copy Latency Ticks      │ 0               │
├─────────────────────────┼─────────────────┤
│ Max Copy Latency Ticks  │ 0               │
├─────────────────────────┼─────────────────┤
│ Min Copy Latency Ticks  │ 0               │
├─────────────────────────┼─────────────────┤
│ IO Error                │ []              │
╘═════════════════════════╧═════════════════╛

Both ANA groups are now optimized on GW2

[root@ceph-rbd2-mytest-bgzmwr-node5 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode2 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.100.215",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "optimized"
      },
      {
        "ana_group": 2,
        "ana_state": "optimized"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"
[root@ceph-rbd2-mytest-bgzmwr-node5 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode1 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.100.215",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "optimized"
      },
      {
        "ana_group": 2,
        "ana_state": "optimized"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"

After failback

After starting the GW daemon again, ANA group 2 became optimized on GW1. However, the client node got disconnected, and when I logged back in, the nvme list command gave an error.
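For reference, a minimal sketch of the failback step, reusing the daemon name from the ceph orch ps output above:

# start the stopped nvmeof daemon on gateway 1 again
ceph orch daemon start nvmeof.nvmeof.ceph-rbd2-mytest-bgzmwr-node4.cuatjv

# verify the daemon is back in running state
ceph orch ps --daemon-type nvmeof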

GW1

[root@ceph-rbd2-mytest-bgzmwr-node4 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode2 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.103.241",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 2,
        "ana_state": "optimized"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"
[root@ceph-rbd2-mytest-bgzmwr-node4 src]# /usr/libexec/spdk/scripts/rpc.py  nvmf_subsystem_get_listeners nqn.2016-06.io.spdk:cnode1 | head -n 24
[
  {
    "address": {
      "trtype": "TCP",
      "adrfam": "IPv4",
      "traddr": "10.0.103.241",
      "trsvcid": "4420"
    },
    "ana_states": [
      {
        "ana_group": 1,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 2,
        "ana_state": "optimized"
      },
      {
        "ana_group": 3,
        "ana_state": "inaccessible"
      },
      {
        "ana_group": 4,
        "ana_state": "inaccessible"

Client

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# client_loop: send disconnect: Broken pipe

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# nvme list
Failed to scan topology: No such file or directory

[root@ceph-rbd2-mytest-bgzmwr-node6 ~]# lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
sr0     11:0    1  514K  0 rom  
vda    252:0    0   80G  0 disk 
├─vda1 252:1    0    1M  0 part 
├─vda2 252:2    0  200M  0 part /efi
│                               /boot/efi
├─vda3 252:3    0  600M  0 part /boot
└─vda4 252:4    0 79.2G  0 part /
manasagowri commented 3 months ago

dmesg.txt

Attached dmesg output for the client node.

caroav commented 3 months ago

Which connect command was used?

caroav commented 3 months ago

@manasagowri did you use this command to connect: nvme connect-all --traddr=<IP> --transport=tcp -l 3600? If not, please do so and retry; a sketch of the sequence follows the list below.

Also after connecting please provide the following details:

  1. nvme list-subsys
  2. as you provided already, the nvme list-subsys output for the individual devices
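A minimal sketch of the suggested reconnect and verification, assuming GW1's address from this issue as the target (the -l 3600 controller-loss timeout follows the command given above):

# connect to all subsystems via discovery; gateway address assumed from the gw info output above
nvme connect-all --traddr=10.0.103.241 --transport=tcp -l 3600

# verify paths and ANA states once connected
nvme list-subsys
nvme list-subsys /dev/nvme1n1
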
caroav commented 3 months ago

Fixed with 1.2.1.