NearNodeFlash / NearNodeFlash.github.io


I/O error leaves zpool suspended #60

Closed · matthew-richerson closed this issue 4 months ago

matthew-richerson commented 1 year ago

From https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/32#issuecomment-1542975829

We've also now seen one case where ZFS logged write errors against one of the NVMe devices. Since the pool isn't configured with any redundancy, it was suspended. It's not clear to me from the console logs exactly why an I/O error was reported, but with filesystems being created and destroyed all the time, the namespaces are being scanned frequently.

  pool: zb7f53d5-ostpool-0
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

NAME                STATE     READ WRITE CKSUM
zb7f53d5-ostpool-0  ONLINE       0     0     0
  nvme16n1          ONLINE       0     4     0 <<<<<
  nvme15n1          ONLINE       0     0     0
  nvme3n1           ONLINE       0     0     0
  nvme10n1          ONLINE       0     0     0
  nvme14n1          ONLINE       0     0     0
  nvme7n1           ONLINE       0     0     0
  nvme2n1           ONLINE       0     0     0
  nvme11n1          ONLINE       0     0     0
  nvme5n1           ONLINE       0     0     0
  nvme1n1           ONLINE       0     0     0
  nvme8n1           ONLINE       0     0     0
  nvme13n1          ONLINE       0     0     0
  nvme9n1           ONLINE       0     0     0
  nvme12n1          ONLINE       0     0     0
  nvme4n1           ONLINE       0     0     0
  nvme0n1           ONLINE       0     0     0
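
For reference, the recovery action suggested by the status output above would be run against this pool as follows, though as behlendorf explains later in the thread, a pool with multihost=on may refuse to resume this way:

    zpool clear zb7f53d5-ostpool-0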
matthew-richerson commented 1 year ago

It would be good to get to the bottom of the I/O errors from the device. Would you also like to pursue setting up a redundant zpool, or is that not desirable for performance reasons?
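
For illustration, a redundant layout could group the same namespaces into raidz vdevs instead of striping them individually. This is only a sketch of what such a pool might look like, not a configuration taken from this system; the grouping below is arbitrary:

    zpool create -o multihost=on zb7f53d5-ostpool-0 \
        raidz nvme0n1 nvme1n1 nvme2n1 nvme3n1 \
        raidz nvme4n1 nvme5n1 nvme7n1 nvme8n1

A raidz vdev survives a single device failure at the cost of parity writes, which is the performance trade-off the question alludes to.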

behlendorf commented 1 year ago

I was able to run this down today. These errors were caused by the following chain of events:

1) The ZFS Event Daemon (ZED) actively monitors udev to detect device hotplug events.
2) When an NVMe namespace is removed, a udev "remove" event is generated. It appears that re-scanning the NVMe namespace can also generate spurious remove events (see the example after this list).
3) The ZED will act on any of these hotplug events and mark the device as removed.
4) Subsequent I/O to the vdev will fail with ENXIO and be logged as write errors.
5) Because the pool is configured without redundancy, it will be SUSPENDED.
6) And because the multihost=on property is set, the pool may not be resumed with zpool clear, forcing a reboot.
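
As an illustration of step 2, one way to watch for these events is to monitor udev while triggering a namespace rescan. This is a sketch assuming the udevadm and nvme-cli tools are installed; /dev/nvme0 is a placeholder controller:

    # In one terminal: print block-subsystem udev events as they arrive
    udevadm monitor --udev --subsystem-match=block

    # In another terminal: ask the controller to rescan its namespaces
    nvme ns-rescan /dev/nvme0

If the rescan emits a "remove" followed by an "add" for a namespace that never actually went away, ZED will already have marked the vdev removed by the time the device reappears.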

For the moment I've disabled the zfs-zed systemd service to prevent this.
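
The workaround presumably amounts to something like the following (unit name taken from the comment above):

    systemctl disable --now zfs-zed.service

Note that disabling ZED also disables its other duties, such as event logging and hot-spare handling, so this is a stopgap rather than a fix.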

behlendorf commented 4 months ago

This was resolved in OpenZFS as of v2.1.11; see commit openzfs/zfs@577e835f30c9b92ed8126eb4e8fb17cb0e411c04.
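
To confirm whether a given node already carries the fix, the installed OpenZFS userland and kernel-module versions can be checked with:

    zfs version

Versions reporting 2.1.11 or later should include the commit referenced above.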