Early users appear to be able to trigger this issue fairly easily. We're seeing that the storage underneath a Lustre filesystem is occasionally released before the ZFS pool backing that Lustre filesystem is destroyed. When this happens, zpool status reports all of the devices as either UNAVAIL or REMOVED.
  pool: z5e680fc-mdtpool-0
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

        NAME                  STATE     READ WRITE CKSUM
        z5e680fc-mdtpool-0    UNAVAIL      0     0     0  insufficient replicas
          nvme15n8            UNAVAIL      3     1     0
          nvme16n9            UNAVAIL      3     1     0
          nvme8n9             UNAVAIL      3     1     0
          nvme5n8             UNAVAIL      3     1     0
          nvme2n8             UNAVAIL      3     1     0
          nvme7n8             UNAVAIL      3     1     0
          nvme14n8            REMOVED      0     0     0
          nvme9n8             REMOVED      0     0     0
          nvme4n8             REMOVED      0     0     0
          nvme1n8             REMOVED      0     0     0
          nvme12n8            REMOVED      0     0     0
          nvme10n8            REMOVED      0     0     0
          nvme13n8            REMOVED      0     0     0
          nvme0n8             REMOVED      0     0     0
          nvme11n8            UNAVAIL      3     1     0
          nvme3n8             UNAVAIL      3     1     0
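For reference, the kind of guard the teardown path could run before any namespaces are released is straightforward. The following is only a minimal sketch, assuming the pool name is passed in (the default below is just an example) and that the vdev paths can be read back from zpool list -vHP on these nodes:

```bash
#!/bin/bash
# Sketch: before the NVMe namespaces under a pool are released, confirm that
# every vdev backing the pool still exists as a block device.
# The default pool name is only an example; pass the pool actually being torn down.
pool="${1:-z5e680fc-mdtpool-0}"

# -v lists vdevs, -H gives scripted (tab-separated) output, -P prints full device paths.
zpool list -vHP "$pool" | awk 'NR > 1 {print $1}' | while read -r vdev; do
    [ -b "$vdev" ] || echo "WARNING: vdev $vdev backing $pool is gone"
done
```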
We've also now seen one case where ZFS logged write errors against one of the NVMe devices. Since the pool isn't configured with any redundancy, it was suspended. It's not clear to me from the console logs exactly why an I/O error was reported, but with filesystems being created and destroyed all the time, the namespaces are being scanned frequently.
  pool: zb7f53d5-ostpool-0
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

        NAME                  STATE     READ WRITE CKSUM
        zb7f53d5-ostpool-0    ONLINE       0     0     0
          nvme16n1            ONLINE       0     4     0  <<<<<
          nvme15n1            ONLINE       0     0     0
          nvme3n1             ONLINE       0     0     0
          nvme10n1            ONLINE       0     0     0
          nvme14n1            ONLINE       0     0     0
          nvme7n1             ONLINE       0     0     0
          nvme2n1             ONLINE       0     0     0
          nvme11n1            ONLINE       0     0     0
          nvme5n1             ONLINE       0     0     0
          nvme1n1             ONLINE       0     0     0
          nvme8n1             ONLINE       0     0     0
          nvme13n1            ONLINE       0     0     0
          nvme9n1             ONLINE       0     0     0
          nvme12n1            ONLINE       0     0     0
          nvme4n1             ONLINE       0     0     0
          nvme0n1             ONLINE       0     0     0
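For the next time this happens, a minimal sketch of how the ZFS event log and the kernel log could be correlated against the namespace create/delete churn; this assumes the events and the kernel ring buffer still cover the window in question:

```bash
#!/bin/bash
# Sketch: capture the ZFS I/O error events and the kernel's NVMe messages so
# their timestamps can be compared against the namespace scanning activity.
# Assumes the events and dmesg output haven't rotated away yet.
zpool events -v | grep -A 20 'ereport.fs.zfs.io' > /tmp/zfs-io-events.txt

# Kernel-side view of the same window: rescans, resets, and block-layer errors.
dmesg --ctime | grep -iE 'nvme|blk_update_request' > /tmp/nvme-dmesg.txt
```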
This GitHub issue has a lot of different symptoms described in it, so I created some new issues to track them individually:
I/O error: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/60
Flux not following DirectiveBreakdown constraints: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/59
Improved error reporting for fatal errors: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/46
I'll use this issue to track the NVMe namespace deletion without first deleting the zpool.
I put in a fix to properly clean up a failed Lustre target so that the zpool won't get stranded.
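For reference, the ordering the cleanup has to preserve is roughly the following. This is only a sketch; the pool name is taken from the output above as an example, and the controller device, controller ID, and namespace ID are placeholders rather than values from these nodes:

```bash
#!/bin/bash
# Sketch of the required teardown order: tear the zpool down while its vdevs
# still exist, and only then detach/delete the NVMe namespaces under it.
# Controller device, controller ID, and namespace ID below are placeholders.
pool="zb7f53d5-ostpool-0"
ctrl="/dev/nvme0"
nsid=1
ctrl_id=0

# 1. Destroy the pool first (this also tears down the datasets backing the Lustre target).
zpool destroy "$pool"

# 2. Only after the pool is gone, detach the namespace from the controller and delete it.
nvme detach-ns "$ctrl" --namespace-id="$nsid" --controllers="$ctrl_id"
nvme delete-ns "$ctrl" --namespace-id="$nsid"
```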
When using a single rabbit node and requesting multiple mgtmdtpools per node, it's understood that the workload should fail, because Lustre supports only a single MGS per node. However, asking for such a thing anyway uncovered a few problems.
1) As expected, the workflow couldn't complete the Setup phase and sets Status=Error. We know it's a fatal error in this case, but we still retry forever. Is there some field in the workflow to indicate this is fatal and the WLM should progress to Teardown?

2) After canceling the workflow the Rabbit storage is left in an inconsistent state. While blocking in the setup we can see 3 of the 4 pools were created.
Only two of these pools are destroyed during teardown. Furthermore, all of the NVMe namespaces are destroyed out from underneath the remaining ZFS pool, causing any interaction with that pool to hang, at which point the rabbit needs to be rebooted.
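On the question in 1), it isn't clear whether an existing field marks the error as fatal, but a WLM-side check might look roughly like the sketch below. The workflow name and the field paths (.status.status, .spec.desiredState) are assumptions about the Workflow resource, not confirmed against the CRD:

```bash
#!/bin/bash
# Hypothetical WLM-side check: if the workflow reports an error during Setup,
# stop retrying and drive it to Teardown instead.
# The workflow name and field paths here are assumptions, not confirmed.
wf="example-workflow"
ns="default"

state=$(kubectl get workflow "$wf" -n "$ns" -o jsonpath='{.status.status}')
if [ "$state" = "Error" ]; then
    kubectl patch workflow "$wf" -n "$ns" --type merge \
        -p '{"spec": {"desiredState": "Teardown"}}'
fi
```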