NearNodeFlash / NearNodeFlash.github.io


Issues when requesting multiple Lustre filesystems #32

Closed: behlendorf closed this issue 1 year ago

behlendorf commented 1 year ago

When using a single rabbit node and requesting multiple mgtmdt pools per node, it's understood the workload should fail because Lustre supports only a single MGS per node. However, asking for such a thing anyway uncovered a few problems.

#DW jobdw capacity=1TiB type=lustre name=test1
#DW jobdw capacity=1TiB type=lustre name=test2

1) As expected, the workflow could not complete the Setup phase and set Status=Error. We know the error is fatal in this case, but we still retry forever. Is there some field in the workflow to indicate that the error is fatal and that the WLM should progress to Teardown? (See the sketch at the end of this comment for the kind of check we'd like the WLM to be able to make.)

Error: Could not create file share: Error 500: Internal Server Error, Retry-Delay: 0s, Cause: File share
'default-fluxjob-162747711299257344-1-mgtmdt-0-0' failed to create, Internal Error: Error Running Command
'zpool create -O canmount=off -o cachefile=none z23ba99b-mgtmdtpool-0 /dev/nvme14n4 /dev/nvme16n4
/dev/nvme1n4 /dev/nvme2n4 /dev/nvme3n4 /dev/nvme5n4 /dev/nvme9n4 /dev/nvme0n4 /dev/nvme10n4
/dev/nvme11n4 /dev/nvme12n4 /dev/nvme13n4 /dev/nvme15n4 /dev/nvme4n4 /dev/nvme6n4 /dev/nvme7n4',
StdErr: /dev/nvme14n4 is in use and contains a unknown filesystem.                                                          
/dev/nvme16n4 is in use and contains a unknown filesystem.            
...

2) After canceling the workflow, the Rabbit storage is left in an inconsistent state. While the workflow was blocked in Setup, we could see that 3 of the 4 pools were created.

NAME                    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
z23ba99b-mgtmdtpool-0  3.50G  3.96M  3.50G        -        1G     0%     0%  1.00x    ONLINE  -
zc66c782-mgtmdtpool-0  3.50G  8.67M  3.49G        -        1G     0%     0%  1.00x    ONLINE  -
zc66c782-ostpool-0     1016G  8.50M  1016G        -         -     0%     0%  1.00x    ONLINE  -

Only two of these pools are destroyed during teardown. Furthermore, all of the NVMe namespaces are destroyed from underneath the remaining ZFS pool, causing any interaction with that pool to hang, at which point the rabbit needs to be rebooted.

NAME                    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
z23ba99b-mgtmdtpool-0  3.50G  3.96M  3.50G        -        1G     0%     0%  1.00x    ONLINE  -
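
To illustrate what we'd like the WLM to be able to do, here's a rough sketch against the DWS Workflow resource. The workflow name and namespace are guessed from the file share name in the error above, and the field paths are assumptions rather than anything verified:

# Hypothetical: poll the workflow and, once it reports an Error that we
# know is fatal, advance it to Teardown instead of retrying forever.
# Resource name, namespace, and field paths are assumptions.
status=$(kubectl get workflow fluxjob-162747711299257344 -n default \
         -o jsonpath='{.status.status}')
if [ "$status" = "Error" ]; then
    kubectl patch workflow fluxjob-162747711299257344 -n default \
        --type merge -p '{"spec":{"desiredState":"Teardown"}}'
fi

Today there's nothing that tells us the Error is fatal rather than transient, so a check like this would tear down on any error; that's the gap the question above is about.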
behlendorf commented 1 year ago

> Only two of these pools are destroyed during teardown. Furthermore, all of the NVMe namespaces are destroyed from underneath the remaining ZFS pool, causing any interaction with that pool to hang, at which point the rabbit needs to be rebooted.

Early users appear to be able to tickle this issue fairly easily. We're seeing that the storage underneath a Lustre filesystem is occasionally released before the ZFS pool backing that filesystem is destroyed. When this happens, zpool status reports all of the devices as either UNAVAIL or REMOVED.

  pool: z5e680fc-mdtpool-0
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

    NAME                STATE     READ WRITE CKSUM
    z5e680fc-mdtpool-0  UNAVAIL      0     0     0  insufficient replicas
      nvme15n8          UNAVAIL      3     1     0
      nvme16n9          UNAVAIL      3     1     0
      nvme8n9           UNAVAIL      3     1     0
      nvme5n8           UNAVAIL      3     1     0
      nvme2n8           UNAVAIL      3     1     0
      nvme7n8           UNAVAIL      3     1     0
      nvme14n8          REMOVED      0     0     0
      nvme9n8           REMOVED      0     0     0
      nvme4n8           REMOVED      0     0     0
      nvme1n8           REMOVED      0     0     0
      nvme12n8          REMOVED      0     0     0
      nvme10n8          REMOVED      0     0     0
      nvme13n8          REMOVED      0     0     0
      nvme0n8           REMOVED      0     0     0
      nvme11n8          UNAVAIL      3     1     0
      nvme3n8           UNAVAIL      3     1     0

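For anyone triaging a rabbit in this state, the quickest check we've found is the stock ZFS health summary; a pool whose namespaces were deleted out from under it shows up immediately:

# Print only pools that are not healthy; a stranded pool will be listed
# here as SUSPENDED/UNAVAIL without walking 'zpool status' for every
# pool on the node.
zpool status -x
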
We've also now seen one case where ZFS logged write errors against one of the NVMe devices. Since the pool isn't configured with any redundancy, it was suspended. It's not clear to me from the console logs exactly why an I/O error was reported, but with filesystems being created and destroyed all the time, the namespaces are being scanned frequently.

  pool: zb7f53d5-ostpool-0
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
config:

    NAME                STATE     READ WRITE CKSUM
    zb7f53d5-ostpool-0  ONLINE       0     0     0
      nvme16n1          ONLINE       0     4     0 <<<<<
      nvme15n1          ONLINE       0     0     0
      nvme3n1           ONLINE       0     0     0
      nvme10n1          ONLINE       0     0     0
      nvme14n1          ONLINE       0     0     0
      nvme7n1           ONLINE       0     0     0
      nvme2n1           ONLINE       0     0     0
      nvme11n1          ONLINE       0     0     0
      nvme5n1           ONLINE       0     0     0
      nvme1n1           ONLINE       0     0     0
      nvme8n1           ONLINE       0     0     0
      nvme13n1          ONLINE       0     0     0
      nvme9n1           ONLINE       0     0     0
      nvme12n1          ONLINE       0     0     0
      nvme4n1           ONLINE       0     0     0
      nvme0n1           ONLINE       0     0     0
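
If the namespaces turn out to still be present and reachable, the stock recovery step from the ZFS-8000-HC message is a clear; it resumes I/O on the pool but of course doesn't explain why the write error was reported in the first place:

# Clear the error counters and resume I/O on the suspended pool.
# Only sensible once the underlying devices are reachable again.
zpool clear zb7f53d5-ostpool-0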
matthew-richerson commented 1 year ago

This GitHub issue has a lot of different symptoms described in it, so I created some new issues to track them individually:

I/O error: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/60
Flux not following DirectiveBreakdown constraints: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/59
Improved error reporting for fatal errors: https://github.com/NearNodeFlash/NearNodeFlash.github.io/issues/46

I'll use this issue to track the NVMe namespace deletion without first deleting the zpool.
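
For context, the teardown ordering we need, sketched here with plain zpool and nvme-cli commands standing in for whatever the Rabbit software actually invokes (the pool name, controller device, and namespace ID are made up for illustration):

# Destroy the ZFS pool while its backing namespaces still exist...
zpool destroy z23ba99b-mgtmdtpool-0
# ...and only then delete the NVMe namespaces that backed it.
nvme delete-ns /dev/nvme0 --namespace-id=4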

matthew-richerson commented 1 year ago

I put in a fix to properly clean up a failed Lustre target so that the zpool won't get stranded.
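
In rough terms, the failure path needs to be symmetric with the success path: if creating the Lustre target fails partway through, the backing zpool should still be destroyed so it isn't left stranded when the namespaces are deleted later. Illustrative shell only; the real change lives in the Rabbit software, and the target and pool names below are made up:

# Illustrative only: if formatting the Lustre target fails, tear down the
# backing pool rather than leaving it behind for a later namespace delete
# to strand.
if ! mkfs.lustre --mgs --backfstype=zfs z23ba99b-mgtmdtpool-0/mgt; then
    zpool destroy -f z23ba99b-mgtmdtpool-0
fi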