NearNodeFlash / NearNodeFlash.github.io

View this document https://nearnodeflash.github.io/
Apache License 2.0

lustre teardown hangs when `zpool list` comes up empty #64

Closed by bdevcich 1 year ago

bdevcich commented 1 year ago

When deleting ZFS-backed Lustre filesystems, if zpool destroy reports a failure but still destroys the pool, the next deletion attempt fails at zpool list because the pool no longer exists.

Here's the critical bit from the nnf-node-manager logs.

Internal Error: Error Running Command 'zpool list z7dd9556-ostpool-0', StdErr: cannot open 'z7dd9556-ostpool-0': no such pool                {"NnfNodeStorage": "hetchy44/default-fluxjob-294844659107103744-0-ost-0"}   

And sure enough there's no pool by that name.

NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
za1dc258-mdtpool-0  3.50G  8.92M  3.49G        -        1G     1%     0%  1.00x    ONLINE  -
za1dc258-ostpool-0    14G  51.2M  13.9G        -        1G     9%     0%  1.00x    ONLINE  -

In this case, the zpool destroy failed but left no evidence as to why. On subsequent attempts, zpool list repeatedly fails, as shown in commands.log from inside the nnf-node-manager pod:

time="2023-05-22T23:16:29-07:00" level=info command="zpool list z7dd9556-ostpool-0"
time="2023-05-22T23:16:29-07:00" level=info command="zpool list z7dd9556-ostpool-0" error="<nil>" response="\"NAME                 SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT\\nz7dd9556-ostpool-0   120G   115G  4.70G        -         -    68%    96%  1.00x    ONLINE  -\\n\""
time="2023-05-22T23:16:29-07:00" level=info command="zpool destroy z7dd9556-ostpool-0"
time="2023-05-22T23:18:02-07:00" level=info command="zpool destroy z7dd9556-ostpool-0" error="Error Running Command 'zpool destroy z7dd9556-ostpool-0'" response="\"\""
time="2023-05-22T23:18:02-07:00" level=info command="zpool list z7dd9556-ostpool-0"
time="2023-05-22T23:18:02-07:00" level=info command="zpool list z7dd9556-ostpool-0" error="Error Running Command 'zpool list z7dd9556-ostpool-0', StdErr: cannot open 'z7dd9556-ostpool-0': no such pool\n" response="\"\""
time="2023-05-22T23:18:02-07:00" level=info command="zpool list z7dd9556-ostpool-0"
time="2023-05-22T23:18:02-07:00" level=info command="zpool list z7dd9556-ostpool-0" error="Error Running Command 'zpool list z7dd9556-ostpool-0', StdErr: cannot open 'z7dd9556-ostpool-0': no such pool\n" response="\"\""
...

bdevcich commented 1 year ago

In this case, the zpool destroy timed out after 90s. However, the destroy eventually worked. On the next deletion attempt, zpool list came up empty and caused Delete() to fail.

I see two issues here:

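  1. We're not properly logging the timeout: the zpool destroy error says only "Error Running Command" and never mentions a timeout.
  2. On deletion, we fail if zpool list returns 'no such pool', even though that means the pool is already gone.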
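
Both issues can be illustrated with a minimal Go sketch. This is not the nnf-node-manager code: `runner`, `execRunner`, `poolMissing`, `destroyPool`, and `errTimeout` are hypothetical names, and matching on the "no such pool" stderr text is an assumption based on the log lines above. The idea is to report a timeout as a timeout, and to treat a pool that is already gone as a successful delete.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// errTimeout marks a command that exceeded its deadline (hypothetical name).
var errTimeout = errors.New("command timed out")

// runner executes a command and returns its stdout, stderr, and error.
type runner func(name string, args ...string) (stdout, stderr string, err error)

// execRunner returns a runner that kills the command after timeout and
// says so in the error, instead of returning a bare command failure.
func execRunner(timeout time.Duration) runner {
	return func(name string, args ...string) (string, string, error) {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()

		var out, errBuf strings.Builder
		cmd := exec.CommandContext(ctx, name, args...)
		cmd.Stdout, cmd.Stderr = &out, &errBuf

		err := cmd.Run()
		if err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded) {
			err = fmt.Errorf("%w after %s: %s %s",
				errTimeout, timeout, name, strings.Join(args, " "))
		}
		return out.String(), errBuf.String(), err
	}
}

// poolMissing reports whether stderr carries the "no such pool" message
// seen in the logs above (assumed to be stable enough to match on).
func poolMissing(stderr string) bool {
	return strings.Contains(stderr, "no such pool")
}

// destroyPool is idempotent: a pool that no longer exists counts as deleted.
func destroyPool(run runner, pool string) error {
	if _, stderr, err := run("zpool", "list", pool); err != nil {
		if poolMissing(stderr) {
			return nil // an earlier destroy already finished
		}
		return fmt.Errorf("zpool list %s: %w (stderr: %q)", pool, err, stderr)
	}
	if _, stderr, err := run("zpool", "destroy", pool); err != nil {
		return fmt.Errorf("zpool destroy %s: %w (stderr: %q)", pool, err, stderr)
	}
	return nil
}

func main() {
	// Fake runner reproducing the state in the logs: the pool is already gone.
	gone := func(name string, args ...string) (string, string, error) {
		return "", "cannot open 'z7dd9556-ostpool-0': no such pool\n",
			errors.New("exit status 1")
	}
	fmt.Println(destroyPool(gone, "z7dd9556-ostpool-0")) // prints <nil>
}
```

In real use the fake runner would be replaced by something like `execRunner(90 * time.Second)`, so that a slow zpool destroy shows up in the log as a timeout, and the retry succeeds once zpool list reports the pool missing.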