NearNodeFlash / NearNodeFlash.github.io

Teardown blocked after system crash #100

Open behlendorf opened 1 year ago

behlendorf commented 1 year ago

While an ephemeral Lustre workflow was allocated, the rabbit-p node encountered a kernel panic (EDAC PCIe parity error). After the node was rebooted and all of the container services started, one workflow was left stuck in Teardown with a status of DriverWait. The reported error is:

Error Running Command 'zpool list ze39f27c-ostpool-0', StdErr: cannot open 'ze39f27c-ostpool-0': no such pool  

The issue here is that zfs pools are not automatically imported on boot. This means they won't show up in zpool list and thus cannot be destroyed. The pool needs to be imported with zpool import first, then zpool destroy can be run. Manually importing the pool on the rabbit node allowed the workflow to progress through Teardown. It seems like importing the pools should be part of recovery.

# Show all importable pools
zpool import
   pool: ze39f27c-ostpool-0
     id: 4274207450364223765
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

    ze39f27c-ostpool-0                       ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62R0A00JTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A01FTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62R0A00TTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A003TUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62R0A00YTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62R0A00KTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62R0A00VTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A014TUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A01BTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62R0A00WTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A01DTUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A013TUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A01ETUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A010TUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62S0A002TUG7  ONLINE
      nvme-KIOXIA_KCD71RJE1T92_62P0A011TUG7  ONLINE

# Import the pool ze39f27c-ostpool-0; it will then appear in `zpool list`.
zpool import ze39f27c-ostpool-0
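
Once imported, the pool shows up in `zpool list` and the controller's zpool destroy can succeed. A rough sketch of verifying the recovery (the manual destroy at the end is optional; in this case simply importing the pool was enough for Teardown to proceed):

# Confirm the imported pool is now visible
zpool list ze39f27c-ostpool-0

# Optional: destroy the pool by hand if Teardown still cannot progress
zpool destroy ze39f27c-ostpool-0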
behlendorf commented 1 year ago

Here's a little more information. After encountering this again, we noticed that not only was the original workflow blocked, but any subsequent workflow using this same rabbit node would also block in Setup. Those follow-up workflows would not progress until the one stuck in Teardown completed.

behlendorf commented 1 year ago

A bit more detail. Manually importing the pool and restarting the pod allowed recovery to complete successfully.

zpool import <pool name>

Optionally, zpool import without any arguments will display the list of zpools detected on the attached drives that can be imported. zpool import -a will attempt to import all available zpools.
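
For completeness, the recovery on an affected rabbit node amounts to something like the following; the namespace and label selector used to find the node controller pod are assumptions and will vary by deployment:

# Import every pool detected on the attached drives
zpool import -a

# Restart the nnf node controller pod so it retries the blocked Teardown
# (namespace and label selector are assumptions, adjust for your deployment)
kubectl delete pod -n nnf-system -l app=nnf-node-manager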

behlendorf commented 4 months ago

We bumped into this one today. In this case the pool was never created due to an error at create time. After canceling the workflow, it got hung in Teardown, repeatedly trying to destroy a pool that doesn't exist.

2024-02-07T17:38:45.697-0800    INFO    controllers.NnfNodeStorage    Recoverable Error
{"NnfNodeStorage": {"name":"default-fluxjob-755083942998050816-0-mdt-0","namespace":"tioga102"},
"Severity": "Major", "Message": "internal error: could not destroy block devices: could not destroy zpool zdfaefc5-mdt-0"}      

Creating a dummy pool with the expected name and restarting the pod doesn't seem to have worked this time.
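
For reference, the dummy-pool workaround just gives the controller something it can destroy. A minimal sketch, where the file-backed vdev and its size are illustrative and the pool name must match the one in the error:

# Create a small file-backed vdev and a placeholder pool with the expected name
truncate -s 512M /tmp/zdfaefc5-mdt-0.img
zpool create zdfaefc5-mdt-0 /tmp/zdfaefc5-mdt-0.img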

bdevcich commented 4 months ago

> We bumped into this one today. In this case the pool was never created due to an error at create time. After canceling the workflow, it got hung in Teardown, repeatedly trying to destroy a pool that doesn't exist.
>
> 2024-02-07T17:38:45.697-0800    INFO    controllers.NnfNodeStorage    Recoverable Error
> {"NnfNodeStorage": {"name":"default-fluxjob-755083942998050816-0-mdt-0","namespace":"tioga102"},
> "Severity": "Major", "Message": "internal error: could not destroy block devices: could not destroy zpool zdfaefc5-mdt-0"}
>
> Creating a dummy pool with the expected name and restarting the pod doesn't seem to have worked this time.

This part of it is fixed via https://github.com/NearNodeFlash/nnf-sos/pull/259 and is included in the v0.0.8 release.

bdevcich commented 4 months ago

> A bit more detail. Manually importing the pool and restarting the pod allowed recovery to complete successfully.
>
> zpool import <pool name>
>
> Optionally, zpool import without any arguments will display the list of zpools detected on the attached drives that can be imported. zpool import -a will attempt to import all available zpools.

I plan to put in a zpool import -a on the startup of the node controller to fix the other part of this.
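
As a sketch of what that could look like, assuming the node controller's container entrypoint can be wrapped in a small shell script (the binary path is illustrative):

#!/bin/sh
# Import any pools left behind by a crash before the controller starts;
# tolerate failure so a node with no pools still starts the controller.
zpool import -a || true
exec /usr/local/bin/nnf-node-manager "$@"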

bdevcich commented 3 months ago

> I plan to put in a zpool import -a on the startup of the node controller to fix the other part of this.

Fixed via: