Open behlendorf opened 1 year ago
Here's a little more information. After encountering this again, we noticed that not only was the original workflow blocked, but any subsequent workflow using this same rabbit node would also block in Setup. Those follow-up workflows would not progress until the one in Teardown completed.
A bit more detail. Manually importing the pool and restarting the pod allowed recovery to complete successfully.

```
zpool import <pool name>
```

Optionally, `zpool import` without any arguments will display the list of zpools detected on the attached drives which can be imported. `zpool import -a` will attempt to import all available zpools.
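In case it's useful, here is a minimal sketch of that manual recovery as a shell session on the rabbit node; the namespace and pod name are placeholders for whichever pod hosts the NnfNodeStorage controller, not details taken from this thread.

```sh
# Import the pool the stuck workflow is waiting on; the name comes from the
# NnfNodeStorage error message.
zpool import <pool name>

# Restart the pod that runs the node-level controller so it retries recovery.
# Namespace and pod name are placeholders.
kubectl delete pod -n <namespace> <controller pod for this rabbit node>
```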
We bumped into this one today. In this case the pool was never created due to an error at create time. After canceling the workflow, it got hung in Teardown, repeatedly trying to destroy a pool which doesn't exist.
```
2024-02-07T17:38:45.697-0800 INFO controllers.NnfNodeStorage Recoverable Error
{"NnfNodeStorage": {"name":"default-fluxjob-755083942998050816-0-mdt-0","namespace":"tioga102"},
"Severity": "Major", "Message": "internal error: could not destroy block devices: could not destroy zpool zdfaefc5-mdt-0"}
```
Creating a dummy pool with the expected name and restarting the pod doesn't seem to have worked this time.
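For completeness, the attempted workaround was roughly the following; the spare device is a placeholder, and the pool name is the one from the error above.

```sh
# Create a placeholder pool with the name Teardown is trying to destroy, so
# the destroy has something to act on, then restart the controller pod.
# <spare device> is a placeholder for an unused disk or partition.
zpool create zdfaefc5-mdt-0 <spare device>
kubectl delete pod -n <namespace> <controller pod for this rabbit node>
```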
This part of it is fixed via https://github.com/NearNodeFlash/nnf-sos/pull/259 and is included in the v0.0.8 release.
I plan to put in a `zpool import -a` at startup of the node controller to fix the other part of this.
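For context, a rough sketch of what that startup step might do; this is not the actual change, just the idea stated above plus a note on zpool behavior.

```sh
# Run once when the node controller starts, before any reconcile work:
# scan the attached drives and import every pool found, so that zpool list
# and zpool destroy can see pools that existed before the reboot.
zpool import -a
# Note: -f (force) may also be needed for pools that were never cleanly
# exported (e.g. after a kernel panic) if the hostid check blocks the import.
```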
Fixed via:
While an ephemeral Lustre workflow was allocated, the rabbit-p node encountered a kernel panic (EDAC PCIe parity error). After the node was rebooted and all of the container services started, one workflow was left stuck in Teardown with a status of DriverWait. The reported error is:
The issue here is that ZFS pools are not automatically imported on boot. This means they won't show up in `zpool list` and thus cannot be destroyed. The pool needs to be imported with `zpool import` first, then `zpool destroy` can be run. Manually importing the pool on the rabbit node allowed the workflow to progress through Teardown. It seems like importing the pools should be part of recovery.
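To make the ordering concrete, a minimal sketch of the recovery sequence on the rabbit node after the reboot, with the pool name as a placeholder:

```sh
zpool list              # the workflow's pool is absent: it was never re-imported after boot
zpool destroy <pool>    # fails with "no such pool"
zpool import <pool>     # import it from the attached drives first
zpool destroy <pool>    # now the destroy succeeds and Teardown can proceed
```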