Closed tobg closed 1 year ago
I can see the keys in etcd:
$ etcdctl get / --prefix --keys-only | grep NODE_STOR_POOL
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:DFLTDISKLESSSTORPOOL/DRIVER_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:DFLTDISKLESSSTORPOOL/EXTERNAL_LOCKING
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:DFLTDISKLESSSTORPOOL/FREE_SPACE_MGR_DSP_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:DFLTDISKLESSSTORPOOL/FREE_SPACE_MGR_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:DFLTDISKLESSSTORPOOL/UUID
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-ELK-HA/DRIVER_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-ELK-HA/EXTERNAL_LOCKING
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-ELK-HA/FREE_SPACE_MGR_DSP_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-ELK-HA/FREE_SPACE_MGR_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-ELK-HA/UUID
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-HA/DRIVER_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-HA/EXTERNAL_LOCKING
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-HA/FREE_SPACE_MGR_DSP_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-HA/FREE_SPACE_MGR_NAME
/LINSTOR/NODE_STOR_POOL/A2.TOBG.SERVICES:LVM-MAX-HA/UUID
11:38:36.592 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
11:39:01.041 [Main] ERROR LINSTOR/Controller - SYSTEM - Database entry of table NODE_STOR_POOL could not be restored. [Report number 64218037-00000-000000]
11:39:01.055 [Main] ERROR LINSTOR/Controller - SYSTEM - Unhandled exception [Report number 64218037-00000-000001]
11:39:01.056 [Thread-2] INFO LINSTOR/Controller - SYSTEM - Shutdown in progress
"Database entry of table [...] could not be restored" is quite a generic error. Either some data is missing or some values could not be parsed properly (it might be an invalid name, or even an unnoticed NullPointer).
The two mentioned error reports would be interesting. Would you mind sharing those?
linstor err show 64218037-00000-000000
linstor err show 64218037-00000-000001
Also, if you want, you can send me a dump of your etcd via email (please find my email address in my profile); then I can take a closer look (and hopefully reproduce the issue). Please make sure to also include the values :)
Hi, thank you. I very much appreciate it!
As the controller pod crashes, I'm unsure how to get the error report. The error above is with linstor 1.21.1. With earlier linstor versions, I got a NullPointer.
The etcd dump is 1.7 GB, which I think is way too large for email. I'll upload it to a cloud service and email you the link.
Ok. First off - next time please zip the etcd dump, that would have saved us 1.5 gb :)
Next. Did you by chance try to delete the storage pools lvm-max-elk-ha, lvm-max-ha and lvm-max-svc02 on any node(s) with Linstor 1.20.1 or 1.20.2?
There was a bug where Linstor accidentally deleted a bit too much during "linstor sp d ...". The bug has been fixed since 1.20.3, but it was only detectable when restarting the controller.
I got your ETCD-dump working again by restoring the missing keys (fortunately those were very trivial keys to restore):
(the UUIDs are randomized, not restored)
ETCDCTL_API=3 etcdctl put "/LINSTOR/STOR_POOL_DEFINITIONS/LVM-MAX-ELK-HA/POOL_DSP_NAME" "lvm-max-elk-ha"
ETCDCTL_API=3 etcdctl put "/LINSTOR/STOR_POOL_DEFINITIONS/LVM-MAX-ELK-HA/UUID" 0d43fef7-b537-4632-92f5-9ccc04721059
ETCDCTL_API=3 etcdctl put "/LINSTOR/STOR_POOL_DEFINITIONS/LVM-MAX-HA/POOL_DSP_NAME" "lvm-max-ha"
ETCDCTL_API=3 etcdctl put "/LINSTOR/STOR_POOL_DEFINITIONS/LVM-MAX-HA/UUID" aae62608-cc1a-4f36-82db-ff8029574324
ETCDCTL_API=3 etcdctl put "/LINSTOR/STOR_POOL_DEFINITIONS/LVM-MAX-SVC02/POOL_DSP_NAME" "lvm-max-svc02"
ETCDCTL_API=3 etcdctl put "/LINSTOR/STOR_POOL_DEFINITIONS/LVM-MAX-SVC02/UUID" 636eeb0e-a8ad-4860-b571-bcbe14da8b9b
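The six commands above follow one pattern per pool: a POOL_DSP_NAME key and a UUID key. As a rough sketch only, the pattern could be wrapped in a small helper that prints the commands for review instead of running them. The function name print_restore_commands is hypothetical, and the assumption that the key segment is simply the upper-cased display name is taken from the three pools above, nothing more.

```shell
# Hypothetical helper: prints (does not execute) the two "etcdctl put"
# commands for one storage pool definition.
#   $1: pool display name as it should appear in POOL_DSP_NAME
#   $2: UUID to store (the UUIDs above were randomized, not restored)
# Assumption: the key segment is the upper-cased display name, as in the
# six commands above.
print_restore_commands() {
  dsp_name=$1
  uuid=$2
  key_name=$(printf '%s' "$dsp_name" | tr '[:lower:]' '[:upper:]')
  base="/LINSTOR/STOR_POOL_DEFINITIONS/$key_name"
  printf 'ETCDCTL_API=3 etcdctl put "%s/POOL_DSP_NAME" "%s"\n' "$base" "$dsp_name"
  printf 'ETCDCTL_API=3 etcdctl put "%s/UUID" %s\n' "$base" "$uuid"
}

# Example (review the output before running it against etcd):
#   print_restore_commands lvm-max-svc02 "$(uuidgen)"
```

Printing rather than executing keeps the step reviewable, which seems prudent when writing directly into the controller's database.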
Let me know if that fixes the issue for you.
Thank you very much! Issue solved.
Sorry for not creating a zip. :)
I only was able to use 1.20.0 and never used 1.20.1 or 1.20.2.
Please tell us how you were able to find the missing entries. I set up a management pod with the linstor controller in debug mode and tried to get the missing keys, but I could not find any hint. Also, the error report didn't help.
Since I did not know which version you ran, I started your DB with the current version, which includes a fairly recent patch that adds more detail to the error report when a DB loading exception occurs. To be more precise, the report now includes lines like this:
Error message: Database entry of table NODE_STOR_POOL could not be restored.
ErrorContext: Details: Primary key: NODE_NAME = 'K8W50.TOBG.SERVICES', POOL_NAME = 'LVM-MAX-SVC02'
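If the report text is available (for example from "linstor err show", or from wherever the controller writes its reports), the pool name can be pulled out of such a line mechanically. This is only a sketch: the function name pool_name_from_error_context is made up for illustration, and it assumes the "Primary key:" line has exactly the format quoted above.

```shell
# Hypothetical helper: reads error-report text on stdin and prints the
# POOL_NAME value from a "Primary key:" details line, if there is one.
# Assumption: the line has the format quoted above.
pool_name_from_error_context() {
  sed -n "s/.*Primary key:.*POOL_NAME = '\([^']*\)'.*/\1/p"
}

# Example:
#   linstor err show 64218037-00000-000000 | pool_name_from_error_context
```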
Additionally, the line number from the stack trace (of the same ErrorReport) told me that either the node or the storage pool definition was null (i.e. was not properly loaded). It's unlikely that the node was null, since that should already have caused issues when loading earlier tables. So with "it might be the storage pool definition" in mind, I simply started looking for keys in etcd using something like:
ETCDCTL_API=3 etcdctl get --prefix "/LINSTOR/STOR_POOL_DEFINITIONS/"
And since the POOL_NAME mentioned in the error report was nowhere to be found, I just gave it a try and created it. As mentioned, I was lucky that these were very trivial entries to restore, where I could literally randomize the UUID; that is not always the case, and entries usually need more key/value pairs to work properly. That worked... until the next POOL_NAME caused the same issue. Repeating this step two more times actually solved the issue.
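The fix-one-restart-repeat loop described here could probably be shortened by comparing both key prefixes in one pass. The following is a minimal sketch, not anything LINSTOR ships: missing_pool_definitions is a hypothetical name, and it assumes the key layouts seen in this thread (/LINSTOR/NODE_STOR_POOL/<NODE>:<POOL>/<FIELD> and /LINSTOR/STOR_POOL_DEFINITIONS/<POOL>/<FIELD>).

```shell
# Hypothetical helper: reads a keys-only etcd listing on stdin and prints,
# sorted, every pool name that NODE_STOR_POOL keys reference but that has
# no key under STOR_POOL_DEFINITIONS.
missing_pool_definitions() {
  awk -F/ '
    # Key layout: /LINSTOR/NODE_STOR_POOL/<NODE>:<POOL>/<FIELD>
    $2 == "LINSTOR" && $3 == "NODE_STOR_POOL" && NF >= 5 {
      colon = index($4, ":")
      if (colon > 0) used[substr($4, colon + 1)] = 1
    }
    # Key layout: /LINSTOR/STOR_POOL_DEFINITIONS/<POOL>/<FIELD>
    $2 == "LINSTOR" && $3 == "STOR_POOL_DEFINITIONS" && NF >= 5 {
      has_def[$4] = 1
    }
    END {
      for (pool in used)
        if (!(pool in has_def)) print pool
    }
  ' | sort
}

# Example:
#   ETCDCTL_API=3 etcdctl get --prefix --keys-only "/LINSTOR/" | missing_pool_definitions
```

This only finds the missing names. Whether re-creating the keys is safe still depends on the entries being as trivial as they were in this case.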
Happy to hear that your issue has been solved :)
Hello, I'm using the Piraeus operator. The LINSTOR controller cannot load the NODE_STOR_POOL table.