chemicstry closed this issue 1 year ago
It seems that the storage pool was deleted, but volumes/resources that referenced said storage pool were still lingering in Kubernetes CRDs. After manually deleting everything that referenced the storage pool, I was able to bring the controller back up.
So the issue here is twofold:
I believe this was fixed in 1.20.3:
- Fixed storage pool definition delete not checking if there are still other storage pools

There was a pending storage pool delete operation after changing piraeus-operator values, but it couldn't execute because there were still volumes in the storage pool. However, when I deleted the 3-replica volume and only the 1-replica volume was left, the storage pool delete operation succeeded on nodes that were empty (due to a bug fixed in 1.20.3).
I will close this issue, but it would be nice to have some kind of recovery mode rather than a total failure if something like this happens.
- LINSTOR can't recover from a corrupted database and crashes. I think this is a big one. I would expect the controller to throw errors about corrupted database entries (e.g. a resource with a missing storage pool), but it should still start and load everything else that is valid.
The reason why it does not is that if a resource exists, but its entries are corrupted in LINSTOR's data, then ignoring the entries would cause more errors, e.g. because the unique minor number or node ID for DRBD resources would then be unknown to the controller, and creating new resources would reuse those numbers, causing a collision with the already in-use numbers. While such desyncs between stored data and in-memory data should not happen in the first place and require fixes in LINSTOR's code, one reason why there are no additional safeguards against corruption of the data in K8s CRD is that CRD is not a database. LINSTOR is able to use various databases, and those have many internal constraints that enforce consistency and prevent corruption, such as duplicate entries, dangling references, out of range values, or invalid data. CRD does not have these capabilities, and while it may be more convenient to use in combination with Kubernetes, the use of a real database results in much higher robustness against various sorts of corruption or inconsistency.
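As a small illustration of the kind of constraint a real database enforces (a sketch only; the schema and names here are invented for the example and are not LINSTOR's actual schema), SQLite with foreign-key enforcement enabled rejects a delete that would leave a dangling reference, whereas a CRD store has no such mechanism:

```python
import sqlite3

# Toy schema, loosely modeled on the situation in this issue;
# table and column names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
con.execute("CREATE TABLE storage_pools (name TEXT PRIMARY KEY)")
con.execute("""
    CREATE TABLE resources (
        name TEXT PRIMARY KEY,
        pool TEXT NOT NULL REFERENCES storage_pools(name)
    )
""")

con.execute("INSERT INTO storage_pools VALUES ('pool1')")
con.execute("INSERT INTO resources VALUES ('vol1', 'pool1')")  # valid reference

# Deleting the pool while a resource still references it fails,
# so a dangling reference can never be persisted.
try:
    con.execute("DELETE FROM storage_pools WHERE name = 'pool1'")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

With a CRD backend, the equivalent of that `DELETE` simply succeeds, and the orphaned `resources` entries are exactly the corrupted state described in this issue.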
I've run into similar issues regarding corrupted state
> LINSTOR is able to use various databases, and those have many internal constraints that enforce consistency and prevent corruption, such as duplicate entries, dangling references, out of range values, or invalid data. CRD does not have these capabilities, and while it may be more convenient to use in combination with Kubernetes, the use of a real database results in much higher robustness against various sorts of corruption or inconsistency.
Based on @raltnoeder's reply: is it recommended to use an etcd, MariaDB, or PostgreSQL database for production setups?
PostgreSQL is the most capable and well-tested; it can even run modifications of the database's structure (like adding or removing columns from a table, or changing constraints) inside a transaction. etcd is functionally comparable to CRD: it's not a database.
This is a continuation of https://github.com/piraeusdatastore/piraeus-operator/issues/397.
Deleting resource definitions caused an internal state corruption with errors like `Access to deleted volume`. Most commands, except a few (`node list`), do not work.

My setup is 1 controller using k8s CRDs as storage with 3 satellites, using the following piraeus-operator configuration:
Shell log leading up to the incident:
Error report:
Controller logs:
Unfortunately, restarting the controller as suggested in https://github.com/piraeusdatastore/piraeus-operator/issues/397 did not fix the issue. The first restart led to a new error `Access to deleted storage pool` when calling `linstor r list`, and the second restart led to a controller crash loop.

I'm not sure how to check error reports when the linstor server is in a crash loop. Are they stored in CRDs?
Let me know if there is any additional information that you need.