duckhawk closed 1 year ago
@ghernadi sorry for the mention, but Andrew Kvapil (kvaps) said that you might be interested in this problem.
It looks like the problem is caused by the controller restarting during snapshot creation. It creates rollback.internal, then restarts, and after the restart it stays in CrashLoopBackOff permanently. After I apply rollback.internal, the DB state goes back to normal. Would it be possible to apply the rollback, if one exists, when linstor-controller starts?
Sorry for the delay, but it took me some time to reproduce this issue.
You mentioned that you enabled backup. Does that also include restoring from a backup once? Because that is the only way I was able to reproduce this issue...
A bit of background info: at the beginning of a transaction, Linstor creates a rollback object (which you apparently found). That rollback object contains the "to be changed" instances of other objects, including their UIDs from K8s. If anything goes wrong during the commit phase of the transaction, the corresponding K8s entries are replaced with the data from the rollback entries. In theory that works well.
However, during my database tests I like to get the database into a specific state, create a backup of that state, and then clean and restore the database from that backup for every test run, to verify that my fix really deals with the issue properly. The point is that my script implements this "clear + restore" step as "delete all K8s entries and (re-)apply the entries from my backup". Re-applying the data, however, causes K8s to generate new UIDs for the objects.
When Linstor then tries to roll back the instances, the entries from the rollback refer to the old UIDs, causing the mismatch shown in the error.
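To make the mechanism above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Linstor's code and not the Kubernetes API: `ToyStore`, `take_rollback`, `apply_rollback` and the `resource-a` entry are all hypothetical names made up for illustration. The only behaviour it assumes is the one described above: a store that assigns a fresh UID on every create and refuses a replace whose UID does not match.

```python
import uuid


class UidMismatch(Exception):
    """Raised when a replace refers to a UID the store no longer holds."""


class ToyStore:
    """Hypothetical stand-in for the K8s-backed database (illustration only)."""

    def __init__(self):
        self._objects = {}

    def create(self, name, spec):
        # Like K8s, the store (not the caller) picks a new UID on every create.
        uid = str(uuid.uuid4())
        self._objects[name] = {"uid": uid, "spec": dict(spec)}
        return uid

    def delete(self, name):
        del self._objects[name]

    def get(self, name):
        current = self._objects[name]
        return {"uid": current["uid"], "spec": dict(current["spec"])}

    def replace(self, name, uid, spec):
        current = self._objects[name]
        if current["uid"] != uid:
            raise UidMismatch(f"{name}: expected {uid}, found {current['uid']}")
        current["spec"] = dict(spec)


def take_rollback(store, names):
    # Snapshot the "to be changed" entries, including their UIDs.
    return {name: store.get(name) for name in names}


def apply_rollback(store, rollback):
    # Put the saved data back, addressed by the UIDs saved in the rollback.
    for name, saved in rollback.items():
        store.replace(name, saved["uid"], saved["spec"])


def demo():
    store = ToyStore()
    store.create("resource-a", {"state": "old"})
    rollback = take_rollback(store, ["resource-a"])

    # Normal case: the commit changes the entry, the rollback restores it.
    store.replace("resource-a", rollback["resource-a"]["uid"], {"state": "new"})
    apply_rollback(store, rollback)
    state_after_rollback = store.get("resource-a")["spec"]["state"]

    # "Clear + restore": delete and re-create the entry from a backup.
    backup_spec = store.get("resource-a")["spec"]
    store.delete("resource-a")
    store.create("resource-a", backup_spec)  # same data, new UID

    try:
        apply_rollback(store, rollback)  # rollback still holds the old UID
        mismatch = False
    except UidMismatch:
        mismatch = True
    return state_after_rollback, mismatch


if __name__ == "__main__":
    print(demo())
```

In this sketch the first rollback succeeds because the UID is unchanged, while the rollback after the delete and re-create fails, which is the same shape of failure as the UID mismatch in the error report.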
The reason I went into this much detail here is to make sure that you also had a similar situation. If not, I'd still be interested in how those UIDs could have changed.
Just to be clear, I am only asking if you executed a restore - not whether some error had occurred that made restoring necessary. It is quite possible that you created the backup during such a commit phase, which succeeded in the end, but starting the controller with such a restored database would have caused a rollback, also triggering this issue.
If I am right and you did restore the database in a similar way to how I did, I have a fix ready for the next release.
Yes, it looks like the problem is solved in 1.24, thank you.
Some time after enabling backup (I'm not sure that it caused the problem, but enabling backups was the only action taken), the linstor controller stopped, and now it can't start at all.
In the controller logs there is a database initialization error.
The error report shows a strange problem with UIDs.