SSHOC / sshoc-marketplace-backend


Lock on database #383

Closed. KlausIllmayer closed this issue 11 months ago.

KlausIllmayer commented 1 year ago

Every now and then the deployment workflow fails to deploy the API backend because of a lock in the PostgreSQL database raised by Liquibase. The lock sits in the table public.databasechangeloglock. I can resolve the issue manually (see https://stackoverflow.com/a/19081612), but this is not ideal: without manual intervention the deployment will not work again. Even stopping the API backend container does not release the lock.

I wonder how the unlocking is handled in the backend. Shouldn't the lock be released when the backend is stopped? Since deployment sometimes works without problems, it seems there are situations where the lock remains behind. The deployment workflow on Kubernetes stops the pod running the API backend with the old code and in parallel creates a new pod with the new code. The logs of the new pod show that it checks for the lock and waits for it to be released, but sometimes that never happens. Kubernetes then deletes the new pod and restores the old one, so the code is not updated.

@tparkola Can you have a look at whether the current unlock mechanism is under-specified and could be extended to cover more scenarios?
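For reference, the current state of the lock can be inspected with the following query (the column names follow the standard Liquibase lock table layout; the schema and table name are our defaults):

SELECT id, locked, lockgranted, lockedby
FROM public.databasechangeloglock;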

KlausIllmayer commented 1 year ago

Just to add: it could also be that our setup is not as solid as it needs to be. If you think we should test our workflow again because the backend already does its best to prevent such locks, that would also be helpful feedback.

KlausIllmayer commented 1 year ago

Observed it again: it seems that the lock in databasechangeloglock is only produced in special situations. Maybe you can point me to these situations so that I understand better when it happens; that would help with debugging our workflow. I only noticed because the lock I deleted after the failed deployment was already two days old, and I wonder how such a lock could be created but never released.
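To spot such stale locks, a query like this (PostgreSQL syntax, again assuming the default Liquibase table) shows how long a lock has been held:

SELECT id, lockedby, lockgranted, now() - lockgranted AS lock_age
FROM public.databasechangeloglock
WHERE locked = TRUE;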

KlausIllmayer commented 1 year ago

Raised to critical as it can happen during normal operation. Every now and then K8s containers may restart due to external effects, e.g. an update of the K8s cluster itself. In theory users should not be bothered by this, because Kubernetes creates the new container in parallel and switches over to it before retiring the old one. But this could also be the cause of the lock: I have now observed twice on production that such a restart led to the database lock, which means the lock does not appear on every restart. Only after manual intervention were we able to unlock the database (for documentation, here is the manual database command):

UPDATE databasechangeloglock SET LOCKED=FALSE, LOCKGRANTED=null, LOCKEDBY=null where ID=1;
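A more defensive variant (my own suggestion, with an assumed threshold of ten minutes) only clears locks that are demonstrably stale, so a migration that is legitimately still running is not interrupted:

UPDATE databasechangeloglock
SET locked = FALSE, lockgranted = NULL, lockedby = NULL
WHERE id = 1
  AND locked = TRUE
  AND lockgranted < now() - interval '10 minutes';
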
KlausIllmayer commented 1 year ago

We are currently trying to raise the values for the readinessProbe in the K8s pod config. Another option is described here: https://www.liquibase.com/blog/using-liquibase-in-kubernetes
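For illustration, a sketch of the kind of change we made (the endpoint, port, and concrete numbers are assumptions, not our final settings): a longer initial delay and a higher failure threshold give Liquibase time to finish its migrations before Kubernetes declares the pod unhealthy and kills it mid-migration.

readinessProbe:
  httpGet:
    path: /actuator/health   # assumed health endpoint
    port: 8080               # assumed container port
  initialDelaySeconds: 60    # give Liquibase time to run migrations first
  periodSeconds: 10
  failureThreshold: 12       # tolerate roughly two more minutes of startup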

KlausIllmayer commented 11 months ago

Did some deploys and the error does not occur anymore. It seems to be solved now. If not, we will re-open the issue.