gardener / etcd-backup-restore

Collection of components to backup and restore the etcd of a Kubernetes cluster.

[Feature] Backup-restore should take care of releasing the lock over etcd db file which stops new etcd process to come up #469

Open ishan16696 opened 2 years ago

ishan16696 commented 2 years ago

Feature (What you would like to be added):

  1. It has been observed that, at times, abrupt termination of the etcd container leaves the lock on the etcd db file held by the previous etcd process. This stops the new etcd process from coming up, as it cannot access the db file and is therefore unable to open the bolt db.

  2. In another case, abnormal termination of the etcd container leads to the database directory lock not being released, which causes backup-restore to hang while opening the database for verification when the etcd container restarts.

In both cases, backup-restore should detect this, intervene, and take care of releasing the lock over the bolt db (see the sketch below). This would also avoid the need for manual intervention.
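
Below is a minimal sketch of how such detection could look, assuming the go.etcd.io/bbolt package that etcd itself uses for its backend; the db path and the isDBLocked helper are illustrative only and not part of the current backup-restore code. bbolt's Open honours Options.Timeout, so failing to open the file within the timeout is a strong hint that another process still holds the file lock:

package main

import (
	"fmt"
	"time"

	bolt "go.etcd.io/bbolt"
)

// dbPath is an example path; backup-restore would derive it from its data-dir configuration.
const dbPath = "/var/etcd/data/new.etcd/member/snap/db"

// isDBLocked treats any failure to open the db within the timeout as the lock
// still being held. Opening read-only avoids modifying the file; bbolt returns
// an error if it cannot acquire the file lock before the timeout expires.
func isDBLocked(path string, timeout time.Duration) (bool, error) {
	db, err := bolt.Open(path, 0600, &bolt.Options{Timeout: timeout, ReadOnly: true})
	if err != nil {
		return true, err
	}
	_ = db.Close()
	return false, nil
}

func main() {
	if locked, err := isDBLocked(dbPath, 5*time.Second); locked {
		// This is where backup-restore would intervene, e.g. identify and
		// terminate the stale etcd process so that the lock gets released.
		fmt.Printf("db file still locked: %v\n", err)
		return
	}
	fmt.Println("db file lock is free")
}

Once the stale lock is confirmed, the actual intervention (releasing the lock so the new etcd process can start) is the part this feature request asks backup-restore to take over.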

Motivation (Why is this needed?): It has been observed that, at times, an abrupt crash/shutdown of the etcd container leaves the lock on the etcd db file held by the previous etcd process. Logs of etcd:

etcdserver: another etcd process is using "/var/etcd/data/new.etcd/member/snap/db" and holds the file lock, or loading backend file is taking >10 seconds

So far this scenario has been observed only rarely, but with multi-node etcd-backup-restore the chances of running into it increase.

Approach/Hint to implement the solution (optional): A pod restart resolves the issue.

ishan16696 commented 2 years ago

This issue was observed again when @seshachalam-yv was generating load on a multi-node etcd zonal cluster to test it. Initially all 3 members of the etcd cluster ran into this issue, but then @seshachalam-yv informed me that 2 out of 3 member pods recovered without any intervention.

etcd-main-0                                   1/2     Running            0          7m2s
etcd-main-1                                   2/2     Running            0          29m
etcd-main-2                                   2/2     Running            0          29m

But etcd-main-0 didn’t recover. Logs of the etcd container of the etcd-main-0 pod:

2022-06-06 13:03:22.806358 W | etcdserver: another etcd process is using "/var/etcd/data/new.etcd/member/snap/db" and holds the file lock, or loading backend file is taking >10 seconds
2022-06-06 13:03:22.806378 W | etcdserver: waiting for it to exit before starting...

More importantly, restarting the pod didn’t help.

$ etcdctl endpoint status --cluster -w table

+------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
|                                ENDPOINT                                |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX |             ERRORS             |
+------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+
| https://etcd-main-2.etcd-main-peer.shoot--ash-garden--mz-etcd.svc:2379/ | 5340dd534468d4c4 |  3.4.13 |  9.0 GB |     false |      false |       652 |     187943 |             187943 |  memberID:16771230989201417918 |
|                                                                        |                  |         |         |           |            |           |            |                    |                alarm:NOSPACE , |
|                                                                        |                  |         |         |           |            |           |            |                    |   memberID:5999038053357245636 |
|                                                                        |                  |         |         |           |            |           |            |                    |                alarm:NOSPACE , |
|                                                                        |                  |         |         |           |            |           |            |                    |   memberID:2016129565828661839 |
|                                                                        |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
| https://etcd-main-1.etcd-main-peer.shoot--ash-garden--mz-etcd.svc:2379/ | e8bf61ba155fa6be |  3.4.13 |  9.1 GB |      true |      false |       652 |     187943 |             187943 |  memberID:16771230989201417918 |
|                                                                        |                  |         |         |           |            |           |            |                    |                alarm:NOSPACE , |
|                                                                        |                  |         |         |           |            |           |            |                    |   memberID:5999038053357245636 |
|                                                                        |                  |         |         |           |            |           |            |                    |                alarm:NOSPACE , |
|                                                                        |                  |         |         |           |            |           |            |                    |   memberID:2016129565828661839 |
|                                                                        |                  |         |         |           |            |           |            |                    |                 alarm:NOSPACE  |
+------------------------------------------------------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------------------------------+

I observed that, as he was load-testing the multi-node etcd, the DB size had grown to ~9GB, while we usually set a quota limit of 8Gi (source). So I suggested that he run defragmentation on each etcd cluster member: etcdctl defrag --endpoints=... It succeeded for etcd-main-1 and etcd-main-2, but failed for etcd-main-0 as that member was already stuck in waiting. In the end I suggested deleting the member dir of etcd-main-0, scaling the etcd StatefulSet down to 0, and scaling it back up to 3. This solved the issue.
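
For reference, here is a minimal sketch of that manual recovery expressed with etcd's Go client instead of etcdctl; the endpoints, timeouts, and the omitted TLS setup are assumptions, not taken from the incident. Defragmentation is served by one member at a time (which is why it could not succeed for the stuck etcd-main-0), and once the db size is back under the quota the NOSPACE alarms still have to be disarmed before writes are accepted again:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Example endpoints; in the incident these were the three
	// etcd-main-{0,1,2} members. TLS configuration is omitted for brevity.
	endpoints := []string{
		"https://etcd-main-0.etcd-main-peer:2379",
		"https://etcd-main-1.etcd-main-peer:2379",
		"https://etcd-main-2.etcd-main-peer:2379",
	}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Defragment each member individually (same effect as
	// `etcdctl defrag --endpoints=<member>`); each call is served by exactly one endpoint.
	for _, ep := range endpoints {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
		_, err := cli.Defragment(ctx, ep)
		cancel()
		fmt.Printf("defragment %s: err=%v\n", ep, err)
	}

	// Disarm the raised alarms (same effect as `etcdctl alarm disarm`); an
	// empty AlarmMember disarms all alarms, including the NOSPACE ones above.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		fmt.Printf("alarm disarm failed: %v\n", err)
	}
}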

So I guess the issue occurred due to the large db size (which had grown beyond the quota limit), and boltdb might have had some difficulty opening the large db file, but I don’t have concrete evidence to prove this hypothesis.