gardener / etcd-backup-restore

Collection of components to backup and restore the etcd of a Kubernetes cluster.
Apache License 2.0

[BUG] Backup-restore falsely detects a single member restoration as bootstrap case when snapstore is not configured in etcd cluster. #760

Closed ishan16696 closed 3 weeks ago

ishan16696 commented 1 month ago

Describe the bug: In one of our production clusters, the data-dir of an etcd member was somehow removed. With no snapstore configured, backup-restore failed to recognize this as a single member restoration scenario for the etcd pod and falsely detected it as a bootstrap case. As a result, the etcd-events-0 pod did not start up, because it failed to join the cluster due to a memberID mismatch.

❯ k get pods etcd-events-0
etcd-events-0                                          1/2     Running   0             2m14s  
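To make the distinction between the two cases concrete, here is a minimal Go sketch of a decision that does not depend on the snapstore. It is an illustration only, not the backup-restore code and not necessarily how the fix works. The type `MemberState`, the function `Classify`, and the `KnownToCluster` signal (the member's name already appearing in the member list of the running cluster) are all hypothetical names introduced for this example.

```go
package main

import "fmt"

// Scenario is the outcome of the hypothetical initialization check.
type Scenario int

const (
	NoAction Scenario = iota
	Bootstrap
	SingleMemberRestoration
)

func (s Scenario) String() string {
	switch s {
	case Bootstrap:
		return "bootstrap"
	case SingleMemberRestoration:
		return "single-member restoration"
	default:
		return "no action"
	}
}

// MemberState collects the inputs of the check. All fields are assumptions
// made for this sketch, not fields of any real backup-restore type.
type MemberState struct {
	DataDirPresent      bool // the etcd data-dir exists on disk
	SnapstoreConfigured bool // deliberately NOT consulted by Classify
	KnownToCluster      bool // assumed signal: member is already in the cluster's member list
}

// Classify separates a lost data-dir on an existing member from a genuine
// first start. The snapstore setting plays no role in the decision, so an
// unconfigured snapstore cannot turn a restoration into a bootstrap.
func Classify(m MemberState) Scenario {
	switch {
	case m.DataDirPresent:
		return NoAction
	case m.KnownToCluster:
		return SingleMemberRestoration
	default:
		return Bootstrap
	}
}

func main() {
	// The situation from this report: data-dir removed, no snapstore,
	// but the other members still know this member.
	lost := MemberState{DataDirPresent: false, SnapstoreConfigured: false, KnownToCluster: true}
	fmt.Println(Classify(lost))

	// A genuinely new cluster: nothing on disk, no existing membership.
	fresh := MemberState{DataDirPresent: false, SnapstoreConfigured: false, KnownToCluster: false}
	fmt.Println(Classify(fresh))
}
```

In a real setup, the `KnownToCluster` input would have to come from a healthy peer rather than from the local endpoint, since the local etcd is down at this point (see the "connection refused" entry in the logs).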

How To Reproduce (as minimally and precisely as possible):

  1. Start a 3-member etcd cluster with no snapstore configured.
  2. Start a debug container in the etcd-0 pod, then remove the data-dir completely.
  3. Kill the etcd container to restart it and trigger the restoration.

Logs: backup-restore logs of the etcd-events-0 pod (newest entries first):

2024-08-07 23:59:36 | {"log":"Served config for ETCD instance.","severity":"INFO"}
2024-08-07 23:59:36 | {"log":"checking the presence of a learner in a cluster...","severity":"INFO"}
2024-08-07 23:59:35 | {"log":{"attempt":0,"caller":"clientv3/retry_interceptor.go:62","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\"","level":"warn","msg":"retrying of unary invoker failed","target":"passthrough:///https://etcd-events-local:2379","ts":"2024-08-07T23:59:35.845Z"}}
2024-08-07 23:59:35 | {"log":"failed to get status of etcd endPoint: https://etcd-events-local:2379 with error: context deadline exceeded","severity":"ERR"}
2024-08-07 23:59:35 | {"log":"Updating status from Successful to New","severity":"INFO"}
2024-08-07 23:59:35 | {"log":"Responding to status request with: Successful","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Successfully initialized data directory for etcd.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Removing directory(/var/etcd/data/new.etcd) since snapstore is empty.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"storage provider name not specified","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Checking whether the backup bucket is empty or not...","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Validation mode: full","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Validation failBelowRevision: ","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Setting status to : 503","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Updating status from New to Progress","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Received start initialization request.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Responding to status request with: New","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"No snapstore storage provider configured.","severity":"WARN"}
2024-08-07 23:59:33 | {"log":"TLS enabled. Starting HTTPS server.","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Starting HTTP server at addr: :8080","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Checking if etcd is running","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Starting the http server...","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Registering the http request handlers...","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Setting status to : 503","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"compressionConfig:\\n  enabled: true\\n  policy: gzip\\ndefragmentationSchedule: 17 1 */3 * *\\netcdConnectionConfig:\\n  caFile: /var/etcd/ssl/client/ca/bundle.crt\\n  certFile: /var/etcd/ssl/client/client/tls.crt\\n  connectionTimeout: 5m0s\\n  defragTimeout: 15m0s\\n  endpoints:\\n  - https://etcd-events-local:2379\\n  keyFile: /var/etcd/ssl/client/client/tls.key\\n  serviceEndpoints:\\n  - https://etcd-events-client:2379\\n  snapshotTimeout: 15m0s\\nexponentialBackoffConfig:\\n  attemptLimit: 6\\n  multiplier: 2\\n  thresholdTime: 2m8s\\nhealthConfig:\\n  deltaSnapshotLeaseName: delta-snapshot-revisions\\n  fullSnapshotLeaseName: full-snapshot-revisions\\n  heartbeatDuration: 10s\\n  memberGCDuration: 1m0s\\n  memberLeaseRenewalEnabled: true\\nleaderElectionConfig:\\n  etcdConnectionTimeout: 5s\\n  reelectionPeriod: 5s\\nrestorationConfig:\\n  MaxRequestBytes: 10485760\\n  MaxTxnOps: 10240\\n  autoCompactionMode: periodic\\n  autoCompactionRetention: 30m\\n  dataDir: /var/etcd/data/new.etcd\\n  embeddedEtcdQuotaBytes: 8589934592\\n  initialAdvertisePeerURLs:\\n  - http://localhost:2380\\n  initialCluster: default=http://localhost:2380\\n  initialClusterToken: etcd-cluster\\n  maxCallSendMsgSize: 10485760\\n  maxFetchers: 6\\n  name: default\\n  tempDir: /var/etcd/data/restoration.temp\\nserverConfig:\\n  port: 8080\\n  server-cert: /var/etcd/ssl/client/server/tls.crt\\n  server-key: /var/etcd/ssl/client/server/tls.key\\nsnapshotterConfig:\\n  deltaSnapshotMemoryLimit: 104857600\\n  deltaSnapshotPeriod: 20s\\n  deltaSnapshotRetentionPeriod: 0s\\n  garbageCollectionPeriod: 12h0m0s\\n  garbageCollectionPolicy: Exponential\\n  maxBackups: 7\\n  schedule: 0 */1 * * *\\nsnapstoreConfig:\\n  container: \\\"\\\"\\n  maxParallelChunkUploads: 5\\n  minChunkSize: 5242880\\n  prefix: v2\\n  tempDir: /var/etcd/data/temp\\n","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Go OS/Arch: linux/amd64","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Go Version: go1.20.3","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Git SHA: 6a8f2198","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"etcd-backup-restore Version: v0.28.2","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"No snapstore storage provider configured. Will not start backup schedule.","severity":"WARN"}
2024-08-07 23:17:38 | {"log":"HTTPS server closed gracefully.","severity":"INFO"}
2024-08-07 23:17:38 | {"log":"Shutting down LeaderElection...","severity":"INFO"}

Anything else we need to know?: This issue can only occur for the 0th pod.

ishan16696 commented 1 month ago

/assign