Describe the bug:
It has been observed in one of our production clusters that when etcd's data-dir was somehow removed, backup-restore failed to detect this as a single-member restoration scenario for an etcd pod when no snapstore is configured, and instead falsely detected it as a bootstrap case. As a result, the etcd-events-0 pod fails to start up, since it cannot join the cluster due to a memberID mismatch.
❯ k get pods etcd-events-0
etcd-events-0 1/2 Running 0 2m14s
How To Reproduce (as minimally and precisely as possible):
Start a 3-member etcd cluster with no snapstore configured.
Attach a debug container to the etcd-0 pod, then remove the data-dir completely.
Kill the etcd container to trigger a restart and the restoration check.
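The reproduction steps above can be sketched as a shell session. This is a minimal sketch: the pod name and data-dir path are taken from this report's logs, while the namespace, debug image, and target container name are assumptions and may need adjusting for your setup.

```shell
# Assumption: the etcd container in the pod is named "etcd" and the
# pod runs in the current namespace; adjust flags as needed.

# 1. Attach an ephemeral debug container that shares the etcd
#    container's process namespace (--target), which lets us reach
#    its filesystem via /proc/<pid>/root.
kubectl debug -it etcd-0 --image=busybox --target=etcd -- sh

# 2. Inside the debug shell, remove the data-dir completely
#    (path taken from restorationConfig.dataDir in the logs below).
rm -rf /proc/1/root/var/etcd/data/new.etcd

# 3. Kill the etcd entrypoint (PID 1 in the shared process
#    namespace) so the container restarts and backup-restore runs
#    its initialization/restoration detection.
kill 1
```

After the restart, backup-restore should (incorrectly, per this bug) treat the wiped member as a bootstrap case rather than a single-member restoration.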
Logs:
backup-restore logs of etcd-events-0 pod:
2024-08-07 23:59:36 | {"log":"Served config for ETCD instance.","severity":"INFO"}
2024-08-07 23:59:36 | {"log":"checking the presence of a learner in a cluster...","severity":"INFO"}
2024-08-07 23:59:35 | {"log":{"attempt":0,"caller":"clientv3/retry_interceptor.go:62","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\"","level":"warn","msg":"retrying of unary invoker failed","target":"passthrough:///https://etcd-events-local:2379","ts":"2024-08-07T23:59:35.845Z"}}
2024-08-07 23:59:35 | {"log":"failed to get status of etcd endPoint: https://etcd-events-local:2379 with error: context deadline exceeded","severity":"ERR"}
2024-08-07 23:59:35 | {"log":"Updating status from Successful to New","severity":"INFO"}
2024-08-07 23:59:35 | {"log":"Responding to status request with: Successful","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Successfully initialized data directory for etcd.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Removing directory(/var/etcd/data/new.etcd) since snapstore is empty.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"storage provider name not specified","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Checking whether the backup bucket is empty or not...","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Validation mode: full","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Validation failBelowRevision: ","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Setting status to : 503","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Updating status from New to Progress","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Received start initialization request.","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"Responding to status request with: New","severity":"INFO"}
2024-08-07 23:59:34 | {"log":"No snapstore storage provider configured.","severity":"WARN"}
2024-08-07 23:59:33 | {"log":"TLS enabled. Starting HTTPS server.","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Starting HTTP server at addr: :8080","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Checking if etcd is running","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Starting the http server...","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Registering the http request handlers...","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Setting status to : 503","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"compressionConfig:\\n enabled: true\\n policy: gzip\\ndefragmentationSchedule: 17 1 */3 * *\\netcdConnectionConfig:\\n caFile: /var/etcd/ssl/client/ca/bundle.crt\\n certFile: /var/etcd/ssl/client/client/tls.crt\\n connectionTimeout: 5m0s\\n defragTimeout: 15m0s\\n endpoints:\\n - https://etcd-events-local:2379\\n keyFile: /var/etcd/ssl/client/client/tls.key\\n serviceEndpoints:\\n - https://etcd-events-client:2379\\n snapshotTimeout: 15m0s\\nexponentialBackoffConfig:\\n attemptLimit: 6\\n multiplier: 2\\n thresholdTime: 2m8s\\nhealthConfig:\\n deltaSnapshotLeaseName: delta-snapshot-revisions\\n fullSnapshotLeaseName: full-snapshot-revisions\\n heartbeatDuration: 10s\\n memberGCDuration: 1m0s\\n memberLeaseRenewalEnabled: true\\nleaderElectionConfig:\\n etcdConnectionTimeout: 5s\\n reelectionPeriod: 5s\\nrestorationConfig:\\n MaxRequestBytes: 10485760\\n MaxTxnOps: 10240\\n autoCompactionMode: periodic\\n autoCompactionRetention: 30m\\n dataDir: /var/etcd/data/new.etcd\\n embeddedEtcdQuotaBytes: 8589934592\\n initialAdvertisePeerURLs:\\n - http://localhost:2380\\n initialCluster: default=http://localhost:2380\\n initialClusterToken: etcd-cluster\\n maxCallSendMsgSize: 10485760\\n maxFetchers: 6\\n name: default\\n tempDir: /var/etcd/data/restoration.temp\\nserverConfig:\\n port: 8080\\n server-cert: /var/etcd/ssl/client/server/tls.crt\\n server-key: /var/etcd/ssl/client/server/tls.key\\nsnapshotterConfig:\\n deltaSnapshotMemoryLimit: 104857600\\n deltaSnapshotPeriod: 20s\\n deltaSnapshotRetentionPeriod: 0s\\n garbageCollectionPeriod: 12h0m0s\\n garbageCollectionPolicy: Exponential\\n maxBackups: 7\\n schedule: 0 */1 * * *\\nsnapstoreConfig:\\n container: \\\"\\\"\\n maxParallelChunkUploads: 5\\n minChunkSize: 5242880\\n prefix: v2\\n tempDir: /var/etcd/data/temp\\n","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Go OS/Arch: linux/amd64","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Go Version: go1.20.3","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"Git SHA: 6a8f2198","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"etcd-backup-restore Version: v0.28.2","severity":"INFO"}
2024-08-07 23:59:33 | {"log":"No snapstore storage provider configured. Will not start backup schedule.","severity":"WARN"}
2024-08-07 23:17:38 | {"log":"HTTPS server closed gracefully.","severity":"INFO"}
2024-08-07 23:17:38 | {"log":"Shutting down LeaderElection...","severity":"INFO"}
Screenshots (if applicable):
Environment (please complete the following information):
Etcd version/commit ID :
Etcd-backup-restore version/commit ID:
Cloud Provider [All/AWS/GCS/ABS/Swift/OSS]:
Anything else we need to know?:
This issue can only occur for the 0th pod.