coreos / etcd-operator

etcd operator creates/configures/manages etcd clusters atop Kubernetes
https://coreos.com/blog/introducing-the-etcd-operator.html
Apache License 2.0

Etcd backup operator seems to miss schedule if operator pod/container is restarted #2152

Open elvinasp opened 4 years ago

elvinasp commented 4 years ago

Environment:

K8s is running in Azure. We have set up a 3-node etcd cluster and configured 3 backups (hourly, daily, weekly) that write directly to Azure Blob Storage.

What is observed: Looking at the backup history in Azure, there are gaps in the backup cycle. These gaps are mostly visible with the longer backup cycles.

When I looked at the etcd-backup-operator pod logs, there were multiple restart events within the timeframe of the missing backups. If I understood correctly, the restarts were happening due to etcd leader election or something similar.

To validate my suspicions, I wrote the following script to kill the backup operator pod, and later only the container, and scheduled it via cron to run every 10 minutes. I set the backup interval to 20 minutes. As a result, no backup has been made since 04:39 UTC, when I started the experiment. After 6 restarts the pod went into the Error state. I will continue with a less aggressive restart schedule to see whether that has an impact.

Expected result:

Backups happen according to the schedule regardless of container restarts. The schedule timer should not be tied to the container lifetime, as the container may die at any time. Or is this a feature due to the way Kubernetes works?
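
One way to picture a schedule that is not tied to the container lifetime is to derive the next run from a persisted timestamp, such as the Last Success Date in the EtcdBackup status, rather than from the moment the process started. The sketch below only illustrates that idea and is not how etcd-backup-operator is implemented. The function name seconds_until_next_backup and the example timestamps are hypothetical.

```shell
# Hypothetical helper: how long to wait before the next backup is due.
# $1 = last successful backup (Unix seconds)
# $2 = backup interval (seconds)
# $3 = current time (Unix seconds)
seconds_until_next_backup() {
  due=$(( $1 + $2 ))
  if [ "$3" -ge "$due" ]; then
    echo 0                  # deadline already passed: back up immediately
  else
    echo $(( due - $3 ))    # wait only for the remainder of the interval
  fi
}

# Restarted 300 s after the last backup, with a 1200 s interval:
seconds_until_next_backup 1000 1200 1300   # prints 900
# Restarted long after the deadline passed:
seconds_until_next_backup 1000 1200 5000   # prints 0
```

With this kind of calculation, a restart shortens or skips the wait instead of resetting it to a full interval.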

Script:

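#!/bin/bash

# Log a timestamp, then send signal 5 (SIGTRAP) to PID 1 of the backup operator container.
# Each command appends both stdout and stderr to kill-operator.log.
cd /root
date +"%Y %m %d - %H:%M" >> kill-operator.log 2>&1
/usr/local/bin/kubectl -n tep-k8s-test-01 exec -c etcd-backup-operator $(/usr/local/bin/kubectl -n tep-k8s-test-01 get po -l name=etcd-backup-operator -o name) -- /bin/kill -5 1 >> kill-operator.log 2>&1
echo "----" >> kill-operator.log 2>&1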

Edited backup schedule:

root@atl-cj1-m-ducx:~# kubectl  -n tep-k8s-test-01 describe  EtcdBackup etcd-cluster-backup-weekly
Name:         etcd-cluster-backup-weekly
Namespace:    tep-k8s-test-01
Labels:       <none>
Annotations:  <none>
API Version:  etcd.database.coreos.com/v1beta2
Kind:         EtcdBackup
Metadata:
  Creation Timestamp:  2020-01-15T07:54:50Z
  Finalizers:
    backup-operator-periodic
  Generation:        145
  Resource Version:  81580419
  Self Link:         /apis/etcd.database.coreos.com/v1beta2/namespaces/tep-k8s-test-01/etcdbackups/etcd-cluster-backup-weekly
  UID:               7dd4c2a7-e1e0-4fe1-ae04-100be7ff6d65
Spec:
  Abs:
    Abs Secret:  storage-account-credentials-weekly
    Path:        tep-k8s-test-01/etcd.backup
  Backup Policy:
    Backup Interval In Second:  1200
  Etcd Endpoints:
    http://etcd-cluster-client:2379
  Storage Type:  ABS
Status:
  Etcd Revision:      1098811
  Etcd Version:       3.4.3
  Last Success Date:  2020-01-27T04:39:09Z
  Succeeded:          true
Events:               <none>
root@atl-cj1-m-ducx:~# date
Mon Jan 27 09:05:37 UTC 2020
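
For reference, a rough calculation of how far behind schedule this is, using only the values shown above (Last Success Date, the date output, and Backup Interval In Second). Both timestamps fall on 2020-01-27 UTC, so seconds since midnight are enough. The variable names are illustrative only.

```shell
# Values copied from the output above.
last_success=$(( 4*3600 + 39*60 + 9 ))   # 04:39:09 UTC, Last Success Date
now=$(( 9*3600 + 5*60 + 37 ))            # 09:05:37 UTC, from `date`
interval=1200                            # Backup Interval In Second

elapsed=$(( now - last_success ))        # seconds since the last successful backup
missed=$(( elapsed / interval ))         # whole intervals elapsed (integer division)

echo "elapsed=${elapsed}s missed_intervals=${missed}"
# prints: elapsed=15988s missed_intervals=13
```

So roughly 4 hours 26 minutes passed without a backup, which is 13 full 20-minute intervals.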