K8s is running within Azure.
We have set up a 3 node etcd cluster and set 3 backups (hourly, daily, weekly) with backup directly to Azure blob storage.
What is observed:
Looking at the backup history in the Azure there are gaps in the backup cycle. These gaps are mostly visible with longer backup cycles.
When looked at etcd-backup-operator pod logs there are multiple restart events within timeframe of the missing backups. If I correctly understood restarts were happening due to etcd leader election or something like that.
To validate my suspicions I have set the following script to kill the backup operator pod and later only the container and set it via Cron to happen every 10 minutes. I have set the backup every 20 minutes. As a result backup was never done since 04:39 UTC time, when I started to experiment. Well after 6 restarts pod got into Error state. I will try to continue with less aggressive restart cron schedule to see if that has impact.
Expected result:
Backup is happening according to the schedule regardless of container restarts. Schedule timer should not be linked to container lifetime as container may die any time. Or is it a feature due to the way Kubernetes works?
Script:
#!/bin/bash
cd /root
date +"%Y %m %d - %H:%M" 2>&1 >> kill-operator.log
/usr/local/bin/kubectl -n tep-k8s-test-01 exec -c etcd-backup-operator $(/usr/local/bin/kubectl -n tep-k8s-test-01 get po -l name=etcd-backup-operator -o name) -- /bin/kill -5 1 2>&1 >> kill-operator.log
echo "----" 2>&1 >> kill-operator.log
Environment:
K8s is running within Azure. We have set up a 3 node etcd cluster and set 3 backups (hourly, daily, weekly) with backup directly to Azure blob storage.
What is observed: Looking at the backup history in the Azure there are gaps in the backup cycle. These gaps are mostly visible with longer backup cycles.
When looked at etcd-backup-operator pod logs there are multiple restart events within timeframe of the missing backups. If I correctly understood restarts were happening due to etcd leader election or something like that.
To validate my suspicions I have set the following script to kill the backup operator pod and later only the container and set it via Cron to happen every 10 minutes. I have set the backup every 20 minutes. As a result backup was never done since 04:39 UTC time, when I started to experiment. Well after 6 restarts pod got into Error state. I will try to continue with less aggressive restart cron schedule to see if that has impact.
Expected result:
Backup is happening according to the schedule regardless of container restarts. Schedule timer should not be linked to container lifetime as container may die any time. Or is it a feature due to the way Kubernetes works?
Script:
Edited backup schedule: