gardener / etcd-backup-restore

Collection of components to backup and restore the etcd of a Kubernetes cluster.

[BUG] Error when I deploy with Helm chart #784

Open Federico-Baldan opened 1 week ago

Federico-Baldan commented 1 week ago

Hi, when I deploy using the Helm chart with these values:

etcdBackupRestore:
  repository: europe-docker.pkg.dev/gardener-project/releases/gardener/etcdbrctl
  tag: v0.30.1
  pullPolicy: IfNotPresent

it gives me this error:

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "etcdbrctl": executable file not found in $PATH: unknown

This is my values.yaml:

images:
  # etcd image to use
  etcd:
    repository: europe-docker.pkg.dev/gardener-project/public/gardener/etcd
    tag: v3.4.13-bootstrap-1
    pullPolicy: IfNotPresent
  # etcd-backup-restore image to use
  etcdBackupRestore:
    repository: europe-docker.pkg.dev/gardener-project/releases/gardener/etcdbrctl
    tag: v0.30.1
    pullPolicy: Always

resources:
  etcd:
    limits:
      cpu: 100m
      memory: 1Gi
    requests:
      cpu: 100m
      memory: 128Mi
  backup:
    limits:
      cpu: 100m
      memory: 1Gi
    requests:
      cpu: 23m
      memory: 128Mi

servicePorts:
  client: 2379
  server: 2380
  backupRestore: 8080

storageCapacity: 20Gi

# autoCompaction defines the specification to be used by Etcd as well as by embedded-Etcd of backup-restore sidecar during restoration.
# auto-compaction mode for etcd and embedded-Etcd: 'periodic' mode or 'revision' mode.
# auto-compaction retention length for etcd as well as for embedded-Etcd of backup-restore sidecar.
autoCompaction:
  mode: periodic
  retentionLength: "30m"

backup:
  # schedule is cron standard schedule to take full snapshots.
  schedule: "0 */1 * * *"

  # deltaSnapshotPeriod is the period after which a delta snapshot will be persisted. If this value is set to less than 1 second, delta snapshotting will be disabled.
  deltaSnapshotPeriod: "60s"
  # deltaSnapshotMemoryLimit is memory limit in bytes after which delta snapshots will be taken out of schedule.
  deltaSnapshotMemoryLimit: 104857600 #100MB

  # defragmentationSchedule is the schedule on which the etcd data will be defragmented. Value should follow standard cron format.
  defragmentationSchedule: "0 0 */3 * *"

  # garbageCollectionPolicy mentions the policy for garbage collecting old backups. Allowed values are Exponential(default), LimitBased.
  garbageCollectionPolicy: Exponential
  # maxBackups is the maximum number of backups to keep (may change in future). This is honoured only in the case when garbageCollectionPolicy is set to LimitBased.
  maxBackups: 7
  # garbageCollectionPeriod is the time period after which old snapshots are periodically garbage-collected
  garbageCollectionPeriod: "1m"

  etcdConnectionTimeout: "30s"
  etcdSnapshotTimeout: "8m"
  etcdDefragTimeout: "8m"
  # etcdQuotaBytes is used to raise alarms when the backend DB size exceeds the given quota bytes
  etcdQuotaBytes: 8589934592 #8GB

  # storageContainer is name of the container or bucket name used for storage.
  # Directory name in case of local storage provider.
  storageContainer: ""

  # storageProvider indicates the type of backup storage provider.
  # Supported values are ABS,GCS,S3,Swift,OSS,ECS,Local, empty means no backup.
  storageProvider: S3

  # compression defines the specification to compress the snapshots (full as well as delta).
  # It only supports 3 compression policies: gzip (default), zlib, lzw.
  compression:
    enabled: true
    policy: "zlib"

  # failBelowRevision indicates the revision below which the validation of etcd will fail and restore will not be triggered in case
  # there is no snapshot on configured backup bucket.
  # failBelowRevision: 100000

  # Please uncomment the following section based on the storage provider.
  s3:
    region: 
    secretAccessKey: 
    accessKeyID: 
  #   sseCustomerKey: aes-256-sse-customer-key # optional
  #   sseCustomerAlgorithm: aes-256-sse-customer-algorithm # optional
  # gcs:
  #   serviceAccountJson: service-account-json-with-object-storage-privileges
  #   storageAPIEndpoint: endpoint-override-for-storage-api # optional
  #   emulatorEnabled: boolean-flag-to-configure-etcdbr-to-use-gcs-emulator # optional
  # abs:
  #   storageAccount: storage-account-with-object-storage-privileges
  #   storageKey: storage-key-with-object-storage-privileges
  #   domain: non-default-domain-for-object-storage-service
  #   emulatorEnabled: boolean-flag-to-enable-e2e-tests-to-use-azure-emulator # optional
  # swift:
  #   authURL: identity-server-url
  #   domainName: domain-name
  #   username: username-with-object-storage-privileges
  #   password: password
  #   tenantName: tenant-name
  #   regionName: region-name
  # oss:
  #   endpoint: oss-endpoint-url
  #   accessKeySecret: secret-access-key-with-object-storage-privileges
  #   accessKeyID: access-key-id-with-object-storage-privileges
  # ecs:
  #   endpoint: ecs-endpoint-url
  #   secretAccessKey: secret-access-key-with-object-storage-privileges
  #   accessKeyID: access-key-id-with-object-storage-privileges
  #   disableSsl: "false"         # optional
  #   insecureSkipVerify: "false" # optional

# etcdAuth field contains the pre-created username-password pair
# for etcd. Comment out this whole section if you don't want to use
# password-based authentication for etcd.
etcdAuth: {}
  # username: username
  # password: password

etcdTLS: {}
#   caBundle: |
#         -----BEGIN CERTIFICATE-----
#         ...
#         -----END CERTIFICATE-----
#   crt: |
#         -----BEGIN CERTIFICATE-----
#         ...
#         -----END CERTIFICATE-----
#   key: |
#         -----BEGIN RSA PRIVATE KEY-----
#         ...
#         -----END RSA PRIVATE KEY-----

# backupRestoreTLS field contains the pre-created secrets for the backup-restore server.
# Comment out this whole section if you don't want to use TLS for the backup-restore server.
backupRestoreTLS: {}
#   caBundle: |
#         -----BEGIN CERTIFICATE-----
#         ...
#         -----END CERTIFICATE-----
#   crt: |
#         -----BEGIN CERTIFICATE-----
#         ...
#         -----END CERTIFICATE-----
#   key: |
#         -----BEGIN RSA PRIVATE KEY-----
#         ...
#         -----END RSA PRIVATE KEY-----

# podAnnotations that will be passed to the resulting etcd pod
podAnnotations: {}

renormalize commented 1 week ago

Unfortunately, the Helm charts of etcd-backup-restore are not very up to date. This is because most consumers of etcd-backup-restore use it along with the etcd operator, etcd-druid, and with etcd-wrapper, a wrapper on top of the etcd image.

@anveshreddy18 do you have a more up to date Helm chart that can be merged into master? If so, we can just update the chart and solve @Federico-Baldan's issue.
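
In the meantime, one quick way to see what the etcdbrctl image itself ships as its entrypoint is to inspect it locally. This is only a debugging sketch: it assumes Docker is available on your machine and uses the image reference from the values pasted above.

docker pull europe-docker.pkg.dev/gardener-project/releases/gardener/etcdbrctl:v0.30.1
# print the entrypoint and default command baked into the image
docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' europe-docker.pkg.dev/gardener-project/releases/gardener/etcdbrctl:v0.30.1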

anveshreddy18 commented 1 week ago

@anveshreddy18 do you have a more up to date Helm chart that can be merged into master?

@renormalize I had previously updated the charts only enough for the integration tests to run; I remember I didn't update the TLS volumes. For consumption by the community, I think these charts need a proper update. I will test it out and post updates in this thread.

@Federico-Baldan I will see what I can do for single-node etcd now. For multi-node, etcd-druid is the recommended way to go because of its scale-up feature.

Federico-Baldan commented 1 week ago

@anveshreddy18 Hi, if you could send me an updated Helm chart, I would really appreciate it. I tried installing etcd-druid using a Helm chart, but I encountered the following error. Are these charts up to date?

{"level":"info","ts":"2024-10-01T07:39:27.696Z","logger":"druid","msg":"Etcd-druid build information","Etcd-druid Version":"v0.22.7","Git SHA":"9fc1162c"}
{"level":"info","ts":"2024-10-01T07:39:27.696Z","logger":"druid","msg":"Golang runtime information","Version":"go1.21.4","OS":"linux","Arch":"amd64"}
unknown flag: --metrics-port

Thank you so much in advance for your help. It would be a great favor!

renormalize commented 1 week ago

@Federico-Baldan could you check out the tagged version v0.22.7 of etcd-druid, and then try deploying the charts? The master branch is not fully consumable in its current state as some refactoring is happening right now.

There is one tiny change in values.yaml you will have to do (after checking out to v0.22.7) to get etcd-druid up and running properly.

charts/druid/values.yaml:

crds:
  enabled: true
image:
  repository: europe-docker.pkg.dev/gardener-project/public/gardener/etcd-druid
  tag: v0.22.7
  imagePullPolicy: IfNotPresent
replicas: 1
ignoreOperationAnnotation: false

The charts currently use the latest image tag instead of the actual image tag in the release commits. This is a limitation we've yet to fix. The artifact registry we currently host our images on does not support the latest image tag.

Once you have etcd-druid up and running, you can simply create an etcd cluster using the sample config/samples/druid_v1alpha1_etcd.yaml.
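
For reference, the overall flow could look roughly like this. It is only a sketch: the release name and namespace are placeholder assumptions, while the chart and sample paths are the ones referenced above.

# check out the tagged release of etcd-druid
git clone https://github.com/gardener/etcd-druid.git
cd etcd-druid
git checkout v0.22.7

# adjust charts/druid/values.yaml as shown above, then install the chart
helm install etcd-druid charts/druid --namespace default

# create an etcd cluster from the bundled sample
kubectl apply -f config/samples/druid_v1alpha1_etcd.yaml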

Consuming etcd-backup-restore along with etcd-druid will be more approachable. The other maintainers and I would recommend this over direct consumption of etcd-backup-restore unless you're quite familiar with these components, and wouldn't mind wrestling with configuration.

You can also take a look at #725 (not sure how relevant it will be for you) since it contains a lot of information we've not yet been able to get into the docs.


The reason why you're seeing the unknown flag: --metrics-port log is that a few flags have been renamed (the older flags are still supported, just not recommended and have been removed from the example charts).

Thus these charts can't be used by simply changing the image to the tag that you want to use. You can take a look at the charts on master and v0.22.7 and adapt the charts in v0.22.7 to your needs.
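
If it helps, a quick way to see how the charts differ between the two references (assuming a local clone of etcd-druid) is:

# compare the chart directory between the tagged release and master
git diff v0.22.7 master -- charts/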

Federico-Baldan commented 1 week ago

@renormalize Thank you very much! Everything is working correctly now. Sorry for the trouble.

It would be helpful for everyone if you could also add a file in the "config/samples/" directory with instructions on how to perform a restore.

Have a great day!

Federico-Baldan commented 1 week ago

@renormalize, sorry to bother you again... I've deployed Helm chart version v0.22.7, and the etcd-druid pod is running fine. However, when I deploy the sample "config/samples/druid_v1alpha1_etcd.yaml", it doesn't create the 3 backup pods for etcd.

I can see the Etcd custom resource in Kubernetes, but nothing happens after that.

I understand this might be related to the backup-restore functionality. Could you take a look and help me figure this out? Thanks!

renormalize commented 1 week ago

@Federico-Baldan If you have deployed an Etcd CR as configured by config/samples/druid_v1alpha1_etcd.yaml, with backups enabled - you have already deployed an etcd-backup-restore container!

Deploying an Etcd CR deploys the etcd container and an instance of etcd-backup-restore as a sidecar container in one go. You don't see 3 separate backup pods for etcd-backup-restore because each instance you expect is already running as a sidecar to etcd. Thus the total number of pods will only be 3 (or 1 if you're running a single-node etcd).

You can check the containers running in the etcd-test-0 pod with a describe:

kubectl describe pod etcd-test-0

and you will see output similar to this:

Name:             etcd-test-0
Namespace:        default
Priority:         0
...
Containers:
  etcd:
...
    Args: # this is the etcd container 
      start-etcd
      --backup-restore-host-port=etcd-test-local:8080
      --etcd-server-name=etcd-test-local
...
  backup-restore:
...
    Args:
      server # this is the etcd-backup-restore container started with the server command
      --defragmentation-schedule=0 */24 * * *
...

If you have yq, you can directly query the names of the containers with this simple command:

etcd-druid git:(master) ✗ kubectl get pod etcd-test-0 -oyaml | yq '.spec.containers[].name'
etcd
backup-restore

How do you restore the etcd cluster from your backups?

It happens automatically!

As long as you have backups enabled, etcd-backup-restore will keep backing up your etcd, and if for some reason the etcd goes down, it will automatically perform a restoration and etcd is started right back up again!

To understand more about how this happens, you can read how the server command of etcd-backup-restore functions in the docs.
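
If you want to watch this in action, one simple way (using the pod and container names from the describe output above) is to follow the sidecar's logs:

# follow the backup-restore sidecar to watch snapshotting and, if needed, restoration activity
kubectl logs etcd-test-0 -c backup-restore -f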


I implore you to go through the docs of both etcd-druid and etcd-backup-restore, and if there are any sections which are lacking, please feel free to point them out in this issue! We will try to improve the docs there, and feel free to raise PRs yourself if you find something lacking.

Federico-Baldan commented 1 week ago

Hi @renormalize, regarding the restore: my mistake, I found the documentation and understand it now.

However, for the backup: when I deploy the config/samples/druid_v1alpha1_etcd.yaml, nothing happens. I don’t see any etcd-test-0 pod or anything related to the backup being created.

I’m not sure why this is happening. I kept all the default settings from the chart and the YAML file.

Could you help me troubleshoot this?

renormalize commented 1 week ago

@Federico-Baldan I don't have enough information to troubleshoot; if you could give me a detailed list of the steps you took, I could help.

anveshreddy18 commented 1 week ago

Hi @Federico-Baldan, since you are running an older version, there is one additional step you need to take for etcd-druid to deploy the StatefulSet and hence the pods. You just need to annotate the Etcd CR you deployed:

kubectl annotate etcd <Etcd-CR-name> gardener.cloud/operation="reconcile"

This will deploy the pods you are asking for, which take care of backup and restoration if a backup store is provided, as mentioned in this doc.
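
Once annotated, you can verify that reconciliation kicked in with something along these lines (the name etcd-test comes from the sample, so adjust it to the name of your Etcd CR; the StatefulSet sharing that name is an assumption):

# the Etcd CR, its StatefulSet, and the pods should show up once reconciliation is done
kubectl get etcd etcd-test
kubectl get statefulset etcd-test
kubectl get pods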

Note: This additional step is not required in the new release v0.23.0, but we would not recommend using this image yet as some fixes are currently underway.

As @renormalize has mentioned, you will get a better understanding once you refer to the docs. Feel free to comment if you need anything more :)

renormalize commented 1 week ago

@anveshreddy18 thanks for pointing this out! Should've been my first guess.

unmarshall commented 1 week ago

@Federico-Baldan In the v0.22.x versions we require the annotation gardener.cloud/operation="reconcile" to trigger a reconcile of the Etcd custom resource. We found this a bit inconvenient and changed the behavior in v0.23.x (which is currently under testing). In v0.23.x, creation does not require any explicit trigger for reconciliation; however, updates do.

We provide 2 modes to react to updates:

  1. Auto reconcile upon any change made to the Etcd resource. You can achieve this by starting druid with a CLI arg: --enable-etcd-spec-auto-reconcile if using v0.23.x and --ignore-operation-annotation when using v0.22. This is a bit risky in production where generally one prefers to update during maintenance windows to avoid any transient quorum loss.
  2. Explicitly trigger a reconcile by annotating Etcd resource with gardener.cloud/operation="reconcile". This ensures that while you can update the Etcd custom resource it will be reconciled when requested via an explicit trigger.

So you can choose the option depending on whether you are in dev or production mode of consumption, and of course on your appetite for risk :)
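
To make the two options concrete, here is a rough sketch (the deployment name etcd-druid is an assumption about your install; the flags and annotation are the ones named above):

# Option 1: auto reconcile - add the flag to the druid container args
#   --ignore-operation-annotation       (v0.22.x)
#   --enable-etcd-spec-auto-reconcile   (v0.23.x)
kubectl edit deployment etcd-druid

# Option 2: explicit trigger - annotate the Etcd resource whenever a reconcile is desired
kubectl annotate etcd etcd-test gardener.cloud/operation="reconcile"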

Federico-Baldan commented 1 week ago

Hi @renormalize, when I deploy config/samples/druid_v1alpha1_etcd.yaml I get this error on the backup-restore pod:

{"level":"warn","ts":"2024-10-03T10:40:16.601Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://etcd-main-local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""} time="2024-10-03T10:40:16Z" level=error msg="failed to get status of etcd endPoint: http://etcd-main-local:2379 with error: context deadline exceeded"

Could you help me?
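
While waiting, a couple of first checks might help narrow this down (a hedged sketch only; the pod and service names are guessed from the endpoint in the log, so adjust them to your setup). The error says the sidecar cannot reach etcd on port 2379, so it is worth confirming that the etcd container itself started and that the client service has endpoints:

# check whether the etcd container in the same pod started cleanly
kubectl logs etcd-main-0 -c etcd

# check that the client service has endpoints on port 2379
kubectl get endpoints etcd-main-local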

Federico-Baldan commented 3 days ago

@renormalize, any news? I'm still seeing:

{"level":"warn","ts":"2024-10-03T10:40:16.601Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://etcd-main-local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""} time="2024-10-03T10:40:16Z" level=error msg="failed to get status of etcd endPoint: http://etcd-main-local:2379/ with error: context deadline exceeded"