Federico-Baldan opened 1 week ago
Unfortunately, the Helm charts of etcd-backup-restore are not very up to date. This is because most consumers of etcd-backup-restore use it along with the etcd operator etcd-druid, and the wrapper on top of the etcd image etcd-wrapper.
@anveshreddy18 do you have a more up to date Helm chart that can be merged into master? If so we can just update the chart and solve @Federico-Baldan's issue.
> @anveshreddy18 do you have a more up to date Helm chart that can be merged into master?
@renormalize I had previously updated them only enough for the integration tests to run; I remember I didn't update the TLS volumes. For consumption by the community, I think we need a proper update of these charts. I will test it out and post updates in this thread.
@Federico-Baldan I will see what I can do for the single node etcd now, for multi-node, etcd-druid is the recommended way to go because of the scale-up feature.
@anveshreddy18 Hi, if you could send me an updated Helm chart, I would really appreciate it. I tried installing etcd-druid using a Helm chart, but I encountered the following error. Are these charts up to date?

    {"level":"info","ts":"2024-10-01T07:39:27.696Z","logger":"druid","msg":"Etcd-druid build information","Etcd-druid Version":"v0.22.7","Git SHA":"9fc1162c"}
    {"level":"info","ts":"2024-10-01T07:39:27.696Z","logger":"druid","msg":"Golang runtime information","Version":"go1.21.4","OS":"linux","Arch":"amd64"}
    unknown flag: --metrics-port
Thank you so much in advance for your help. It would be a great favor!
@Federico-Baldan could you check out the tagged version v0.22.7 of etcd-druid, and then try deploying the charts? The master branch is not fully consumable in its current state as some refactoring is happening right now.
There is one tiny change you will have to make in `charts/druid/values.yaml` (after checking out v0.22.7) to get etcd-druid up and running properly:

    crds:
      enabled: true
    image:
      repository: europe-docker.pkg.dev/gardener-project/public/gardener/etcd-druid
      tag: v0.22.7
    imagePullPolicy: IfNotPresent
    replicas: 1
    ignoreOperationAnnotation: false
The charts currently use the `latest` image tag instead of the actual image tag in the release commits. This is a limitation we've yet to fix. The artifact registry we currently host our images on does not support the `latest` image tag.
Once you have etcd-druid up and running, you can simply create an etcd cluster using the sample `config/samples/druid_v1alpha1_etcd.yaml`.
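For orientation, such a CR looks roughly like the following. This is a hedged, illustrative sketch based on my understanding of the `druid.gardener.cloud/v1alpha1` API, not a copy of the sample file; treat `config/samples/druid_v1alpha1_etcd.yaml` in your checkout as the authoritative reference:

```yaml
# Hedged sketch of an Etcd custom resource; field names reflect my
# understanding of the druid.gardener.cloud/v1alpha1 API and may differ
# from the actual sample file, which is authoritative.
apiVersion: druid.gardener.cloud/v1alpha1
kind: Etcd
metadata:
  name: etcd-test
  namespace: default
spec:
  replicas: 1        # use 3 for a highly available cluster
  backup: {}         # backup/store configuration goes here when enabled
```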
Consuming etcd-backup-restore along with etcd-druid will be more approachable. The other maintainers and I would recommend this over direct consumption of etcd-backup-restore unless you're quite familiar with these components, and wouldn't mind wrestling with configuration.
You can also take a look at #725 (not sure how relevant it will be for you) since it contains a lot of information we've not yet been able to get into the docs.
The reason why you're seeing the `unknown flag: --metrics-port` log is that a few flags have been renamed (the older flags are still supported, just not recommended, and have been removed from the example charts).
Thus these charts can't be used by simply changing the image tag to the one you want. You can compare the charts on master and v0.22.7 and adapt the v0.22.7 charts to your needs.
@renormalize Thank you very much! Everything is working correctly now. Sorry for the trouble.
It would be helpful for everyone if you could also add a file in the `config/samples/` directory with instructions on how to perform a restore.
Have a great day!
@renormalize, sorry if I bother you again... I've deployed the Helm chart version v0.22.7, and the etcd-druid pod is running fine. However, when I deploy the sample `config/samples/druid_v1alpha1_etcd.yaml`, it doesn't create the 3 backup pods for etcd.
I can see the Etcd resource on the CRD in Kubernetes, but nothing happens after that.
I understand this might be related to the backup-restore functionality. Could you take a look and help me figure this out? Thanks!
@Federico-Baldan If you have deployed an `Etcd` CR as configured by `config/samples/druid_v1alpha1_etcd.yaml`, with backups enabled, you have already deployed an etcd-backup-restore container!

Deploying an `Etcd` CR deploys the etcd container and an instance of etcd-backup-restore as a sidecar container in one go. You're not able to see 3 backup pods of etcd-backup-restore because each instance you expect is already running as a sidecar to etcd. Thus the number of pods will only be 3 (or 1 if you're running a single-node etcd).
You can check the containers running in the `etcd-test-0` pod by describing it:

    kubectl describe pod etcd-test-0

and will see output similar to this:

    Name:         etcd-test-0
    Namespace:    default
    Priority:     0
    ...
    Containers:
      etcd:
        ...
        Args:                     # this is the etcd container
          start-etcd
          --backup-restore-host-port=etcd-test-local:8080
          --etcd-server-name=etcd-test-local
        ...
      backup-restore:
        ...
        Args:
          server                  # this is the etcd-backup-restore container started with the server command
          --defragmentation-schedule=0 */24 * * *
        ...
If you have `yq`, then you can directly query the names of the containers through this simple command:

    kubectl get pod etcd-test-0 -oyaml | yq '.spec.containers[].name'
    etcd
    backup-restore
How do you restore the etcd cluster from your backups?
It happens automatically!
As long as you have backups enabled, etcd-backup-restore will keep taking backups of your etcd, and if for some reason the etcd goes down, it will automatically perform a restoration and etcd is started right back up again!

To understand more about how this happens, you can read how the `server` command of etcd-backup-restore functions in the docs.
I implore you to go through the docs of both etcd-druid and etcd-backup-restore, and if there are any sections that are lacking, please point them out in this issue! We will try to improve the docs there; and feel free to raise PRs yourself if you find something lacking.
Hi @renormalize, regarding the restore, my mistake: I found the documentation and understand it now.
However, for the backup: when I deploy the config/samples/druid_v1alpha1_etcd.yaml, nothing happens. I don’t see any etcd-test-0 pod or anything related to the backup being created.
I’m not sure why this is happening. I kept all the default settings from the chart and the YAML file.
Could you help me troubleshoot this?
@Federico-Baldan I don't have enough information to troubleshoot; if you could give me a detailed list of the steps you took, I could help.
Hi @Federico-Baldan, since you are running an older version, there is one additional step you need to take for etcd-druid to deploy the statefulset, and hence the pods. You just need to annotate the `Etcd` CR you deployed by doing:

    kubectl annotate etcd <Etcd-CR-name> gardener.cloud/operation="reconcile"

This will deploy the pods you are asking for, which take care of backing up and restoration if a backup store is provided, as mentioned in this doc.
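For illustration, "providing a backup store" means filling in the `backup.store` section of the `Etcd` spec. The field names below reflect my understanding of the etcd-druid `v1alpha1` store spec and are a hedged sketch; the provider, bucket, and secret names are hypothetical placeholders:

```yaml
# Hedged sketch: enabling a backup store on an Etcd CR.
# Bucket and secret names are hypothetical placeholders; verify field
# names against the etcd-druid API docs for your version.
spec:
  backup:
    store:
      provider: S3                 # object store provider (assumed example)
      container: my-etcd-backups   # hypothetical bucket/container name
      prefix: etcd-test
      secretRef:
        name: etcd-backup-secret   # hypothetical secret holding credentials
```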
Note: This additional step is not required in the new release v0.23.0, but we would not recommend using this image yet as some fixes are currently underway.

As @renormalize has mentioned, you will get a better understanding once you refer to the docs. Feel free to comment if you need anything more :)
@anveshreddy18 thanks for pointing this out! Should've been my first guess.
@Federico-Baldan In the v0.22.x version we require the annotation `gardener.cloud/operation="reconcile"` to trigger a reconcile of the `Etcd` custom resource. We found this a bit inconvenient and changed this behavior in v0.23.x (which is currently under testing). In v0.23.x, creation does not require any explicit trigger for reconciliation; however, updates do.

We provide 2 modes to react to updates:

1. Automatically reconcile every change to the `Etcd` resource. You can achieve this by starting druid with a CLI arg: `--enable-etcd-spec-auto-reconcile` if using v0.23.x, and `--ignore-operation-annotation` when using v0.22. This is a bit risky in production, where generally one prefers to update during maintenance windows to avoid any transient quorum loss.
2. Explicitly annotate the `Etcd` resource with `gardener.cloud/operation="reconcile"`. This ensures that while you can update the `Etcd` custom resource, it will only be reconciled when requested via an explicit trigger.

So you can choose the option depending upon whether you are in dev or production mode of consumption, and of course your appetite for risk taking :)
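As a sketch of what the auto-reconcile mode looks like in practice, the flag would be passed to the druid container. Only the flag names below come from this discussion; the surrounding Deployment fields are an assumed, illustrative skeleton:

```yaml
# Illustrative excerpt of an etcd-druid Deployment; only the args are
# taken from the discussion above, the rest is an assumed skeleton.
spec:
  template:
    spec:
      containers:
        - name: etcd-druid
          args:
            # v0.23.x flag; on v0.22.x use --ignore-operation-annotation
            - --enable-etcd-spec-auto-reconcile
```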
Hi @renormalize, when I deploy the config/samples/druid_v1alpha1_etcd.yaml I have this error on the backup-restore pod:

    {"level":"warn","ts":"2024-10-03T10:40:16.601Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://etcd-main-local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
    time="2024-10-03T10:40:16Z" level=error msg="failed to get status of etcd endPoint: http://etcd-main-local:2379 with error: context deadline exceeded"

Could you help me?
@renormalize ? Any news?

    {"level":"warn","ts":"2024-10-03T10:40:16.601Z","caller":"clientv3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"passthrough:///http://etcd-main-local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
    time="2024-10-03T10:40:16Z" level=error msg="failed to get status of etcd endPoint: http://etcd-main-local:2379/ with error: context deadline exceeded"
Hi, when I deploy the Helm chart using this:

    etcdBackupRestore:
      repository: europe-docker.pkg.dev/gardener-project/releases/gardener/etcdbrctl
      tag: v0.30.1
      pullPolicy: IfNotPresent

it gives me this error:

    Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "etcdbrctl": executable file not found in $PATH: unknown
this is my values.yaml: