FoundationDB / fdb-kubernetes-operator

A kubernetes operator for FoundationDB
Apache License 2.0

How to run backups with --secure-connection=0? #961

Closed funkypenguin closed 2 years ago

funkypenguin commented 2 years ago

Hey guys!

I'm deliberately trying to avoid using a secure connection when doing backups, since I'm backing up to a Minio object store within my cluster, and the whole bang-shoot is already encrypted with mTLS.

I've managed to manually make a successful backup by execing into an FDB container, and appending &sc=0 to my blobstore URL. How would I achieve this using the operator, since (at first glance) the operator runs backup_agent, which doesn't seem to include such an option?

My current hacky attempt (which may not even work) is simply to append &sc=0 to the name of the bucket I define in the FoundationDBBackup resource :)

Cheers! D

johscheuer commented 2 years ago

We have an issue for that: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/931. Feel free to work on it. Appending it to the bucket name should work; at least I'm not aware of anything that would block it, and it should still result in a valid blob store URL for FDB.
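
A minimal sketch of why appending the parameter to the bucket name still yields a valid URL. It assumes the operator simply concatenates the account name, backup name, and bucket without escaping; the log posted later in this thread is consistent with that, but the function and parameter names here are hypothetical and are not the operator's code:

```python
from urllib.parse import urlsplit, parse_qs

def blobstore_url(account: str, backup_name: str, bucket: str) -> str:
    # Hypothetical helper: plain concatenation, no escaping of the bucket value.
    return f"blobstore://{account}/{backup_name}?bucket={bucket}"

# Values taken from this thread; the bucket carries the extra "&sc=0" suffix.
url = blobstore_url("minio@minio-hl-istio-compatible:9000", "retort-fdb", "fdb-dev2&sc=0")

# Split the query string the way a generic URL parser would.
params = parse_qs(urlsplit(url).query)
print(url)
print(params)
```

Because the value is not escaped, the trailing &sc=0 is read as a second query parameter rather than as part of the bucket name.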

funkypenguin commented 2 years ago

Thanks @johscheuer - How does the FoundationDBBackup resource configure the backup_agent process? When I look at the pods created, I just see:

UID          PID    PPID  C STIME TTY          TIME CMD
nobody         1       0  0 Oct25 ?        00:01:24 backup_agent --log --logdir /var/log/fdb-trace-logs
nobody        66       0  0 06:59 pts/0    00:00:00 /bin/bash
nobody        74      66  0 06:59 pts/0    00:00:00 ps -ef

Where is the blob store URL passed to backup_agent?

Thanks! :) D

johscheuer commented 2 years ago

You can see it either in the operator logs when the blob store is configured, or by running fdbbackup status, which shows the blob store URL and the backup status.

johscheuer commented 2 years ago

We have some additional docs here: https://github.com/FoundationDB/fdb-kubernetes-operator/tree/master/config/tests/backup but we may want to add the required commands to the user manual or at least reference them.

funkypenguin commented 2 years ago

OK, some progress here... appending &sc=0 to the bucket name seems to be accepted by the operator, but it still produces an error, indicating that the fdbbackup command has been killed with exit code -1:

{"level":"info","ts":1635294282.8777318,"logger":"fdbclient","msg":"Running command","namespace":"preview-database-pr-209","cluster":"retort-fdb","path":"/usr/bin/fdb/6.3/fdbbackup","args":["/usr/bin/fdb/6.3/fdbbackup","start","-d","blobstore://minio@minio-hl-istio-compatible:9000/retort-fdb?bucket=fdb-dev2&sc=0","-s","864000","-z","-C","/tmp/500804990","--log","--logdir","/var/log/fdb"]}
{"level":"error","ts":1635294292.935042,"logger":"fdbclient","msg":"Error from FDB command","namespace":"preview-database-pr-209","cluster":"retort-fdb","code":-1,"stdout":"","stderr":"","error":"signal: killed","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/log.(*DelegatingLogger).Error\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/log/deleg.go:144\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).runCommand\n\t/workspace/fdbclient/admin_client.go:187\ngithub.com/FoundationDB/fdb-kubernetes-operator/fdbclient.(*cliAdminClient).StartBackup\n\t/workspace/fdbclient/admin_client.go:449\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.StartBackup.Reconcile\n\t/workspace/controllers/start_backup.go:45\ngithub.com/FoundationDB/fdb-kubernetes-operator/controllers.(*FoundationDBBackupReconciler).Reconcile\n\t/workspace/controllers/backup_controller.go:87\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:298\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.8.3/pkg/internal/controller/controller.go:216\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.20.4/pkg/util/wait/wait.go:99"}

If I exec into one of the backupagent containers and run the command though, it seems to work:

nobody@retort-fdb-backup-agents-5855bd7969-6hbnr:/$ fdbbackup start -d 'blobstore://minio@minio-hl-istio-compatible:9000/retort-fdb?bucket=fdb-dev2&sc=0' -s 864000
The backup on tag `default' was successfully submitted.
nobody@retort-fdb-backup-agents-5855bd7969-6hbnr:/$
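
One way to compare the operator's invocation with the manual one is to rebuild the exact command from the operator's "Running command" log entry, and to check how long the command ran before it was killed. The sketch below uses trimmed copies of the two log lines above (only the ts, msg, args, and error fields are kept); it only parses the log and makes no claim about how the operator itself behaves:

```python
import json
import shlex

# Trimmed copies of the two operator log lines from this thread.
start_line = (
    '{"ts":1635294282.8777318,"msg":"Running command",'
    '"args":["/usr/bin/fdb/6.3/fdbbackup","start","-d",'
    '"blobstore://minio@minio-hl-istio-compatible:9000/retort-fdb?bucket=fdb-dev2&sc=0",'
    '"-s","864000","-z","-C","/tmp/500804990","--log","--logdir","/var/log/fdb"]}'
)
error_line = '{"ts":1635294292.935042,"msg":"Error from FDB command","error":"signal: killed"}'

start = json.loads(start_line)
error = json.loads(error_line)

# Shell-quote each argument so the URL's "?" and "&" survive copy and paste.
command = shlex.join(start["args"])

# Seconds between the command starting and the error being logged.
elapsed = error["ts"] - start["ts"]

print(command)
print(f"killed after about {elapsed:.1f} s")
```

The rebuilt command differs from the manual run in its extra flags (-z, -C with a temporary cluster file, --log, --logdir) and in where it runs: the operator pod rather than a backup agent pod. The gap of roughly ten seconds between the two entries would be consistent with the process being killed by a timeout rather than failing by itself, but that is an inference from the timestamps, not something confirmed in this thread.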

How would I debug further?

Thanks! D

funkypenguin commented 2 years ago

Another oddity - neither my FDB storage pods nor the fdb-backup-agent pods have a /var/log/fdb folder. Where is this fdbbackup start process supposed to be executed from? :)

johscheuer commented 2 years ago

The best thing is to look at the trace logs in the operator to get some additional information about why the operator is not able to make that request. The default log dir for Pods created by the operator is /var/log/fdb-trace-logs. Since you're using Istio, maybe the operator is not able to talk to Minio?

funkypenguin commented 2 years ago

Thanks @johscheuer, I'll double-check that. To be clear, both fdb-operator and the backup-agent pods need to be able to talk to the target blobstore? Out of interest, why are we instructing fdbbackup to use /var/log/fdb if none of the pods involved have such a directory?

johscheuer commented 2 years ago

We should instruct the backup agent to use /var/log/fdb-trace-logs: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/master/internal/pod_models.go#L759. That's also what your output showed: https://github.com/FoundationDB/fdb-kubernetes-operator/issues/961#issuecomment-951619527. I think we only use /var/log/fdb for the operator, which should be adjusted for consistency.

funkypenguin commented 2 years ago

Hi guys, I've gone further into debugging this..

The docs suggest that it's necessary to configure the operator with the blobstore credentials and TLS keys, but there's no mechanism in the helm chart for adding volumeMounts or extra ENV vars to the operator, so how should this be achieved? In my particular case, I don't need the ENV vars for TLS since I'm explicitly disabling secure connections, but I'd imagine I would need the credentials...

To test, I hard-coded the credentials into the account portion of the blobstorage URL, and I seem to have progress!

(Albeit 0.01% doesn't seem super-exciting)

nobody@retort-fdb-backup-agents-86c6b59c64-5gt2k:/$ fdbbackup status
The backup on tag `default' is restorable but continuing to blobstore://minio:miniostorage@minio-hl-istio-compatible:9000/retort-fdb?bucket=fdb-dev&sc=0.
BackupUID: 4e2725cfe9a83eb398aeddd87fd94201
BackupURL: blobstore://minio:miniostorage@minio-hl-istio-compatible:9000/retort-fdb?bucket=fdb-dev&sc=0
Snapshot interval is 864000 seconds.  Current snapshot progress target is 0.01% (>100% means the snapshot is supposed to be done)

Details:
 LogBytes written - 52121
 RangeBytes written - 8046112
 Last complete log version and timestamp        - 510680106, 2021/10/27.21:39:37+0000
 Last complete snapshot version and timestamp   - 430972722, 2021/10/27.21:38:17+0000
 Current Snapshot start version and timestamp   - 431088969, 2021/10/27.21:38:17+0000
 Expected snapshot end version and timestamp    - 864431088969, 2021/11/06.21:38:17+0000
 Backup supposed to stop at next snapshot completion - No
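
The figures in this status output can be cross-checked with a little arithmetic. The sketch below assumes that FDB versions advance at roughly one million per second, and that the progress target is simply elapsed time divided by the snapshot interval, with the last complete log version standing in for "now"; the second assumption is a guess that happens to match the output, not a documented formula:

```python
from datetime import datetime, timedelta

VERSIONS_PER_SECOND = 1_000_000  # assumption: roughly 1e6 versions per second

# Values copied from the fdbbackup status output above.
snapshot_interval_s = 864000
snapshot_start_version = 431088969
last_log_version = 510680106
snapshot_start_time = datetime(2021, 10, 27, 21, 38, 17)

# Expected end of the snapshot, as a version and as a timestamp.
expected_end_version = snapshot_start_version + snapshot_interval_s * VERSIONS_PER_SECOND
expected_end_time = snapshot_start_time + timedelta(seconds=snapshot_interval_s)

# Assumed progress target: elapsed seconds as a percentage of the interval.
elapsed_s = (last_log_version - snapshot_start_version) / VERSIONS_PER_SECOND
progress_pct = 100 * elapsed_s / snapshot_interval_s

print(expected_end_version)     # 864431088969
print(expected_end_time)        # 2021-11-06 21:38:17
print(round(progress_pct, 2))   # 0.01
```

Under those assumptions, 0.01% is simply what about 80 seconds looks like against a ten-day (864000 second) snapshot interval, so the low number reflects the interval chosen with -s rather than a stalled backup.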

Would you like a PR to add arbitrary volumeMounts/ENV vars to the operator deployment? :)

D

johscheuer commented 2 years ago

I think a PR to allow setting volumeMounts and env makes sense 👍

johscheuer commented 2 years ago

I'm going ahead and closing this issue since the actual question was resolved. Feel free to create a PR with the arbitrary volumeMounts/ENV option, and reopen the issue if there are any other questions.