CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Backup to S3 using IRSA does not work #4006

Open jeffgus opened 1 month ago

jeffgus commented 1 month ago

Overview

I'm unable to get the backup to S3 to work with a service account and IAM role (IRSA).

Environment

Steps to Reproduce

Create an IAM role in AWS with a trust relationship. Make sure that the ServiceAccounts are annotated. Set repo2-s3-key-type = web-id, and set the bucket name, region, and endpoint.

I set s3.conf to be:

[global]
repo2-retention-full = 14
repo2-retention-full-type = time
repo2-s3-key-type = web-id

I'm not sure if these settings belong in the s3.conf file or the main config file. I've tried both.

EXPECTED

pgBackRest should be able to find the token to communicate with the S3 bucket.

ACTUAL

I get one of two errors. The first says that the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE env vars are missing. If I override the metadata for all ServiceAccounts and edit the repo-host StatefulSet to set the serviceAccountName, that error goes away. It is replaced with:

command terminated with exit code 29: ERROR: [029]: unable to find child 'AssumeRoleWithWebIdentityResult':0 in node 'ErrorResponse'

Logs

command terminated with exit code 31: ERROR: [031]: option 'repo2-s3-key-type' is 'web-id' but 'AWS_ROLE_ARN' and 'AWS_WEB_IDENTITY_TOKEN_FILE' are not set

or

command terminated with exit code 29: ERROR: [029]: unable to find child 'AssumeRoleWithWebIdentityResult':0 in node 'ErrorResponse'
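The wording of error 029 suggests that STS answered with an ErrorResponse document rather than a successful AssumeRoleWithWebIdentity response, and pgBackRest does not print the error code inside it. A minimal sketch for surfacing that code is below. It assumes the global endpoint https://sts.amazonaws.com/ is the right one to probe and is reachable from the pod, and that a Python interpreter is available wherever it is run (neither is confirmed in this thread). The session name irsa-debug and the function names are made up for illustration.

```python
import os
import urllib.error
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_sts_url(role_arn, token, endpoint="https://sts.amazonaws.com/",
                  session_name="irsa-debug"):
    """Build an unsigned AssumeRoleWithWebIdentity request URL.

    The endpoint and session name are assumptions for this sketch.
    """
    query = urllib.parse.urlencode({
        "Action": "AssumeRoleWithWebIdentity",
        "Version": "2011-06-15",
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "WebIdentityToken": token,
    })
    return endpoint + "?" + query


def parse_sts_error(xml_text):
    """Return (code, message) from an STS ErrorResponse body, else None."""
    root = ET.fromstring(xml_text)
    # Drop XML namespaces so lookups work whether or not one is declared.
    for elem in root.iter():
        elem.tag = elem.tag.split("}", 1)[-1]
    if root.tag != "ErrorResponse":
        return None
    return (root.findtext("Error/Code"), root.findtext("Error/Message"))


def main():
    role_arn = os.environ.get("AWS_ROLE_ARN")
    token_file = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if not role_arn or not token_file:
        print("AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE not set; nothing to probe")
        return
    with open(token_file) as fh:
        token = fh.read().strip()
    try:
        with urllib.request.urlopen(build_sts_url(role_arn, token)) as resp:
            print("STS accepted the token, HTTP", resp.status)
    except urllib.error.HTTPError as err:
        body = err.read().decode("utf-8", "replace")
        # Fall back to the raw body if it is not an ErrorResponse document.
        print("STS rejected the request:", parse_sts_error(body) or body)


if __name__ == "__main__":
    main()
```

Run inside the same container that executes pgbackrest, this prints the STS error code and message (for example an access-denied style code) that the pgBackRest output leaves out.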

Additional Information

This is similar to #3135 and #3472, but these issues are old and things have changed.

I tried to tweak the role trust relationship rule and it doesn't seem to make a difference. I can run a container with awscli with the same serviceAccount and it works fine.
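When checking a trust relationship, it can help to compare its conditions against the claims in the projected token itself. Kubernetes service account tokens are JWTs whose sub claim has the form system:serviceaccount:<namespace>:<name>. The sketch below decodes the claims without verifying the signature, so it is for inspection only. The namespace and ServiceAccount names in the usage lines are hypothetical placeholders, not values taken from this issue.

```python
import base64
import json
import os


def decode_jwt_claims(token):
    """Decode the payload of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def expected_subject(namespace, service_account):
    """Subject string Kubernetes puts in a service account token."""
    return f"system:serviceaccount:{namespace}:{service_account}"


if __name__ == "__main__":
    token_file = os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    if token_file:
        with open(token_file) as fh:
            claims = decode_jwt_claims(fh.read().strip())
        # Compare these with the Condition block of the role's trust policy.
        for key in ("iss", "sub", "aud"):
            print(key, "=", claims.get(key))
        # Hypothetical names; substitute the real namespace and ServiceAccount.
        print("expected sub =", expected_subject("my-namespace", "my-serviceaccount"))
```

If the printed sub belongs to a different ServiceAccount than the one named in the trust policy, an awscli test pod could succeed while the pod that actually runs pgbackrest is rejected.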

I can also try to run pgbackrest on the repo-node manually. It fails to back up properly (which is expected), but it DOES communicate with S3 and creates the backup.info file.

What is the correct configuration for this to work?

jvincze84 commented 4 weeks ago

Hi, we have the same (or a similar) issue. We are running OKD on AWS. OKD Version: 4.15.0-0.okd-2024-03-10-010116

Log:

time="2024-10-29T13:20:06Z" level=info msg="crunchy-pgbackrest starts"
time="2024-10-29T13:20:06Z" level=info msg="debug flag set to false"
time="2024-10-29T13:20:06Z" level=info msg="backrest backup command requested"
time="2024-10-29T13:20:06Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1]"
time="2024-10-29T13:20:07Z" level=info msg="output=[]"
time="2024-10-29T13:20:07Z" level=info msg="stderr=[ERROR: [031]: option 'repo1-s3-key-type' is 'web-id' but 'AWS_ROLE_ARN' and 'AWS_WEB_IDENTITY_TOKEN_FILE' are not set\n]"
time="2024-10-29T13:20:07Z" level=fatal msg="command terminated with exit code 31"

But these environment variables are in place. I checked in a debug pod. I also checked the web identity token and role with the AWS CLI, and I was able to upload files to the bucket.

Can somebody help? It seems that the error message is misleading and that there are other issues behind the scenes. But without a proper log message we cannot continue debugging.

Thanks, Jvincze84

jeffgus commented 3 weeks ago

I think the issue is how the backup runs. When I set the annotation, the CronJob runs with AWS_ROLE_ARN, etc. set. When I remove the "volume" from the S3 repo definition, the operator complains:

Stanza not created for \"repo2\" as specified for a scheduled backup

I don't think S3 repos should have a volume section, but I can't make the operator write out the config without one. When it has a volume, it interacts with the repo host, which does NOT have AWS_ROLE_ARN set.
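Since the variables can be present in one pod (the backup job or a debug pod) and absent in another (the repo host), the check is only meaningful when run in the container that actually executes pgbackrest. A minimal sketch of such a check follows. It assumes a Python interpreter exists in that container, which may not be true for the operator's images; the same three checks can be done by hand with env and ls. The function name check_irsa_env is made up for illustration.

```python
import os

REQUIRED = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def check_irsa_env(environ=os.environ):
    """Return a list of problems; an empty list means the basics are present."""
    problems = [f"{name} is not set" for name in REQUIRED if not environ.get(name)]
    token_file = environ.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    # The variable alone is not enough: the projected token must be mounted too.
    if token_file and not os.access(token_file, os.R_OK):
        problems.append(f"token file {token_file} is missing or unreadable")
    return problems


if __name__ == "__main__":
    for line in check_irsa_env() or ["IRSA variables and token file look present"]:
        print(line)
```

Comparing the output from the backup job pod, the repo host pod, and the instance pod shows which of them received the variables and the token mount.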