elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

EKS IRSA with Elasticsearch uses incorrect role #8090

Open pebrc opened 2 weeks ago

pebrc commented 2 weeks ago

@pebrc I'm sorry, I'm not sure whether I should open a new ticket, but this one might have to be reopened.

We are running Elasticsearch on EKS with IRSA. The ES version is Version: 8.14.1, Build: docker/93a57a1a76f556d8aee6a90d1a95b06187501310/2024-06-10T23:35:17.114581191Z, JVM: 22.0.1, which should contain the fix.

I believe this is the relevant part of the ES manifest (we're deploying the ES CRD with Helm, but that shouldn't matter here):

              env:
                - name: AWS_ROLE_SESSION_NAME
                  value: "{{ include "es.name" .}}-cluster-elasticsearch"
                - name: AWS_WEB_IDENTITY_TOKEN_FILE
                  value: "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"
                - name: AWS_ROLE_ARN
                  value: {{ .Values.serviceAccount.roleArn }}
              volumeMounts:
                - name: aws-iam-token
                  mountPath: /usr/share/elasticsearch/config/repository-s3
          volumes:
            - name: aws-iam-token
              projected:
                defaultMode: 420
                sources:
                  - serviceAccountToken:
                      audience: sts.amazonaws.com
                      expirationSeconds: 86400
                      path: aws-web-identity-token-file

I can verify that the symlink is created on the node:

elasticsearch@es-prd-es-data-2:~$ ls -al /usr/share/elasticsearch/config/repository-s3
total 4
drwxrwsrwt 3 root elasticsearch  100 Oct  7 14:08 .
drwxrwsrwx 9 root elasticsearch 4096 Oct  7 14:08 ..
drwxr-sr-x 2 root elasticsearch   60 Oct  7 14:08 ..2024_10_07_14_08_45.58930580
lrwxrwxrwx 1 root elasticsearch   30 Oct  7 14:08 ..data -> ..2024_10_07_14_08_45.58930580
lrwxrwxrwx 1 root elasticsearch   34 Oct  7 14:08 aws-web-identity-token-file -> ..data/aws-web-identity-token-file
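
Beyond checking that the symlink exists, it may help to check that the token it points to is still fresh, since the manifest requests expirationSeconds: 86400 and the failure only shows up after some time. Projected service account tokens are JWTs, so their claims can be read without any AWS tooling. The following is a minimal sketch, not part of the original report: the helper names decode_jwt_claims and seconds_until_expiry are hypothetical, the signature is deliberately not verified, and it assumes Python is available wherever the token file can be read.

```python
import base64
import json
import time


def decode_jwt_claims(token):
    """Return the claims of a JWT without verifying its signature."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded with the padding stripped; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(claims, now=None):
    """Seconds left before the 'exp' claim; negative if already expired."""
    now = time.time() if now is None else now
    return claims["exp"] - now


# Example usage (path taken from the manifest above):
# path = "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"
# with open(path) as f:
#     claims = decode_jwt_claims(f.read())
# print(claims.get("aud"), seconds_until_expiry(claims))
```

If the remaining lifetime is negative, or the audience is not sts.amazonaws.com, the token file itself would be the first suspect. If the token looks healthy, the problem is more likely in how the credentials are consumed.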

I can also verify that backup and restore operations work right after the cluster is created.

However, after some time (presumably a couple of days), restores stop working. Backups appear to be unaffected: there are files in S3 and no complaints in the logs. The logs contain this error:

```
org.elasticsearch.indices.recovery.RecoveryFailedException: [myindexname][2]: Recovery failed on {es-prd-es-data-1}{dho1YZSWRnqxL1YGJABMkA}{H-QVYCvUTxqJ14ARKQMz6g}{es-prd-es-data-1}{192.168.167.134}{192.168.167.134:9300}{d}{8.14.1}{7000099-8505000}{xpack.installed=true, transform.config_version=10.0.0, k8s_node_name=ip-10-200-100-6.eu-central-1.compute.internal, ml.config_version=12.0.0}
    at org.elasticsearch.index.shard.IndexShard.lambda$executeRecovery$36(IndexShard.java:3319)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$recoveryListener$9(StoreRecovery.java:402)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListener$2.onFailure(ActionListener.java:179)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
    at org.elasticsearch.index.shard.StoreRecovery.lambda$restore$19(StoreRecovery.java:600)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
    at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
    at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
    at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
    at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
    at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
    at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
    at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
    at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
    at org.elasticsearch.action.support.SubscribableListener$FailureResult.complete(SubscribableListener.java:394)
    at org.elasticsearch.action.support.SubscribableListener.tryComplete(SubscribableListener.java:306)
    at org.elasticsearch.action.support.SubscribableListener.setResult(SubscribableListener.java:331)
    at org.elasticsearch.action.support.SubscribableListener.onFailure(SubscribableListener.java:250)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$55(BlobStoreRepository.java:3330)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:31)
    at org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener.onFailure(ActionListenerImplementations.java:317)
    at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:151)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:967)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1570)
Caused by: [myindexname/MOkZmjo8T6K46_k4wmAmqA][[myindexname][2]] org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
    ... 42 more
Caused by: [myindexname/MOkZmjo8T6K46_k4wmAmqA][[myindexname][2]] org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: failed to restore snapshot [new-07-10-24/pB1C3ib7S9-L73yPMU5y9g]
    ... 14 more
Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBZMTJXWXJQKYZ5; S3 Extended Request ID: n7pzR2kAQkQwUl/Ojli4NjoyYckzwY15CfWlRwJwWXLnblKwoymHXiaDQk3A96JqQfQvO/3nxVs=; Proxy: null)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1879)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleServiceErrorResponse(AmazonHttpClient.java:1418)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1387)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5456)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5403)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1524)
    at org.elasticsearch.repositories.s3.S3RetryingInputStream.lambda$openStreamWithRetry$0(S3RetryingInputStream.java:100)
    at java.security.AccessController.doPrivileged(AccessController.java:319)
    at org.elasticsearch.repositories.s3.SocketAccess.doPrivileged(SocketAccess.java:31)
    at org.elasticsearch.repositories.s3.S3RetryingInputStream.openStreamWithRetry(S3RetryingInputStream.java:100)
    at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:85)
    at org.elasticsearch.repositories.s3.S3RetryingInputStream.<init>(S3RetryingInputStream.java:67)
    at org.elasticsearch.repositories.s3.S3BlobContainer.readBlob(S3BlobContainer.java:104)
    at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:123)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.loadShardSnapshot(BlobStoreRepository.java:3629)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$restoreShard$56(BlobStoreRepository.java:3347)
    at org.elasticsearch.action.ActionRunnable$4.doRun(ActionRunnable.java:100)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:984)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    ... 3 more
    Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBWZA7Q51PEW6DV; S3 Extended Request ID: iGEfqH4690vYswR7GlaM9SfJcG0ZLYOobdnU+yDviPT9xKunhEVklcYoNvCsQvyFcopsQgJArIQ=; Proxy: null)
        ... 30 more
    Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBJ8E9ND66D4CB8; S3 Extended Request ID: J9YYcZGRbVdceB8Ny3Rt33+vsGT/3TwzStlGRPy8yKaROLvOT+yWv5K0hRtlBNJY9Nz4p1X5/uk=; Proxy: null)
        ... 30 more
    Suppressed: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: amazon_s3_exception: User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: YMBTMKXD6H0BWYF7; S3 Extended Request ID: mHqdDKbOFJs7LPgVSfgWUOiI6K7VEMHBWPtvLxm/P5uJKTJPSXROTqKJjzupO1VC+4rnZRhW1yw=; Proxy: null)
        ... 30 more
```

which boils down to this:

User: arn:aws:sts::123456789023:assumed-role/prd-node-role-20241001140543910500000003/i-123456789abcde is not authorized to perform: s3:GetObject on resource: "arn:aws:s3:::my-backup-bucket/last-backup/indices/wQRa5pGbSX--weZJUCjgew/2/snap-pB1C3ib7S9-L73yPMU5y9g.dat" because no identity-based policy allows the s3:GetObject action

This means that the role from IRSA and AWS_WEB_IDENTITY_TOKEN_FILE was dropped when ES attempted the restore, and it used the node role (prd-node-role-20241001140543910500000003) instead of the pod role (prd-ESBackupRole). Installing the AWS CLI on the machine and running it shows the correct role (prd-ESBackupRole):

 PAGER='' HOME=/tmp/ ./aws/dist/aws sts get-caller-identity
{
    "UserId": "WHATEVER:es-cluster-elasticsearch",
    "Account": "123456789012",
    "Arn": "arn:aws:sts::123456789012:assumed-role/prd-ESBackupRole/es-cluster-elasticsearch"
}
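
To make the comparison between the two identities mechanical, for example when scanning many log lines, the role name can be pulled out of each assumed-role ARN. This is only an illustrative sketch: the regular expression and the helper name assumed_role are hypothetical and not part of any AWS or Elasticsearch API.

```python
import re

# Matches STS assumed-role ARNs of the form
# arn:aws:sts::<account>:assumed-role/<role>/<session>
ASSUMED_ROLE_RE = re.compile(
    r"arn:aws:sts::(?P<account>\d+):assumed-role/"
    r"(?P<role>[^/\s]+)/(?P<session>[^\s\"]+)"
)


def assumed_role(text):
    """Return account, role and session of the first assumed-role ARN in text, or None."""
    match = ASSUMED_ROLE_RE.search(text)
    return match.groupdict() if match else None


# Values copied from the error message and the CLI output above.
from_error = assumed_role(
    "User: arn:aws:sts::123456789023:assumed-role/"
    "prd-node-role-20241001140543910500000003/i-123456789abcde "
    "is not authorized to perform: s3:GetObject"
)
from_cli = assumed_role(
    '"Arn": "arn:aws:sts::123456789012:assumed-role/'
    'prd-ESBackupRole/es-cluster-elasticsearch"'
)

# Differing role names indicate that the failing request did not use the IRSA role.
print(from_error["role"], from_cli["role"], from_error["role"] == from_cli["role"])
```

A session name that looks like an EC2 instance ID, as in the error above, is consistent with credentials coming from the node's instance profile rather than from the web identity token.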

I'd really appreciate any pointers or help in troubleshooting this. Thanks in advance.

Originally posted by @ragne in #7208

pebrc commented 2 weeks ago

@ragne I created a new issue to track this. We need to report this to the Elasticsearch team as it is very likely an Elasticsearch issue.

pebrc commented 1 week ago

@ragne I have been running an IRSA setup for multiple days now and have so far not been able to reproduce your problem. I did run into another problem (with a closed connection pool), which I reported to the Elasticsearch team, but I could not reproduce the incorrect use of the role. I am taking hourly snapshots and have restored a few of them, so far without a problem.

ahmetdd commented 6 days ago

@pebrc Did you make any changes to the IRSA setup? Do you think the problem is completely solved?