Failed to get IAM credentials from IAM role for service account when running with AppMesh envoy container

rimaulana commented 2 years ago

### Describe the question/issue

When aws-for-fluent-bit runs as a sidecar container alongside an injected AppMesh envoy proxy, it fails to use the IAM credentials assigned via IAM role for service account (IRSA) and falls back to the IAM credentials of the underlying worker node.

The issue is a race condition between the envoy and aws-for-fluent-bit containers. All outbound internet connections that do not come from envoy are redirected to envoy, so while envoy is not fully up, the connection from aws-for-fluent-bit to STS fails (as reflected in the log) and aws-for-fluent-bit falls back to requesting credentials from IMDS.

In an ECS task, container start-up dependencies make this setup work, but unfortunately there is no such thing in Kubernetes. Our workaround was to add a sleep command to the aws-for-fluent-bit container so that it waits for envoy to be fully up before starting the Fluent Bit binary.

### Configuration

```
apiVersion: v1
kind: Pod
metadata:
spec:
  containers:
  - env:
    - name: AWS_DEFAULT_REGION
      value: eu-central-1
    - name: AWS_REGION
      value: eu-central-1
    - name: AWS_ROLE_ARN
      value:
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image:
    imagePullPolicy: IfNotPresent
    name:
    resources:
      limits:
        cpu: 100m
        memory: 2048M
      requests:
        cpu: 100m
        memory: 512M
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: true
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-sfdg2
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  - env:
    - name: AWS_DEFAULT_REGION
      value: eu-central-1
    - name: AWS_REGION
      value: eu-central-1
    - name: AWS_ROLE_ARN
      value:
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: /aws-for-fluent-bit:2.23.2
    imagePullPolicy: IfNotPresent
    name: fluentbit
    ports:
    - containerPort: 2020
      name: web
      protocol: TCP
    resources:
      limits:
        cpu: 100m
        memory: 2048M
      requests:
        cpu: 100m
        memory: 512M
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: true
      runAsUser: 1000
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /fluent-bit/etc
      name: fluentbit-config
      readOnly: true
    - mountPath: /work
      name: fluentbit-work
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-sfdg2
      readOnly: true
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
  - env:
    - name: APPMESH_VIRTUAL_NODE_NAME
      value:
    - name: AWS_REGION
      value: eu-central-1
    - name: APPMESH_PREVIEW
      value: "0"
    - name: ENVOY_LOG_LEVEL
      value: info
    - name: ENVOY_ADMIN_ACCESS_PORT
      value: "9901"
    - name: ENVOY_ADMIN_ACCESS_LOG_FILE
      value: /tmp/envoy_admin_access.log
    - name: AWS_ROLE_ARN
      value:
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: 840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.20.0.1-prod
    imagePullPolicy: IfNotPresent
    lifecycle:
      preStop:
        exec:
          command:
          - sh
          - -c
          - sleep 20
    name: envoy
    ports:
    - containerPort: 9901
      name: stats
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - sh
        - -c
        - curl -s http://localhost:9901/server_info | grep state | grep -q LIVE
      failureThreshold: 3
      initialDelaySeconds: 1
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 10m
        memory: 32Mi
    securityContext:
      runAsUser: 1337
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-sfdg2
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  imagePullSecrets:
  - name:
  initContainers:
  - env:
    - name: APPMESH_START_ENABLED
      value: "1"
    - name: APPMESH_IGNORE_UID
      value: "1337"
    - name: APPMESH_ENVOY_INGRESS_PORT
      value: "15000"
    - name: APPMESH_ENVOY_EGRESS_PORT
      value: "15001"
    - name: APPMESH_APP_PORTS
      value: ""
    - name: APPMESH_EGRESS_IGNORED_IP
      value: 169.254.169.254
    - name: APPMESH_EGRESS_IGNORED_PORTS
      value: "22"
    - name: AWS_DEFAULT_REGION
      value: eu-central-1
    - name: AWS_REGION
      value: eu-central-1
    - name: AWS_ROLE_ARN
      value:
    - name: AWS_WEB_IDENTITY_TOKEN_FILE
      value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    image: 840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-proxy-route-manager:v4-prod
    imagePullPolicy: IfNotPresent
    name: proxyinit
    resources:
      requests:
        cpu: 10m
        memory: 32Mi
    securityContext:
      capabilities:
        add:
        - NET_ADMIN
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
      name: aws-iam-token
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-sfdg2
      readOnly: true
  nodeName:
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 65534
    runAsGroup: 65534
    runAsUser: 0
  serviceAccount:
  serviceAccountName:
  terminationGracePeriodSeconds: 30
```

### Fluent Bit Log Output

```
Fluent Bit v1.8.14
* Copyright (C) 2015-2021 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/03/30 14:41:25] [ info] Configuration:
[2022/03/30 14:41:25] [ info] flush time | 5.000000 seconds
[2022/03/30 14:41:25] [ info] grace | 5 seconds
[2022/03/30 14:41:25] [ info] daemon | 0
[2022/03/30 14:41:25] [ info] ___________
[2022/03/30 14:41:25] [ info] inputs:
[2022/03/30 14:41:25] [ info] tail
[2022/03/30 14:41:25] [ info] tail
[2022/03/30 14:41:25] [ info] ___________
[2022/03/30 14:41:25] [ info] filters:
[2022/03/30 14:41:25] [ info] ___________
[2022/03/30 14:41:25] [ info] outputs:
[2022/03/30 14:41:25] [ info] es.0
[2022/03/30 14:41:25] [ info] es.1
[2022/03/30 14:41:25] [ info] ___________
[2022/03/30 14:41:25] [ info] collectors:
[2022/03/30 14:41:25] [ info] [engine] started (pid=1)
[2022/03/30 14:41:25] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2022/03/30 14:41:25] [debug] [storage] [cio stream] new stream registered: tail.0
[2022/03/30 14:41:25] [debug] [storage] [cio stream] new stream registered: tail.1
[2022/03/30 14:41:25] [ info] [storage] version=1.1.6, initializing...
[2022/03/30 14:41:25] [ info] [storage] in-memory
[2022/03/30 14:41:25] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2022/03/30 14:41:25] [ info] [cmetrics] version=0.2.2
[2022/03/30 14:41:27] [debug] [es:es.0] created event channels: read=37 write=38
[2022/03/30 14:41:27] [debug] [out_es] Enabled AWS Auth
[2022/03/30 14:41:27] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2022/03/30 14:41:27] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2022/03/30 14:41:27] [debug] [aws_credentials] Initialized EKS Provider in standard chain
[2022/03/30 14:41:27] [debug] [aws_credentials] Not initializing ECS Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set
[2022/03/30 14:41:27] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2022/03/30 14:41:27] [debug] [aws_credentials] Sync called on the EKS provider
[2022/03/30 14:41:27] [debug] [aws_credentials] Sync called on the EC2 provider
[2022/03/30 14:41:27] [debug] [aws_credentials] Init called on the env provider
[2022/03/30 14:41:27] [debug] [aws_credentials] Init called on the profile provider
[2022/03/30 14:41:27] [debug] [aws_credentials] Reading shared config file.
[2022/03/30 14:41:27] [debug] [aws_credentials] Shared config file /.aws/config does not exist
[2022/03/30 14:41:27] [debug] [aws_credentials] Reading shared credentials file.
[2022/03/30 14:41:27] [debug] [aws_credentials] Shared credentials file /.aws/credentials does not exist
[2022/03/30 14:41:27] [debug] [aws_credentials] Init called on the EKS provider
[2022/03/30 14:41:27] [debug] [aws_credentials] Calling STS..
[2022/03/30 14:41:27] [debug] [net] socket #39 could not connect to 54.239.54.197:443
[2022/03/30 14:41:27] [debug] [net] could not connect to sts.eu-central-1.amazonaws.com:443
[2022/03/30 14:41:27] [debug] [upstream] connection #-1 failed to sts.eu-central-1.amazonaws.com:443
[2022/03/30 14:41:27] [debug] [aws_client] connection initialization error
[2022/03/30 14:41:27] [debug] [aws_credentials] STS assume role request failed
[2022/03/30 14:41:27] [debug] [aws_credentials] Init called on the EC2 IMDS provider
[2022/03/30 14:41:27] [debug] [aws_credentials] requesting credentials from EC2 IMDS
[2022/03/30 14:41:27] [debug] [http_client] not using http_proxy for header
[2022/03/30 14:41:27] [debug] [http_client] server 169.254.169.254:80 will close connection #39
[2022/03/30 14:41:27] [debug] [aws_client] (null): http_do=0, HTTP Status: 401
[2022/03/30 14:41:27] [debug] [http_client] not using http_proxy for header
[2022/03/30 14:41:28] [ warn] [net] io_read #39 timeout after 1 seconds from: 169.254.169.254:80
[2022/03/30 14:41:28] [error] [src/flb_network.c:224 errno=9] Bad file descriptor
[2022/03/30 14:41:28] [error] [http_client] broken connection to 169.254.169.254:80 ?
[2022/03/30 14:41:28] [debug] [aws_client] (null): http_do=-1, HTTP Status: 0
[2022/03/30 14:41:28] [debug] [upstream] KA connection #39 to 169.254.169.254:80 could not be registered, closing.
AWS for Fluent Bit Container Image Version 2.23.2
[2022/03/30 14:41:28] [ Error] epoll_ctl: Bad file descriptr, errno=9 at /tmp/fluent-bit-1.8.14/lib/monkey/mk_core/mk_event_epoll.c:136
[2022/03/30 14:41:28] [debug] [http_client] not using http_proxy for header
[2022/03/30 14:41:28] [debug] [http_client] server 169.254.169.254:80 will close connection #39
[2022/03/30 14:41:28] [ info] [imds] to use IMDSv2, set --http-put-response-limit to 2
[2022/03/30 14:41:28] [ warn] [imds] falling back on IMDSv1
[2022/03/30 14:41:28] [debug] [imds] using IMDSv1
[2022/03/30 14:41:28] [debug] [http_client] not using http_proxy for header
[2022/03/30 14:41:28] [debug] [http_client] server 169.254.169.254:80 will close connection #39
[2022/03/30 14:41:28] [debug] [aws_credentials] Requesting credentials for instance role
[2022/03/30 14:41:28] [debug] [imds] using IMDSv1
[2022/03/30 14:41:28] [debug] [http_client] not using http_proxy for header
[2022/03/30 14:41:28] [debug] [http_client] server 169.254.169.254:80 will close connection #39
[2022/03/30 14:41:28] [debug] [aws_credentials] Async called on the EKS provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Async called on the EC2 provider
[2022/03/30 14:41:28] [debug] [aws_credentials] upstream_set called on the EKS provider
[2022/03/30 14:41:28] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2022/03/30 14:41:28] [debug] [output:es:es.0] host=.es.amazonaws.com port=443 uri=/_bulk index=fluent-bit type=_doc
[2022/03/30 14:41:28] [debug] [es:es.1] created event channels: read=49 write=50
[2022/03/30 14:41:28] [ info] [output:es:es.0] worker #0 started
[2022/03/30 14:41:28] [ info] [output:es:es.0] worker #1 started
[2022/03/30 14:41:28] [debug] [out_es] Enabled AWS Auth
[2022/03/30 14:41:28] [debug] [aws_credentials] Initialized Env Provider in standard chain
[2022/03/30 14:41:28] [debug] [aws_credentials] Initialized AWS Profile Provider in standard chain
[2022/03/30 14:41:28] [debug] [aws_credentials] Initialized EKS Provider in standard chain
[2022/03/30 14:41:28] [debug] [aws_credentials] Not initializing ECS Provider because AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set
[2022/03/30 14:41:28] [debug] [aws_credentials] Initialized EC2 Provider in standard chain
[2022/03/30 14:41:28] [debug] [aws_credentials] Sync called on the EKS provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Sync called on the EC2 provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Init called on the env provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Init called on the profile provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Reading shared config file.
[2022/03/30 14:41:28] [debug] [aws_credentials] Shared config file /.aws/config does not exist
[2022/03/30 14:41:28] [debug] [aws_credentials] Reading shared credentials file.
[2022/03/30 14:41:28] [debug] [aws_credentials] Shared credentials file /.aws/credentials does not exist
[2022/03/30 14:41:28] [debug] [aws_credentials] Init called on the EKS provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Calling STS..
[2022/03/30 14:41:28] [debug] [http_client] not using http_proxy for header
[2022/03/30 14:41:28] [debug] [upstream] KA connection #59 to sts.eu-central-1.amazonaws.com:443 is now available
[2022/03/30 14:41:28] [debug] [aws_credentials] Async called on the EKS provider
[2022/03/30 14:41:28] [debug] [aws_credentials] Async called on the EC2 provider
[2022/03/30 14:41:28] [debug] [aws_credentials] upstream_set called on the EKS provider
[2022/03/30 14:41:28] [debug] [aws_credentials] upstream_set called on the EC2 provider
[2022/03/30 14:41:28] [debug] [output:es:es.1] host=.es.amazonaws.com port=443 uri=/_bulk index=fluent-bit type=_doc
```

### Fluent Bit Version Info

aws-for-fluent-bit:2.23.2

### Cluster Details

The solution runs on EKS with AppMesh, on EC2 worker nodes, with aws-for-fluent-bit as a sidecar.

### Steps to reproduce issue

Create a pod with aws-for-fluent-bit as a sidecar and make it part of an AppMesh node, so that the AppMesh envoy container and proxyinit are injected into it.
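One way to confirm that a given pod hit the fallback described in this issue is to search the Fluent Bit debug log for the two messages shown in the log output: `STS assume role request failed` and `requesting credentials from EC2 IMDS`. The following is a minimal sketch under the assumption that debug logging is enabled; the function name `used_imds_fallback` is hypothetical and not part of any tool.

```shell
#!/bin/sh
# used_imds_fallback (hypothetical helper): reads a Fluent Bit debug log on
# stdin and succeeds only if it contains both the failed STS call and the
# subsequent IMDS credential request. It does not check their order.
used_imds_fallback() {
    log=$(cat)
    printf '%s\n' "$log" | grep -q 'STS assume role request failed' &&
        printf '%s\n' "$log" | grep -q 'requesting credentials from EC2 IMDS'
}
```

For example, the output of `kubectl logs <pod> -c fluentbit` could be piped into `used_imds_fallback`, where `<pod>` is a placeholder for the pod name and `fluentbit` is the container name from the configuration in this issue.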

matthewfala commented 2 years ago

Just to confirm, you are trying to find an alternative to waiting some random amount of time for the sts endpoint to become available?

I'm wondering if you could wrap the aws-for-fluent-bit image in a custom image that waits for the STS endpoint to become available before the entry point is called. That way Fluent Bit will be able to detect the endpoint.

Something like the following could be added before the entry point is called. Just an idea. What do you think?

# Poll the STS endpoint every 5 seconds until the request succeeds
until curl --output /dev/null --silent --head --fail https://sts.us-west-2.amazonaws.com; do
    printf '.'
    sleep 5
done
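If an unbounded loop is a concern, the same idea can be written with a retry limit, so the container fails visibly instead of hanging forever when STS never becomes reachable. This is only a sketch: `wait_until` and `sts_reachable` are hypothetical helper names, and the regional endpoint is assumed to follow the `sts.<region>.amazonaws.com` form seen in the log, built from the `AWS_REGION` variable already set in the pod spec.

```shell
#!/bin/sh
# wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND at least once and repeats it until it succeeds or
# MAX_ATTEMPTS attempts have been made. Returns 0 on success, 1 otherwise.
wait_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$max_attempts" ] && return 1
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# sts_reachable (hypothetical probe): succeeds once the regional STS
# endpoint answers a HEAD request within 5 seconds.
sts_reachable() {
    curl --output /dev/null --silent --head --fail --max-time 5 \
        "https://sts.${AWS_REGION}.amazonaws.com"
}
```

A wrapper could then run `wait_until 12 5 sts_reachable` and start Fluent Bit only if it returns 0. The values 12 and 5 are arbitrary illustrations, not recommendations.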

guidoffm commented 2 years ago

@matthewfala Yes, this could be a solution. Normally you would not need to build another image, since you can provide command and args for the container in the pod spec. What is interesting, though: there is another container running Java in the same pod, and it does not have this problem. Maybe it simply needs some more time to come up.
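As a sketch of the command/args approach, the script below waits for envoy instead of STS, reusing the check from the envoy readiness probe in the configuration above (`curl -s http://localhost:9901/server_info | grep state | grep -q LIVE`). Several things here are assumptions: that `curl` is available in the Fluent Bit image, that the envoy admin port is reachable on localhost from the sidecar, and the names `state_is_live`, `envoy_is_live`, `wait_for_envoy_then_exec` and `ENVOY_ADMIN_PORT`, which are all hypothetical.

```shell
#!/bin/sh
# state_is_live: reads Envoy /server_info output on stdin and succeeds only
# if a line mentioning "state" also reports LIVE (same test as the probe).
state_is_live() {
    grep state | grep -q LIVE
}

# envoy_is_live: queries the Envoy admin endpoint. The default port 9901
# matches ENVOY_ADMIN_ACCESS_PORT in the pod spec; ENVOY_ADMIN_PORT is a
# hypothetical override variable.
envoy_is_live() {
    curl -s --max-time 2 "http://localhost:${ENVOY_ADMIN_PORT:-9901}/server_info" |
        state_is_live
}

# wait_for_envoy_then_exec COMMAND [ARGS...]
# Polls once per second until Envoy reports LIVE, then replaces the shell
# with the given command (for example, the image's original start command).
wait_for_envoy_then_exec() {
    until envoy_is_live; do
        sleep 1
    done
    exec "$@"
}
```

The container's `command` could invoke a script like this through `sh -c`, passing the image's original start command as the arguments to `wait_for_envoy_then_exec`. That start command is not shown in this thread, so it is left unspecified here.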

matthewfala commented 2 years ago

Thanks @guidoffm. I talked to the team and they proposed a different solution. The credential provider retries when it fails to obtain credentials, so what we need is for the IMDS endpoint to not be discoverable.

Is it possible for you to turn off IMDS for your service? Not sure if this would mess things up.

Otherwise, you can add the IMDS IP to your container's hosts file and redirect it to some invalid address:

# Add to bottom of /etc/hosts file

# IMDS invalidation
169.254.169.254 192.0.2.0

This assumes that 192.0.2.0 is guaranteed not to exist.

matthewfala commented 2 years ago

@guidoffm, closing this issue due to no response. Please reopen this ticket if you have any remaining questions or concerns.

aws / aws-for-fluent-bit

Failed to get IAM credentials from IAM role for service account when running with AppMesh envoy container #321
