aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication
Apache License 2.0

Bug: Envoy proxy failing with AWS STS credential errors #464

Open shekharpalit opened 1 year ago

shekharpalit commented 1 year ago


Summary

Our application, running on an AWS EKS cluster with AWS App Mesh, is experiencing connectivity issues: the application's pods cannot reach the internet. We have set the egressFilter in App Mesh to ALLOW_ALL, and the service account attached to the pods has the necessary IAM policies (AWSCloudMapFullAccess, AWSAppMeshFullAccess, and AWSAppMeshEnvoyAccess) associated.

When checking the logs of the Envoy proxy, we observed the following error message:

[2023-06-01 00:43:50.278][21][error][aws] [source/extensions/common/aws/credentials_provider_impl.cc:279] Could not load AWS credentials document from STS
[2023-06-01 00:43:50.288][15][warning][config] [./source/common/config/grpc_stream.h:163] StreamAggregatedResources gRPC config stream to appmesh-envoy-management.us-east-1.amazonaws.com:443 closed: 7, Unauthorized to perform appmesh:StreamAggregatedResources for arn:aws:appmesh:us-east-1:513177627844:mesh/example/virtualNode/user-management-service-virtual-node.
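
For readers triaging similar output, the two lines above can be split into their fields. The sketch below assumes Envoy's default log layout as it appears in this excerpt; the function name `parse_envoy_line` and the small status table are illustrative, not part of Envoy or App Mesh. In gRPC's standard status codes, code 7 is PERMISSION_DENIED, which matches the "Unauthorized" text in the second line.

```python
import re

# Layout seen in the excerpt: [timestamp][thread][level][logger] [source:line] message
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<thread>\d+)\]\[(?P<level>\w+)\]\[(?P<logger>\w+)\] "
    r"\[(?P<source>[^\]]+)\] (?P<message>.*)$"
)
# The config stream reports its close as "closed: <code>, <detail>".
CLOSE_RE = re.compile(r"closed: (?P<code>\d+), (?P<detail>.*)$")

# A few standard gRPC status codes relevant to auth and connectivity problems.
GRPC_STATUS = {7: "PERMISSION_DENIED", 14: "UNAVAILABLE", 16: "UNAUTHENTICATED"}


def parse_envoy_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    close = CLOSE_RE.search(entry["message"])
    if close:
        code = int(close.group("code"))
        entry["grpc_code"] = code
        entry["grpc_status"] = GRPC_STATUS.get(code, "UNKNOWN")
    return entry
```

Read this way, the first line says the `aws` logger could not obtain credentials, and the second says the management stream was then rejected as unauthorized, so the credential failure is the one to chase first.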

We have tried several troubleshooting steps, including verifying the IAM policies, the IAM role's trust relationship, the service account assignments, and the system time on the EKS nodes, but the issue persists.
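
One of the checks listed above, the role's trust relationship, can be made mechanical. With IAM roles for service accounts, the trust policy normally allows `sts:AssumeRoleWithWebIdentity` under a condition whose key ends in `:sub` and whose value is `system:serviceaccount:<namespace>:<service-account-name>`. The sketch below is a minimal check under that assumption; the function name is hypothetical, the policy is expected as an already-parsed JSON document, and `StringLike` wildcards are only approximated with `fnmatchcase`.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    # IAM allows a single string wherever a list of strings is accepted.
    return [value] if isinstance(value, str) else list(value)


def trust_policy_names_service_account(policy, namespace, service_account):
    """Return True if some Allow statement names this service account in a :sub condition."""
    expected_sub = f"system:serviceaccount:{namespace}:{service_account}"
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        if "sts:AssumeRoleWithWebIdentity" not in _as_list(stmt.get("Action", [])):
            continue
        condition = stmt.get("Condition", {})
        for operator in ("StringEquals", "StringLike"):
            for key, value in condition.get(operator, {}).items():
                if not key.endswith(":sub"):
                    continue
                for pattern in _as_list(value):
                    if operator == "StringEquals" and pattern == expected_sub:
                        return True
                    if operator == "StringLike" and fnmatchcase(expected_sub, pattern):
                        return True
    return False
```

A mismatch between the namespace or name in the policy and the ServiceAccount the pod actually uses makes this return False. Whether that is the cause in this issue is not established by the thread.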

The Micronaut app we are trying to run inside the pod fails with the following error after we activate the mesh:

{"timeMillis":1685493940463,"thread":"main","level":"ERROR","loggerName":"io.micronaut.runtime.Micronaut","message":"Error starting Micronaut server: Unable to execute HTTP request: Network is unreachable","thrown":{"commonElementCount":0,"localizedMessage":"Unable to execute HTTP request: Network is unreachable","message":"Unable to execute HTTP request: Network is unreachable","name":"software.amazon.awssdk.core.exception.SdkClientException","cause":{"commonElementCount":45,"localizedMessage":"Network is unreachable","message":"Network is unreachable","name":"java.net.SocketException","extendedStackTrace":"java.net.SocketException: Network is unreachable\n\tat sun.nio.ch.Net.connect0(Native Method) ~[?:?]\n\tat sun.nio.ch.Net.connect(Net.java:579) ~[?:?]\n\tat sun.nio.ch.Net.connect(Net.java:568) ~[?:?]\n\tat sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588) ~[?:?]\n\tat java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327) ~[?:?]\n\tat java.net.Socket.connect(Socket.java:633) ~[?:?]\n\tat org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368) ~[httpclient-4.5.13.jar:4.5.13]\n\tat software.amazon.awssdk.http.apache.internal.conn.SdkTlsSocketFactory.connectSocket(SdkTlsSocketFactory.java:65) ~[apache-client-2.20.51.jar:?]\n\tat 

This is my YAML file, which creates the virtual service, virtual router, and virtual node:

---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: {{ include "helm.fullname" . }}-vn
  namespace: {{ .Release.Namespace }}
spec:
  awsName: {{ include "helm.fullname" . }}-virtual-node
  podSelector:
    matchLabels:
      app: {{ include "helm.name" . }}
  listeners:
    - portMapping:
        port: {{ .Values.service.port }}
        protocol: http
  serviceDiscovery:
    dns:
      hostname: {{ include "helm.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  namespace: {{ .Release.Namespace }}
  name: {{ include "helm.fullname" . }}-vr
spec:
  awsName: {{ include "helm.fullname" . }}-virtual-router
  listeners:
    - portMapping:
        port: {{ .Values.service.port }}
        protocol: http
  routes:
    - name: {{ include "helm.fullname" . }}-route
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: {{ include "helm.fullname" . }}-vn
              weight: 1
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualService
metadata:
  namespace: {{ .Release.Namespace }}
  name: {{ include "helm.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local
spec:
  awsName: {{ include "helm.fullname" . }}.{{ .Release.Namespace }}.svc.cluster.local
  provider:
    virtualRouter:
      virtualRouterRef:
        name: {{ include "helm.fullname" . }}-vr

And this is the serviceaccount.yaml file I am using in the Helm chart:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: {{ include "helm.fullname" . }}
  namespace: {{ .Release.Namespace }}
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXX:role/eksctl-eks-addon-iamserviceaccount-default-d-Role1-XXXX

Sidecar injection is enabled in the deployment.yaml file of the Helm chart:

  template:
    metadata:
      annotations:
        appmesh.k8s.aws/sidecarInjectorWebhook: enabled

Note:

When we deploy the service without the mesh, it deploys successfully.

Steps to Reproduce

Expected behavior

The application should be able to reach out to the internet and not present any STS credential-related errors in the Envoy logs.

Actual behavior

The application fails to reach the internet and the Envoy logs present STS credential-related errors.


suniltheta commented 1 year ago

Can you please create a Support ticket for this issue? The issue seems specific to your setup.

shekharpalit commented 1 year ago

> Can you please create a Support ticket for this issue? The issue seems specific to your setup.

Support has not been helpful here. Can you please guide me on what I am missing and how to resolve this issue?

suniltheta commented 1 year ago

We need to understand why it is failing to load the AWS credentials document from STS. Can you enable debug logs so we can see in more detail why it fails?
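
One way to raise the log level on a running sidecar is Envoy's admin endpoint, `POST /logging?level=debug` (or `POST /logging?<logger>=<level>` for a single logger such as `aws`, the logger that emitted the credential error). The sketch below assumes the admin interface listens on port 9901 and has been made reachable on localhost, for example with `kubectl port-forward`; the helper names are illustrative, and this may not be the method the maintainers had in mind.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def logging_url(base="http://localhost:9901", level="debug", logger=None):
    """Build the admin URL; a logger name narrows the change to that logger."""
    query = {logger: level} if logger else {"level": level}
    return f"{base}/logging?{urlencode(query)}"


def set_log_level(base="http://localhost:9901", level="debug", logger=None, timeout=5):
    """POST to the Envoy admin interface and return its text response."""
    request = Request(logging_url(base, level, logger), data=b"", method="POST")
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode()
```

Calling `set_log_level(logger="aws")` would raise only the `aws` logger to debug, which keeps the rest of the output readable while the credential provider is investigated.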

Sometimes AWS_WEB_IDENTITY_TOKEN_FILE will be missing if AWS_ROLE_ARN is manually specified. By design, the EKS pod identity webhook will not overwrite a customer-defined AWS_ROLE_ARN/AWS_WEB_IDENTITY_TOKEN_FILE. See https://github.com/aws/amazon-eks-pod-identity-webhook/blob/master/pkg/handler/handler.go#L142-L154

AWS Troubleshooting docs: https://docs.aws.amazon.com/app-mesh/latest/userguide/troubleshooting-kubernetes.html#ts-kubernetes-irsa-not-working
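
The known issue described above can be checked from inside a container by confirming that both variables are present and that the token file exists. This is a minimal sketch: `check_irsa_env` is a hypothetical helper, and it assumes a Python interpreter is available in the container being inspected, which may not hold for the Envoy sidecar image.

```python
import os

REQUIRED_VARS = ("AWS_ROLE_ARN", "AWS_WEB_IDENTITY_TOKEN_FILE")


def check_irsa_env(env=None):
    """Return a list of human-readable problems; an empty list means nothing obvious is missing."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    token_path = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")
    # The webhook normally mounts a projected token at this path.
    if token_path and not os.path.isfile(token_path):
        problems.append(f"token file {token_path} does not exist")
    return problems


if __name__ == "__main__":
    for problem in check_irsa_env():
        print(problem)
```

If AWS_ROLE_ARN is set but AWS_WEB_IDENTITY_TOKEN_FILE is reported as missing, that is consistent with the manually specified role ARN case described above.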

This is just one known issue, and I am not sure whether it applies in your case. Through a support ticket we would be able to get into the details of the issue. Can you please let me know if you already have an open ticket for it?