aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service-to-service communication.

X-Ray Service map not showing call from one micro service to another AWS EKS Fargate setup with App Mesh #272

Closed nitinkapur closed 3 years ago

nitinkapur commented 4 years ago

I have an AWS EKS Fargate setup with an ingress gateway that makes calls to two microservices. The services are Permissions and Service-Providers.

X-Ray Service Map does show the calls are made correctly.

image

So the calls go from the virtual gateway to permissions and service-providers, and they are shown on the service map.

However, there is another call between these two services themselves that is not routed through the gateway: service-providers contacts permissions to find out whether the user who made the call is permitted to get that data, based on their auth token.

This particular call does not show up in the service map. As you can see, there is no line connecting the two services, although I do see that call in the traces.

image

The permissions service is referenced as http://ganesh-permissions in the service providers service. What am I missing?

Ingress Gateway X-ray logs

2020-10-12T17:21:51Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-12T17:21:52Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-12T17:21:54Z [Info] Successfully sent batch of 2 segments (0.013 seconds)
2020-10-12T17:21:57Z [Info] Successfully sent batch of 1 segments (0.007 seconds)
2020-10-12T17:22:04Z [Info] Successfully sent batch of 1 segments (0.006 seconds)
2020-10-12T17:22:06Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-12T17:22:07Z [Info] Successfully sent batch of 1 segments (0.016 seconds)
2020-10-12T17:22:09Z [Info] Successfully sent batch of 2 segments (0.013 seconds)
2020-10-12T17:22:12Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-12T17:22:19Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-12T17:22:21Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-12T17:22:22Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-12T17:22:24Z [Info] Successfully sent batch of 1 segments (0.004 seconds)

service-providers logs

2020-10-12T16:51:07Z [Info] Initializing AWS X-Ray daemon 3.2.0
2020-10-12T16:51:07Z [Info] Using buffer memory limit of 74 MB
2020-10-12T16:51:07Z [Info] 1184 segment buffers allocated
2020-10-12T16:51:07Z [Info] Using region: us-east-1
2020-10-12T16:52:08Z [Error] Get instance id metadata failed: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/instance-id: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-10-12T16:52:08Z [Info] HTTP Proxy server using X-Ray Endpoint : https://xray.us-east-1.amazonaws.com
2020-10-12T16:52:08Z [Info] Starting proxy http server on 0.0.0.0:2000
2020-10-12T16:52:10Z [Info] Successfully sent batch of 1 segments (0.285 seconds)
2020-10-12T16:52:10Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-12T17:05:53Z [Info] Successfully sent batch of 1 segments (0.012 seconds)
2020-10-12T17:14:02Z [Info] Successfully sent batch of 1 segments (0.016 seconds)

permissions logs

2020-10-15T05:06:48Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:48Z [Info] Successfully sent batch of 50 segments (0.012 seconds)
2020-10-15T05:06:48Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:48Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:48Z [Info] Successfully sent batch of 50 segments (0.011 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.009 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.011 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.011 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.008 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:49Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:50Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-15T05:06:50Z [Info] Successfully sent batch of 50 segments (0.014 seconds)
2020-10-15T05:06:50Z [Info] Successfully sent batch of 50 segments (0.011 seconds)
2020-10-15T05:06:50Z [Info] Successfully sent batch of 50 segments (0.012 seconds)
2020-10-15T05:06:50Z [Info] Successfully sent batch of 50 segments (0.009 seconds)
2020-10-15T05:06:50Z [Info] Successfully sent batch of 50 segments (0.008 seconds)
2020-10-15T05:06:51Z [Info] Successfully sent batch of 50 segments (0.009 seconds)
2020-10-15T05:06:51Z [Info] Successfully sent batch of 50 segments (0.010 seconds)

This is how the services are configured

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualGateway
metadata:
  name: ingress-gw
  namespace: dev
spec:
  namespaceSelector:
    matchLabels:
      gateway: ingress-gw
  podSelector:
    matchLabels:
      app: ingress-gw
  listeners:
    - portMapping:
        port: 8088
        protocol: http
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: GatewayRoute
metadata:
  name: ganesh-permissions
  namespace: dev
spec:
  httpRoute:
    match:
      prefix: "/ganesh-permissions/dev"
    action:
      target:
        virtualService:
          virtualServiceRef:
            name: ganesh-permissions
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualService
metadata:
  name: ganesh-permissions
  namespace: dev
spec:
  awsName: ganesh-permissions.dev.svc.cluster.local
  provider:
    virtualRouter:
      virtualRouterRef:
        name: ganesh-permissions
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  namespace: dev
  name: ganesh-permissions
spec:
  listeners:
    - portMapping:
        port: 80
        protocol: http
  routes:
    - name: ganesh-permissions-route
      priority: 10
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: ganesh-permissions-vnode
              weight: 1
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: GatewayRoute
metadata:
  name: service-providers
  namespace: dev
spec:
  httpRoute:
    match:
      prefix: "/service-providers/dev"
    action:
      target:
        virtualService:
          virtualServiceRef:
            name: service-providers
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualService
metadata:
  name: service-providers
  namespace: dev
spec:
  awsName: service-providers.dev.svc.cluster.local
  provider:
    virtualRouter:
      virtualRouterRef:
        name: service-providers
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  namespace: dev
  name: service-providers
spec:
  listeners:
    - portMapping:
        port: 80
        protocol: http
  routes:
    - name: service-providers-route
      priority: 10
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: service-providers-vnode
              weight: 1
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: ganesh-permissions-vnode
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: ganesh-permissions
  listeners:
    - portMapping:
        port: 80
        protocol: http
  serviceDiscovery:
    dns:
      hostname: ganesh-permissions.dev.svc.cluster.local
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: service-providers-vnode
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: service-providers
  listeners:
    - portMapping:
        port: 80
        protocol: http
  backends:
    - virtualService:
        virtualServiceRef:
          name: ganesh-permissions
  serviceDiscovery:
    dns:
      hostname: service-providers.dev.svc.cluster.local
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: worklink
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: ingress-gw
          servicePort: 80
        path: /*
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-gw
  namespace: dev
spec:
  ports:
  - port: 80
    targetPort: 8088
    protocol: TCP
  type: NodePort
  selector:
    app: ingress-gw
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-gw
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ingress-gw
  template:
    metadata:
      labels:
        app: ingress-gw
    spec:
      containers:
        - name: envoy
          image: 840364872350.dkr.ecr.region-code.amazonaws.com/aws-appmesh-envoy:v1.15.1.0-prod
          ports:
            - containerPort: 8088
      serviceAccountName: worklink-dev-sa
      securityContext:
        fsGroup: 65534
---
lavignes commented 4 years ago

Just to sanity check.

  1. Are the two services instrumented with the X-Ray SDK?
  2. In your service map is ganesh-permissions the lower virtual node?
  3. I'm curious if all three services saw the same Get instance id metadata failed: error at startup. This probably isn't an issue but the logs above seem to be taken from arbitrary points in time on different days. Are you able to reproduce this and provide logs for the same day/time?

nitinkapur commented 4 years ago

  1. Are the two services instrumented with the X-Ray SDK? The two services are running as pods in an AWS EKS cluster. The X-Ray daemon runs as an injected sidecar in the pod and sends traces.

  2. In your service map is ganesh-permissions the lower virtual node? If by lower virtual node you mean that it is being called from service-providers, then yes. Ganesh-permissions doesn't call any other service, but it is called from other services to check the logged-in client's permissions.

  3. I'm curious if all three services saw the same Get instance id metadata failed: error at startup. This probably isn't an issue but the logs above seem to be taken from arbitrary points in time on different days. Are you able to reproduce this and provide logs for the same day/time? I am posting the latest logs below.

ganesh-permissions

2020-10-23T03:11:22Z [Info] Successfully sent batch of 50 segments (0.026 seconds)
2020-10-23T03:11:22Z [Info] Successfully sent batch of 50 segments (0.026 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.024 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.021 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.043 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.009 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-23T03:11:23Z [Info] Successfully sent batch of 50 segments (0.012 seconds)
2020-10-23T03:11:24Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-23T03:11:24Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-23T03:11:24Z [Info] Successfully sent batch of 50 segments (0.010 seconds)
2020-10-23T03:11:24Z [Info] Successfully sent batch of 50 segments (0.011 seconds)

service-providers

2020-10-12T16:51:07Z [Info] Initializing AWS X-Ray daemon 3.2.0
2020-10-12T16:51:07Z [Info] Using buffer memory limit of 74 MB
2020-10-12T16:51:07Z [Info] 1184 segment buffers allocated
2020-10-12T16:51:07Z [Info] Using region: us-east-1
2020-10-12T16:52:08Z [Error] Get instance id metadata failed: RequestError: send request failed
caused by: Get http://169.254.169.254/latest/meta-data/instance-id: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2020-10-12T16:52:08Z [Info] HTTP Proxy server using X-Ray Endpoint : https://xray.us-east-1.amazonaws.com
2020-10-12T16:52:08Z [Info] Starting proxy http server on 0.0.0.0:2000
2020-10-12T16:52:10Z [Info] Successfully sent batch of 1 segments (0.285 seconds)
2020-10-12T16:52:10Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-12T17:05:53Z [Info] Successfully sent batch of 1 segments (0.012 seconds)
2020-10-12T17:14:02Z [Info] Successfully sent batch of 1 segments (0.016 seconds)
2020-10-12T22:55:42Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-12T22:55:47Z [Info] Successfully sent batch of 1 segments (0.012 seconds)
2020-10-12T23:22:23Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-12T23:22:30Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-13T14:13:10Z [Info] Successfully sent batch of 1 segments (0.012 seconds)
2020-10-14T14:12:08Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-14T15:53:48Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-15T05:01:55Z [Info] Successfully sent batch of 1 segments (0.011 seconds)
2020-10-15T05:02:35Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-15T05:09:52Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-15T05:09:56Z [Info] Successfully sent batch of 1 segments (0.012 seconds)
2020-10-15T05:10:43Z [Info] Successfully sent batch of 1 segments (0.004 seconds)

Ingress gateway

2020-10-23T03:14:47Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-23T03:14:55Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-23T03:14:56Z [Info] Successfully sent batch of 1 segments (0.013 seconds)
2020-10-23T03:15:02Z [Info] Successfully sent batch of 1 segments (0.015 seconds)
2020-10-23T03:15:10Z [Info] Successfully sent batch of 1 segments (0.005 seconds)
2020-10-23T03:15:17Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-23T03:15:26Z [Info] Successfully sent batch of 2 segments (0.005 seconds)
2020-10-23T03:15:32Z [Info] Successfully sent batch of 1 segments (0.004 seconds)
2020-10-23T03:15:41Z [Info] Successfully sent batch of 2 segments (0.005 seconds)
2020-10-23T03:15:47Z [Info] Successfully sent batch of 1 segments (0.005 seconds)
2020-10-23T03:15:56Z [Info] Successfully sent batch of 1 segments (0.014 seconds)
2020-10-23T03:16:02Z [Info] Successfully sent batch of 1 segments (0.005 seconds)

Also, it looks like this error can be ignored: https://github.com/aws/aws-app-mesh-examples/issues/141

lavignes commented 4 years ago

By lower virtual node I meant the lower virtual node in the service map image you provided. The names of the virtual nodes are cut off, so it is not clear which is which in your example. I ask because we see traffic originating from an unknown client into the lower node, and I wonder whether the traffic between the 2 virtual nodes is being classified as the unknown client in this case.

lavignes commented 4 years ago

The two services are running as pods in an AWS EKS cluster. The X-Ray daemon runs as an injected sidecar in the pod and sends traces.

So it sounds like you made no code modifications to the services themselves. Generally, it is recommended that you also instrument your application with the X-Ray SDK so that trace metadata from incoming requests is propagated to outgoing requests.

Without this, traces that originate at your virtual gateway's Envoy will stop once they reach a virtual node's Envoy.

That said, new traces between the services should still be emitted, but they cannot be related to any requests between the gateway and the virtual nodes.
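
To illustrate what propagating trace metadata involves, here is a minimal sketch in Python using only the standard library. It assumes the trace context arrives in the X-Amzn-Trace-Id HTTP header (the header X-Ray uses) and copies it onto an outgoing request to http://ganesh-permissions. The function names and the /roles path are hypothetical. The X-Ray SDK handles this for you and also records segments, so this only shows the idea and is not a replacement for the SDK.

```python
import urllib.request

TRACE_HEADER = "X-Amzn-Trace-Id"


def parse_trace_header(value):
    """Split 'Root=...;Parent=...;Sampled=1' into a dict of its fields."""
    fields = {}
    for part in value.split(";"):
        key, sep, val = part.strip().partition("=")
        if sep:
            fields[key] = val
    return fields


def outgoing_headers(incoming_headers):
    """Return headers for a downstream call that carry the caller's trace context."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in incoming_headers.items():
        if name.lower() == TRACE_HEADER.lower():
            return {TRACE_HEADER: value}
    return {}


def build_permissions_request(incoming_headers, path="/roles"):
    """Build (but do not send) a request to the permissions service."""
    # "/roles" is a hypothetical path; the host name is the one used in this issue.
    url = "http://ganesh-permissions" + path
    return urllib.request.Request(url, headers=outgoing_headers(incoming_headers))
```

A request handler in service-providers would pass the headers of the request it received to build_permissions_request, so that the call to permissions carries the same Root trace ID as the request that came in through the gateway.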

nitinkapur commented 4 years ago

Yes, the lower virtual node is ganesh-permissions. And yes, the traffic shown as originating from an unknown client actually originates from service-providers, which calls ganesh-permissions to get roles. Instead of showing it as coming from service-providers, the map uses an unknown client to represent the traffic flow.

Here is what I see in traces

image

lavignes commented 4 years ago

Why is it that the log for service-providers spans over the course of 3 days (none from today) and the other logs are from today?

Is there perhaps something else wrong with the service and the daemon is no longer functioning?

lavignes commented 4 years ago

Thanks for replying to my questions by the way. It's super helpful.

My current theory is that maybe outbound requests are not being intercepted by Envoy (and thus not being sent to the X-Ray daemon) or perhaps there is a bug in Envoy that is not attaching the proper trace metadata to the requests when the applications are not instrumented with the SDK.

I'll try and reproduce this myself in the morning (PDT time) since this should be enough detail.

nitinkapur commented 4 years ago

There is a bug in the services themselves (that is another issue we are dealing with), but it is not related to X-Ray. What happens is that service-providers, after the first request, keeps hitting ganesh-permissions non-stop with the same request. It looks like the programmer forgot to close the call. That may be why the service-providers logs look stale. But X-Ray should still show the request coming to ganesh-permissions from service-providers and not from some unknown client.

nitinkapur commented 4 years ago

One more point: in this presentation they are able to get the X-Ray daemon working, and the requests do seem to come from the correct microservices instead of an unknown client in X-Ray.

https://www.youtube.com/watch?v=NFpWnHE1Ckw&t=2204s

You can skip to 36:45 if you don't want to go through the entire presentation. Their setup is similar to mine, except that my EKS is on Fargate.

lavignes commented 3 years ago

Just wanted to update to make sure you didn't think we forgot about you.

I'm still investigating this but thinking it may actually be a bug in Envoy. Could you try using an older version of Envoy in your mesh and see what happens?

You should try v1.12.5 since I'm assuming you are using v1.15.1 right now:

840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.12.5.0-prod

nitinkapur commented 3 years ago

Thanks for helping me with this!

I can change the Envoy used on the gateway to an older version (1.12.5.0); I tried that and the results are the same. But I cannot go back to an older version of the Envoy proxy running as a sidecar, because it is injected automatically at the latest version. In EKS we just need to label the namespace to have an Envoy proxy injected into the pod running the application container. The Envoy proxy and application containers then run in the same pod.

apiVersion: v1
kind: Namespace
metadata:
  name: dev
  labels:
    mesh: worklink-mesh-dev
    gateway: ingress-gw
    appmesh.k8s.aws/sidecarInjectorWebhook: enabled

No matter what I do, the request is still shown as coming from an unknown client. I even tried to ssh into the service-providers container running inside the pod and curl ganesh-permissions; it still shows the request coming from an unknown client.

image

image

This is how the services are set up

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: ganesh-permissions
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: ganesh-permissions
  listeners:
    - portMapping:
        port: 80
        protocol: http
  serviceDiscovery:
    dns:
      hostname: ganesh-permissions.dev.svc.cluster.local
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  namespace: dev
  name: ganesh-permissions
spec:
  listeners:
    - portMapping:
        port: 80
        protocol: http
  routes:
    - name: ganesh-permissions-route
      priority: 10
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: ganesh-permissions
              weight: 1
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualService
metadata:
  name: ganesh-permissions
  namespace: dev
spec:
  awsName: ganesh-permissions.dev.svc.cluster.local
  provider:
    virtualRouter:
      virtualRouterRef:
        name: ganesh-permissions
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: service-providers
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: service-providers
  listeners:
    - portMapping:
        port: 80
        protocol: http
  serviceDiscovery:
    dns:
      hostname: service-providers.dev.svc.cluster.local
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  namespace: dev
  name: service-providers
spec:
  listeners:
    - portMapping:
        port: 80
        protocol: http
  routes:
    - name: service-providers-route
      priority: 10
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: service-providers
              weight: 1
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualService
metadata:
  name: service-providers
  namespace: dev
spec:
  awsName: service-providers.dev.svc.cluster.local
  provider:
    virtualRouter:
      virtualRouterRef:
        name: service-providers

This is how the gateway has been set up

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualGateway
metadata:
  name: ingress-gw
  namespace: dev
spec:
  namespaceSelector:
    matchLabels:
      gateway: ingress-gw
  podSelector:
    matchLabels:
      app: ingress-gw
  listeners:
    - portMapping:
        port: 8088
        protocol: http
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: GatewayRoute
metadata:
  name: ganesh-permissions
  namespace: dev
spec:
  httpRoute:
    match:
      prefix: "/ganesh-permissions/dev"
    action:
      target:
        virtualService:
          virtualServiceRef:
            name: ganesh-permissions
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: GatewayRoute
metadata:
  name: service-providers
  namespace: dev
spec:
  httpRoute:
    match:
      prefix: "/service-providers/dev"
    action:
      target:
        virtualService:
          virtualServiceRef:
            name: service-providers
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: worklink
  namespace: dev
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
spec:
  rules:
  - http:
      paths:
      - backend:
          serviceName: ingress-gw
          servicePort: 80
        path: /*
---
apiVersion: v1
kind: Service
metadata:
  name: ingress-gw
  namespace: dev
spec:
  ports:
  - port: 80
    targetPort: 8088
    protocol: TCP
  type: NodePort
  selector:
    app: ingress-gw
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-gw
  namespace: dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ingress-gw
  template:
    metadata:
      labels:
        app: ingress-gw
    spec:
      containers:
        - name: envoy
          image: 840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy:v1.12.5.0-prod
          ports:
            - containerPort: 8088
      serviceAccountName: worklink-dev-sa
      securityContext:
        fsGroup: 65534
---
fawadkhaliq commented 3 years ago

@nitinkapur you can override the default Envoy version using the Helm config here [1]. The config params are sidecar.image.repository and sidecar.image.tag.

[1] https://github.com/aws/eks-charts/tree/master/stable/appmesh-controller#configuration
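
For illustration, the override could look like the following Python sketch, which only assembles the Helm arguments and does not run them. The two --set keys are the parameters named above. The release name appmesh-controller, the chart reference eks/appmesh-controller, and the namespace appmesh-system are assumptions about a typical installation, and the image values are the ones posted earlier in this thread; adjust all of them to your setup.

```python
import shlex


def envoy_override_command(repository, tag,
                           release="appmesh-controller",      # assumed release name
                           chart="eks/appmesh-controller",    # assumed chart reference
                           namespace="appmesh-system"):       # assumed namespace
    """Build a `helm upgrade` command that pins the injected Envoy image."""
    return [
        "helm", "upgrade", release, chart,
        "--namespace", namespace,
        "--reuse-values",  # keep the values already set on the release
        "--set", f"sidecar.image.repository={repository}",
        "--set", f"sidecar.image.tag={tag}",
    ]


cmd = envoy_override_command(
    "840364872350.dkr.ecr.us-west-2.amazonaws.com/aws-appmesh-envoy",
    "v1.12.5.0-prod",
)
# Print the command as a single shell-quoted line for review before running it.
print(shlex.join(cmd))
```

Sidecars are injected when a pod is created, so pods that are already running keep their current Envoy image until they are recreated.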

nitinkapur commented 3 years ago

Thanks, @fawadkhaliq! I downgraded to an earlier version of Envoy, but it still doesn't show requests coming from service-providers and instead shows them coming from an unknown client.

lavignes commented 3 years ago

@nitinkapur Ok so that is good/bad since I was thinking this may be a regression. Now I'm not sure.

I'll continue to investigate this to make sure. But it is recommended that you instrument your applications with an X-Ray SDK (i.e. make changes to the applications themselves). The example you referenced above with a "color-app" most likely is using the X-Ray SDK.

Example: https://github.com/aws/aws-app-mesh-examples/blob/b46e509f9cb31418c14a627a34c755cd0b8cba10/examples/apps/colorapp/src/gateway/main.go#L125

Docs on instrumenting application with X-Ray: https://docs.aws.amazon.com/xray/latest/devguide/aws-xray.html

nitinkapur commented 3 years ago

So I have figured out the issue. If a virtual node calls any other virtual service, that virtual service must be properly defined as a backend in the virtual node to allow egress traffic. Otherwise the call from one service to the other is not traced:

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: service-providers
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: service-providers
  listeners:
    - portMapping:
        port: 80
        protocol: http
  backends:
    - virtualService:
        virtualServiceRef:
          name: ganesh-permissions
  serviceDiscovery:
    dns:
      hostname: service-providers.dev.svc.cluster.local
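
As a rough way to catch a missing backend before deploying, the sketch below checks that a VirtualNode manifest declares a backend for every virtual service its application calls. It is Python operating on a manifest that has already been parsed into a dict (for example by a YAML library). The function names and the list of called services are assumptions for illustration and are not part of App Mesh or its controller; the check only looks at virtualServiceRef names.

```python
def declared_backends(virtual_node):
    """Return the virtual service names listed under spec.backends."""
    names = set()
    for backend in virtual_node.get("spec", {}).get("backends", []):
        ref = backend.get("virtualService", {}).get("virtualServiceRef", {})
        if "name" in ref:
            names.add(ref["name"])
    return names


def missing_backends(virtual_node, called_services):
    """Return the called virtual services that the node does not declare."""
    return sorted(set(called_services) - declared_backends(virtual_node))


# The service-providers node from this thread, written without a backends section.
node = {
    "apiVersion": "appmesh.k8s.aws/v1beta2",
    "kind": "VirtualNode",
    "metadata": {"name": "service-providers", "namespace": "dev"},
    "spec": {
        "listeners": [{"portMapping": {"port": 80, "protocol": "http"}}],
        "serviceDiscovery": {
            "dns": {"hostname": "service-providers.dev.svc.cluster.local"}
        },
    },
}

print(missing_backends(node, ["ganesh-permissions"]))  # prints ['ganesh-permissions']
```

With the backends section from the manifest above added to the dict, the same call returns an empty list.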

image

Thanks for helping me out with this! This issue can be closed.

lavignes commented 3 years ago

@nitinkapur great to see that it's working. But in your original example above, you did have the permissions service as a backend:

Taken from above (the start of the issue):

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: service-providers-vnode
  namespace: dev
spec:
  podSelector:
    matchLabels:
      app: service-providers
  listeners:
    - portMapping:
        port: 80
        protocol: http
  backends:
    - virtualService:
        virtualServiceRef:
          name: ganesh-permissions
  serviceDiscovery:
    dns:
      hostname: service-providers.dev.svc.cluster.local

I just want to make sure that was the real issue before closing this out.

nitinkapur commented 3 years ago

Yes, right. I am not sure now what was wrong with my earlier implementation. I made so many changes to the virtual services, virtual nodes and virtual routers to make it work. The last change I made was to add the backend, after which it started working (I might have removed the backend earlier while debugging), so I thought that was the issue.

I have deployed all my other services now. However, it is still not perfect: I am not seeing any faults. The X-Ray service map should show faults in red, but none are showing up, although in the traces I can see 500 faults.

image

image

Apart from this, I have two more questions. How do I include the RDS database and the Redis ElastiCache service in this mesh so that I have end-to-end visibility? Why do I still see an unknown client for some of the services and not for others?

lavignes commented 3 years ago

X-Ray service map should show faults in Red

To be honest, there isn't enough data here to know for sure. I don't know what host is sending faults/what client is receiving them. The cropped image with response codes doesn't tell us much. If these requests are to remote services I do not believe you will see yellow/red on the service map unless you instrument your applications with the X-Ray SDK.
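
As background on the colours: X-Ray segment documents carry boolean error, throttle, and fault flags derived from the HTTP response status, and the service map colours nodes from those flags (yellow for client errors, purple for throttling, red for server faults). A minimal Python sketch of that mapping follows. The function name is hypothetical, and this is a simplification of the rule described in the X-Ray segment document reference, not the code of any SDK or of Envoy.

```python
def status_flags(status_code):
    """Map an HTTP status code to the X-Ray segment flags it implies."""
    flags = {}
    if 500 <= status_code <= 599:
        flags["fault"] = True          # server fault, drawn in red
    elif 400 <= status_code <= 499:
        flags["error"] = True          # client error, drawn in yellow
        if status_code == 429:
            flags["throttle"] = True   # throttling, drawn in purple
    return flags


for code in (200, 404, 429, 500):
    print(code, status_flags(code))
```

One thing that may be worth checking in the raw trace JSON is whether the segments that returned 500 actually have fault set.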

How do I include the RDS database and Redis ElastiCache services to be a part of this mesh so that I can have end to end visibility?

You'll want to instrument your applications with the X-Ray SDK as I've mentioned. Outbound requests from your applications will not be handled properly by X-Ray unless you do so. The X-Ray documentation has examples on how to add tracing to AWS SDK requests and SQL queries in multiple languages: https://docs.aws.amazon.com/xray/latest/devguide

Why do I still see unknown client from some of the services and not for some other services?

I am unsure. Your service map is showing more services than before so I don't know what you have deployed. I will say that the traces from the Clients in your service map should contain the request metadata showing their source.

I am unable to reproduce the behavior you are seeing so I don't think I can give much insight unless you can provide an example application or clear steps that we can follow to reproduce your issue.

lavignes commented 3 years ago

I'm going to close this out now since the bug being reported appears to have been a misconfiguration issue.

If you can reproduce the bug and can provide steps, then feel free to reopen. If you have other questions, feel free to reach out on our Slack community as well: https://github.com/aws/aws-app-mesh-roadmap#slack-community