aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication
Apache License 2.0

Bug: Missing x-amzn-trace-id in response headers #394

Open mkielar opened 2 years ago

mkielar commented 2 years ago

Summary Envoy doesn't add the x-amzn-trace-id header to the response when the upstream response lacks it. As a result, whether the header is returned depends entirely on whether the application container uses the X-Ray SDK or is otherwise configured to propagate x-amzn-trace-id from the request to the response.

Steps to Reproduce

  1. Deploy an ECS Service (let's call it svc-a) with vanilla Nginx (use nginx as image in containerDefinition). Add Envoy and X-Ray Sidecar, and enable Envoy <=> X-Ray Integration with ENABLE_ENVOY_XRAY_TRACING=1.
  2. Deploy another, identical ECS Service (let's call it svc-b), and point to svc-a in Virtual Node Backends Configuration, to let svc-b Envoy know there's an integration between the two.
  3. Use ECS Exec to open a shell in the svc-b Nginx container.
  4. From inside the svc-b Nginx container, make an HTTP request to svc-a, like this: curl -v http://svc-a.dev.local/foo1234
  5. The response does not contain x-amzn-trace-id HTTP Header
    $ curl -v http://svc-a.dev.local/foo1234
    *   Trying 10.130.119.69:80...
    * Connected to svc-a.dev.local (10.130.119.69) port 80 (#0)
    > GET /foo1234 HTTP/1.1
    > Host: svc-a.dev.local
    > User-Agent: curl/7.76.1
    > Accept: */*
    >
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < date: Wed, 02 Mar 2022 10:43:41 GMT
    < server: envoy
    < x-envoy-upstream-service-time: 4
    < transfer-encoding: chunked
  6. Observe that the x-amzn-trace-id header is missing.
  7. Navigate to the X-Ray console in the AWS Console and check the traces for svc-a.
  8. Observe that a trace for the /foo1234 request has been created.
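
The check in steps 4 to 6 can also be scripted. Below is a minimal sketch using only the Python standard library; the helper names `find_trace_header` and `check_url` are made up for illustration, and the URL is the svc-a address from the steps above.

```python
import urllib.request

TRACE_HEADER = "x-amzn-trace-id"


def find_trace_header(headers):
    """Return the trace header value from (name, value) pairs, or None.

    HTTP header names are case-insensitive, so compare in lower case.
    """
    for name, value in headers:
        if name.lower() == TRACE_HEADER:
            return value
    return None


def check_url(url):
    """Fetch url and return the x-amzn-trace-id response header, or None."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return find_trace_header(resp.headers.items())


if __name__ == "__main__":
    # Hostname taken from the reproduction steps; adjust for your own mesh.
    value = check_url("http://svc-a.dev.local/foo1234")
    print("trace header:", value if value is not None else "MISSING")
```

Run from inside the svc-b container, this should report the header as missing for the plain Nginx backend described in this issue.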

Are you currently working around this issue? We use the X-Ray SDK to instrument our backends. It works well with Python / Flask applications, but fails to work with our .NET applications (we're still investigating why, but that's how we found this out). Nginx is just an example, but it shows that things get difficult when the backend application is a closed-source third-party product and there may be no way to make it propagate headers.
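
For backends that can be modified but do not use the X-Ray SDK, the propagation itself is small. The following is a minimal sketch of a WSGI middleware that copies the incoming x-amzn-trace-id request header onto the response when the application has not set it. The class name `EchoTraceHeader` is hypothetical and this is not how the X-Ray SDK is implemented; it only illustrates the request-to-response copy.

```python
TRACE_ENVIRON_KEY = "HTTP_X_AMZN_TRACE_ID"  # WSGI name for the request header
TRACE_HEADER = "X-Amzn-Trace-Id"


class EchoTraceHeader:
    """WSGI middleware (hypothetical name) that echoes the trace header."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        trace_id = environ.get(TRACE_ENVIRON_KEY)

        def patched_start_response(status, headers, exc_info=None):
            # Do not overwrite a header the application already set.
            already_set = any(
                name.lower() == TRACE_HEADER.lower() for name, _ in headers
            )
            if trace_id and not already_set:
                headers = list(headers) + [(TRACE_HEADER, trace_id)]
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)
```

With Flask, for example, this could be applied as `app.wsgi_app = EchoTraceHeader(app.wsgi_app)`. It does not help with closed-source backends such as the Nginx case above.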

Additional context Envoy version: v1.20.0.1-prod

suniltheta commented 2 years ago

Hi @mkielar thanks for raising this bug,

Some clarifying questions:

  1. Is xray integration enabled in svc-b's envoy sidecar ?
  2. Does the trace seen on X-Ray console say Origin = AWS::AppMesh::Proxy or is it coming from Xray SDK ?
  3. Just making sure that the App Mesh Virtual Node listener protocol is not set as TCP. Ref: docs

mkielar commented 2 years ago

Hi @suniltheta,

  1. Is xray integration enabled in svc-b's envoy sidecar ?

Yes, it is. As a matter of fact, I tested this with the following configurations:

I then additionally tested some of the configs, replacing the X-Ray Sidecar with the AWS Distro for OpenTelemetry Sidecar, configured with an X-Ray Receiver / Exporter. In all of the cases I got the header back only if the X-Ray SDK was properly configured and was adding it to the response from the App Container. If the X-Ray SDK was misconfigured (our .NET apps) or missing (pure Nginx container), then the header was missing from the response.

  2. Does the trace seen on X-Ray console say Origin = AWS::AppMesh::Proxy or is it coming from Xray SDK ?

It says AWS::AppMesh::Proxy. I cannot post screenshots, because the setup I'm working on is not really svc-a / svc-b but our production system and I'm under NDA, but I can confirm that the visualizations on X-Ray UI show all elements of the call-chain correctly.

  3. Just making sure that the App Mesh Virtual Node listener protocol is not set as TCP. Ref: docs

It's not. These are all HTTP Services, and we have built reusable Terraform modules to deploy our ECS Services, so they all (Python / .NET / Nginx) get exactly the same configuration of Virtual Nodes / Services / Routes.

mkielar commented 2 years ago

OK, I think I can actually present a screenshot.

This one shows a trace for svc-b (an ECS Fargate Service running pure Nginx, with an Envoy Sidecar integrated with an OpenTelemetry Sidecar running an X-Ray Pipeline), being accessed by curl from an EC2 instance that does not have Envoy installed. image

Then, the next one is a trace for svc-b (same config as above) being accessed by curl executed from an Application Container running in ECS Fargate, after connecting to it with ECS Exec. The Application Container is part of an ECS Task which runs Envoy integrated with X-Ray Agent. The name of the initiating service had to be obfuscated because of the NDA, but otherwise it shows the expected set of components. image

suniltheta commented 2 years ago

I was able to recreate the issue using both X-Ray and Jaeger tracing. I believe the same behavior is common to other tracers as well.

Below, I used Jaeger as the trace collector for the Zipkin format.

In the call below, the backend is instrumented, i.e., it is made to copy the x-b3-* headers from the request to the response.

$ curl -v k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com/color
*   Trying 54.244.188.207...
* TCP_NODELAY set
* Connected to k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com (54.244.188.207) port 80 (#0)
> GET /color HTTP/1.1
> Host: k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 04 Mar 2022 16:48:08 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< server: envoy
< x-b3-traceid: 3ebe5fd2589f18da
< x-b3-spanid: 318f1d4b6dcb0364
< x-b3-parentspanid: 3ebe5fd2589f18da
< x-b3-sampled: 1
< x-b3-flags: None
< b3: None
< x-envoy-upstream-service-time: 0
< 
* Connection #0 to host k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com left intact
None* Closing connection 0

In the call below, the backend is not instrumented, i.e., it does not copy the x-b3-* headers from the request to the response.

$ curl -v k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com/color1
*   Trying 54.214.180.88...
* TCP_NODELAY set
* Connected to k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com (54.214.180.88) port 80 (#0)
> GET /color1 HTTP/1.1
> Host: k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com
> User-Agent: curl/7.64.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Date: Fri, 04 Mar 2022 16:48:10 GMT
< Transfer-Encoding: chunked
< Connection: keep-alive
< server: envoy
< x-envoy-upstream-service-time: 0
< 
* Connection #0 to host k8s-howtok8s-color-63786f35e6-501468261.us-west-2.elb.amazonaws.com left intact
None* Closing connection 0

I believe this is not a bug on the App Mesh side, nor even in the X-Ray extension.

If we look at the Envoy code where the headers are injected, this happens only on the request path. On the response path the headers are not injected again. If the application is not instrumented for tracing, the response will not contain the trace headers. So the onus is on the application/SDK to pass the header from the request to the response.

https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/xray/tracer.cc#L102
https://github.com/envoyproxy/envoy/blob/main/source/extensions/tracers/zipkin/zipkin_tracer_impl.cc#L42
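
The behavior described above can be pictured with a toy model. This is not Envoy code; `proxy`, `instrumented_backend` and `plain_backend` are hypothetical names, and the trace ID is a placeholder. It only mirrors the observation from the two curl runs: the header is added on the request path, and the response is passed through untouched.

```python
TRACE_HEADER = "x-amzn-trace-id"


def proxy(request_headers, backend, new_trace_id):
    """Toy proxy: inject the trace header into the request only."""
    upstream_request = dict(request_headers)
    # Request path: add a trace header if the caller did not send one.
    upstream_request.setdefault(TRACE_HEADER, new_trace_id)
    # Response path: return the backend's headers unchanged.
    return backend(upstream_request)


def instrumented_backend(request_headers):
    """Backend that copies the trace header from request to response."""
    return {"server": "app", TRACE_HEADER: request_headers[TRACE_HEADER]}


def plain_backend(request_headers):
    """Backend that ignores tracing headers (like vanilla Nginx here)."""
    return {"server": "app"}


if __name__ == "__main__":
    print(proxy({}, instrumented_backend, "Root=1-example"))
    print(proxy({}, plain_backend, "Root=1-example"))
```

Only the instrumented backend yields a response carrying the header, even though both backends received it in the request.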

suniltheta commented 2 years ago

Checking the Envoy debug logs for the logged x-amzn-trace-id headers and correlating them with the request definitely defeats the purpose of having tracing enabled. But there is not much we can do :(

mkielar commented 2 years ago

So, if I understand you correctly, it's Envoy's internal implementation that prevents any tracing plugin from enriching the response, correct? In that case, do you suggest I report this as a Feature Request at https://github.com/envoyproxy/envoy instead?

suniltheta commented 2 years ago

It is true that this has to come from Envoy itself. It follows from a design decision in Envoy's HTTP tracers.

mkielar commented 2 years ago

@suniltheta Thanks for the explanation and the further research. I have reported this for Envoy to review; hopefully they'll find my argumentation convincing. That said, I'm not sure what to do with this ticket. Should we leave it open for you to implement any required improvements to the X-Ray tracer extension once Envoy introduces an API that allows for response-header enrichment?

suniltheta commented 2 years ago

Hi @mkielar thanks for opening the discussion on the envoy github.

We can decide the outcome of this issue based on what the Envoy community decides about including the trace headers in the response. Meanwhile, I will mark this issue as blocked on an Envoy fix.