aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication
Apache License 2.0
347 stars 25 forks source link

Feature Request: Tracing Support for GRPC Error Codes #256

Open hbaldwi opened 4 years ago

hbaldwi commented 4 years ago

If you want to see App Mesh implement this idea, please upvote with a :+1:.

Tell us about your request I'd like to see better tracing support for GRPC error codes. As it currently stands, traces for GRPC calls returning errors are not marked as erroneous. This could be because the HTTP status for the response is always 200 (irrespective of the specific GRPC error code returned) with Golang's GRPC server implementation. Specifically, I'd like to see the specific GRPC status code and error message in the trace metadata, and for erroneous traces to be marked as erroneous.

Which integration(s) is this request for? App Mesh and X-Ray (or any other tracing backend)

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I'd like to improve the observability of the GRPC calls between our microservices. Currently, App Mesh's exported traces provide good data on latency and retries on a per-request basis, but without error codes and error messages being marked in traces for which GRPC errors were returned, it's difficult to troubleshoot occasional issues. If this feature was added, it'd give us a standardized solution to the issue of RPC observability.

Are you currently working around this issue? Yes. I'm using a combination of Cloudwatch metrics, Envoy statistics, and logs for monitoring.

buzzsurfr commented 4 years ago

@hbaldwi : Thanks for submitting this request.

I'm very interested in improving gRPC usage where we can, and want to track down whether this is something we can change from App Mesh/Envoy or from X-Ray. Can you share some of the traces and/or screenshots that show the trace data from the X-Ray console and highlight the errors?

buzzsurfr commented 4 years ago

Also, I'm not sure what version of Envoy this was added in, but the Envoy docs list that it provides the following data in each trace:

  • GRPC response status and message (if available).
  • An error tag when HTTP status is 5xx or GRPC status is not “OK”.

What version of Envoy are you using for the above? The latest is v1.15.0.

dastbe commented 4 years ago

Looking at it we'd need to find a more purposeful place for grpc output than where they should be going currently. However, since x-ray is both the owner of their schema and their UX we need to work with them to figure out how to update this.

hbaldwi commented 4 years ago

@buzzsurfr I'm using Envoy 1.15.0 for both my virtual gateway and the service's sidecar proxy, but I'm not seeing any indications of the GRPC response status or message in the trace metadata.

Here's the data for a sample trace:

{
    "Duration": 0.002,
    "Id": "1-5f57e025-2397f12545ee4f4ab92136a7",
    "LimitExceeded": false,
    "Segments": [
        {
            "Document": {
                "id": "000b9a2be9bf5611",
                "name": "<removed>",
                "start_time": 1599594533.6091301,
                "trace_id": "1-5f57e025-2397f12545ee4f4ab92136a7",
                "end_time": 1599594533.6107478,
                "parent_id": "000b9a2be9baef37",
                "http": {
                    "request": {
                        "url": "<removed>",
                        "method": "POST",
                        "user_agent": "dart-grpc/2.0.0",
                        "client_ip": "127.0.0.1",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 0
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "<removed>",
                    "request_size": "45",
                    "error": "true",
                    "downstream_cluster": "-",
                    "node_id": "<removed>"
                }
            },
            "Id": "000b9a2be9bf5611"
        },
        {
            "Document": {
                "id": "000b9a2be9baef37",
                "name": "<removed>",
                "start_time": 1599594533.6088219,
                "trace_id": "1-5f57e025-2397f12545ee4f4ab92136a7",
                "end_time": 1599594533.6110733,
                "http": {
                    "request": {
                        "url": "<removed>",
                        "method": "POST",
                        "user_agent": "dart-grpc/2.0.0",
                        "client_ip": "10.229.1.150",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 0
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "<removed>",
                    "request_size": "45",
                    "error": "true",
                    "downstream_cluster": "-",
                    "node_id": "<removed>"
                }
            },
            "Id": "000b9a2be9baef37"
        },
        {
            "Document": {
                "id": "000a654c82917ce4",
                "name": "<removed>",
                "start_time": 1599594533.6094975,
                "trace_id": "1-5f57e025-2397f12545ee4f4ab92136a7",
                "end_time": 1599594533.610285,
                "parent_id": "000b9a2be9bf5611",
                "http": {
                    "request": {
                        "url": "<removed>",
                        "method": "POST",
                        "user_agent": "dart-grpc/2.0.0",
                        "client_ip": "10.229.130.70",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 0
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "<removed>",
                    "request_size": "45",
                    "error": "true",
                    "downstream_cluster": "-",
                    "node_id": "<removed>"
                }
            },
            "Id": "000a654c82917ce4"
        }
    ]
}

The first two hops are the virtual gateway, and the third hop is a GRPC server which returned the GRPC error "Internal". The traces do contain the annotation "error: True", but no faults/errors are marked in the trace, so the trace is not identified as erroneous by X-Ray.

Let me know if there's any other data I can provide that would be helpful.

dastbe commented 4 years ago

Thanks for the example,

From our side, we need to take a look at if we're emitting the grpc tags as would occur here in Envoy

kiranmeduri commented 4 years ago

Based on xray tracer in envoy it seems like grpc tags should be put under annotations. However, it seems like annotations only allow names without . character. So there needs to be some sanitization logic there.

Segment where xray silently drops annotations

aws xray put-trace-segments --trace-segment-documents '{"id":"0000001a0a8aed6f","name":"mesh\/howto-grpc-mesh\/virtualNode\/color_client","start_time":1599867960.790747,"trace_id":"1-5f5c0c38-2c67246d1cbe4e8f81c21c81","end_time":1599867960.7929883,"http":{"request":{"url":"http:\/\/color_server.howto-grpc.local:8080\/color.ColorService\/GetColor","method":"POST","user_agent":"grpc-go\/1.24.0","client_ip":"10.0.121.73","x_forwarded_for":false},"response":{"status":200,"content_length":5}},"annotations":{"response_flags":"-","component":"proxy","upstream_cluster":"cds_egress_howto-grpc-mesh_color_server_grpc_8080","downstream_cluster":"-","request_size":"5","node_id":"mesh\/howto-grpc-mesh\/virtualNode\/color_client","grpc.status_code":"2"}}'
{
    "UnprocessedTraceSegments": []
}

>> Result
...
        {
            "Document": {
                "id": "0000001a0a8aed6f",
                "name": "mesh/howto-grpc-mesh/virtualNode/color_client",
                "start_time": 1599867960.790747,
                "trace_id": "1-5f5c0c38-2c67246d1cbe4e8f81c21c81",
                "end_time": 1599867960.7929883,
                "http": {
                    "request": {
                        "url": "http://color_server.howto-grpc.local:8080/color.ColorService/GetColor",
                        "method": "POST",
                        "user_agent": "grpc-go/1.24.0",
                        "client_ip": "10.0.121.73",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 5
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "upstream_cluster": "cds_egress_howto-grpc-mesh_color_server_grpc_8080",
                    "request_size": "5",
                    "downstream_cluster": "-",
                    "node_id": "mesh/howto-grpc-mesh/virtualNode/color_client"
                }
            },
            "Id": "0000001a0a8aed6f"
        },
...

with sanitized grpc.status_code -> grpc_status_code

aws xray put-trace-segments --trace-segment-documents '{"id":"0000001a0a8aed6a","name":"mesh\/howto-grpc-mesh\/virtualNode\/color_client","start_time":1599867960.790747,"trace_id":"1-5f5c0c38-2c67246d1cbe4e8f81c21c81","end_time":1599867960.7929883,"http":{"request":{"url":"http:\/\/color_server.howto-grpc.local:8080\/color.ColorService\/GetColor","method":"POST","user_agent":"grpc-go\/1.24.0","client_ip":"10.0.121.73","x_forwarded_for":false},"response":{"status":200,"content_length":5}},"annotations":{"response_flags":"-","component":"proxy","upstream_cluster":"cds_egress_howto-grpc-mesh_color_server_grpc_8080","downstream_cluster":"-","request_size":"5","node_id":"mesh\/howto-grpc-mesh\/virtualNode\/color_client","grpc_status_code":"2"}}'
{
    "UnprocessedTraceSegments": []
}

>> Result
...
        {
            "Document": {
                "id": "0000001a0a8aed6a",
                "name": "mesh/howto-grpc-mesh/virtualNode/color_client",
                "start_time": 1599867960.790747,
                "trace_id": "1-5f5c0c38-2c67246d1cbe4e8f81c21c81",
                "end_time": 1599867960.7929883,
                "http": {
                    "request": {
                        "url": "http://color_server.howto-grpc.local:8080/color.ColorService/GetColor",
                        "method": "POST",
                        "user_agent": "grpc-go/1.24.0",
                        "client_ip": "10.0.121.73",
                        "x_forwarded_for": false
                    },
                    "response": {
                        "status": 200,
                        "content_length": 5
                    }
                },
                "annotations": {
                    "response_flags": "-",
                    "component": "proxy",
                    "grpc_status_code": "2",
                    "upstream_cluster": "cds_egress_howto-grpc-mesh_color_server_grpc_8080",
                    "request_size": "5",
                    "downstream_cluster": "-",
                    "node_id": "mesh/howto-grpc-mesh/virtualNode/color_client"
                }
            },
            "Id": "0000001a0a8aed6a"
        },
...
kiranmeduri commented 4 years ago

However, I believe grpc should be given some top-level consideration just like http. Possibly by defining struct under metadata.grpc?

kiranmeduri commented 4 years ago

AWS XRay segment document definition mentions

Keys must be alphanumeric in order to work with filters. Underscore is allowed. Other symbols and whitespace are not allowed.

lavignes commented 4 years ago

Thanks for investigating this @kiranmeduri :)