aws / aws-app-mesh-roadmap

AWS App Mesh is a service mesh that you can use with your microservices to manage service to service communication
Apache License 2.0
347 stars 25 forks source link

Feature Request: Support separate HTTP Codes in RetryPolicy #451

Open mkielar opened 1 year ago

mkielar commented 1 year ago

If you want to see App Mesh implement this idea, please upvote with a :+1:.

Tell us about your request When https://github.com/aws/aws-app-mesh-roadmap/issues/7 was first published, there were some bold statements about supporting configuration of retryPolicy where one could specify HTTP Codes separately by passing values like 4xx, 5xx and so on, alongside gateway-error and others. See this excerpt from https://github.com/aws/aws-app-mesh-roadmap/issues/7:

        ...
        "http": [
            "server-error",          // retriable_status_codes: [ "500", "501", "505", "506", "507", 508", "509", "510", "511" ]
            "gateway-error",         // retriable_status_codes: [ "502", "503, "504" ]
            "client-error" ,         // retriable_status_codes: [ "409" ]
            "stream-error",          // retry_on: refused-stream (h2)
            "<1xx|2xx|3xx|4xx|5xx>", // retriable_status_codes: "xx" is expanded to valid IANA HTTP codes
        ],
        ...

and a quote, which sounds like somebody was really eager to implement this:

Most interesting event in the schema is the HTTP code expansion field. If field such as “1xx” is added to the list of http events to retry on, “xx” will be expanded to a full list of IANA supported HTTP codes.

However, in the final release none of this seems to be available, and the attempt to pass values like 5xx or even explicit 502 results in below error (this is a terraform error, but it looks like a fault returned by AWS API):

Error: error updating App Mesh route: BadRequestException: HttpEvent must be one of [server-error,gateway-error,client-error,stream-error,reset].

This is a feature request, to:

  1. Implement the feature as it was designed in https://github.com/aws/aws-app-mesh-roadmap/issues/7
  2. More importantly
    • implement the possibility to pass separate HTTP Codes
    • implement some "special" indicator to explicitly allow retry-on-timeout (e.g. timeout).

Which integration(s) is this request for? We're using AppMesh with ECS Fargate Services, but this is not relevant. The configuration should be available for any kind of service.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? We are wrapping several legacy, 3rd party applications with weird and obsolete APIs with AppMesh. We're provisioning Virtual Nodes, Services, Routers, and Router Routes for them. In Router Routes we set up different retry/timeout configurations for our internal clients (AppMesh Services) to "talk" to these 3rd parties. We have multiple Routes provisioned for each 3rd party, matching different endpoints by HTTP Method and Path Prefix and applying different retry / timeout rules. This works pretty well.

However, these apps do not really conform to REST standard, and sometimes they respond with 5xx or 4xx errors. after actually performing requests and modifying data internally in a way that will cause failure if the request is retried (please, don't ask)...

What we'd like to do, is to configure our Virtual Router Routes in a way, that would allow Envoys to perform retry on timeouts, and 504 errors from upstream, but not on 502 and 503, because they're "in use".

Are you currently working around this issue? We currently only use stream-error for http retry events, and that disables retry-on-timeout. We then implement retry mechanisms in code, which completely contradicts the idea of using AppMesh.

mkielar commented 1 year ago

See:

Looks like Envoy already supports both cases:

  1. The xx expansion is available by specifying values like 5xx or retriable-4xx.
  2. The selective configuration should be doable by using retriable-status-codes and a comma-separated list of HTTP Status Codes.

When looking at Envoy, it feels like a decision was made for AppMesh to provide unnecessary abstraction of the feature, preventing users from utilizing full functionality. I'm a bit confused trying to understand that decision.