dapr / proposals

Proposals for new features in Dapr
Apache License 2.0
15 stars 33 forks source link

Sdk resiliency #33

Closed artursouza closed 1 year ago

artursouza commented 1 year ago

Proposal for https://github.com/dapr/dapr/issues/6609

jjcollinge commented 1 year ago

Just sharing this RFC from upstream gRPC on built in retry policies as it is implement in some languages and shows what you can expect from the gRPC clients. https://github.com/grpc/proposal/blob/master/A6-client-retries.md#retryable-status-codes

Also, handling of 429 should ideally respect a Retry-After header or the user should be able to opt out of retrying on that status code.

If I read this correctly it's up to the SDK to decide on the retry strategy? If the goal is to provide consistency here - why not make this consistent or configurable?

artursouza commented 1 year ago

https://github.com/grpc/proposal/blob/master/A6-client-retries.md#retryable-status-codes

I tried using that in the Java SDK and did not perform the retry per request. It did not handle timeout retries if the server side hangs and client times out. I don't mind if this standard implementation works for other SDKs.

artursouza commented 1 year ago

Just sharing this RFC from upstream gRPC on built in retry policies as it is implement in some languages and shows what you can expect from the gRPC clients. https://github.com/grpc/proposal/blob/master/A6-client-retries.md#retryable-status-codes

Also, handling of 429 should ideally respect a Retry-After header or the user should be able to opt out of retrying on that status code.

If I read this correctly it's up to the SDK to decide on the retry strategy? If the goal is to provide consistency here - why not make this consistent or configurable?

Regarding 429, I think SDKs should respect the header. I will mention that. We can make this more configurable too, but I am concerned about it exploding into a huge resiliency spec, like in runtime.

Regarding the resiliency strategy, I did not add it here to reduce the scope of the discussion. We need something today and we can expand into more fine grained details later. My concern, again, is to have this to expand into a full resiliency spec like we have in runtime and this taking far too long to be agreed and implemented. We need something today since the experience is poor already if the API shows any connectivity issues.

artursouza commented 1 year ago

I have updated the proposal. Changes:

  1. Timeout defaults to "undefined", so it is not a breaking change.
  2. Add params per-request and per-client to overwrite the ENV variable when needed (including priority among them)
  3. Respect Retry-After header for 429

I have not detailed about how the retry is implemented. Maybe maintainers from SDK have also opinions about this.

msfussell commented 1 year ago

+1 Updated proposal looks good.

JoshVanL commented 1 year ago

+1

yaron2 commented 1 year ago

+1 binding

artursouza commented 1 year ago

+1 binding

shubham1172 commented 1 year ago

+1 binding

XavierGeerinck commented 1 year ago

+1! definitely helpful as it would resolve issues I am having with running on App Service :) (requirement sadly enough)

halspang commented 1 year ago

+1 binding

artursouza commented 1 year ago

-1 now means infinity for retries - same behavior as in runtime.

yaron2 commented 1 year ago

Proposal is approved.