dapr / python-sdk

Dapr SDK for Python
Apache License 2.0
222 stars 125 forks source link

Retry and Timeout policies for grpc and http #679

Closed elena-kolevska closed 2 months ago

elena-kolevska commented 6 months ago

Description

This PR adds the possibility for users to configure Retry and Timeout policies for both grpc and http based clients. The retry is configurable through the DAPR_API_MAX_RETRIES environment variable and is picked up automatically for both the grpc and http client, not requiring any code changes. I left the possibility for a more fine-grained control over the retry mechanism (the client can receive a RetryPolicy object), but I'm not sure if that's something people would need. Let's discuss it.

Issue reference

https://github.com/dapr/python-sdk/issues/676

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 89.59538% with 18 lines in your changes missing coverage. Please review.

Project coverage is 86.15%. Comparing base (fc0e9d1) to head (63548df). Report is 28 commits behind head on main.

Files Patch % Lines
dapr/clients/retry.py 82.50% 14 Missing :warning:
dapr/conf/__init__.py 50.00% 2 Missing :warning:
dapr/actor/client/proxy.py 50.00% 1 Missing :warning:
dapr/aio/clients/grpc/client.py 90.90% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #679 +/- ## ========================================== - Coverage 86.37% 86.15% -0.22% ========================================== Files 79 81 +2 Lines 4094 4282 +188 ========================================== + Hits 3536 3689 +153 - Misses 558 593 +35 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

elena-kolevska commented 6 months ago

The lint tests are currently failing because of a change introduced in Ruff 0.3 which is incompatible with our flake8 set-up. We should solve this issue in a separate PR.

berndverst commented 4 months ago

Is this PR ready @elena-kolevska ?

I kept the ruff version at 0.2.2 for now. Lint and Ruff are passing with that.

elena-kolevska commented 4 months ago

It's almost ready, I just wanted to confirm a few things with you:

berndverst commented 3 months ago

It's almost ready, I just wanted to confirm a few things with you:

  • Right now the code will retry if the env variable is set, with no additional code changes. I also made it possible for users to tune other retry-related parameters (like initial_backoff, max_backoff, retryable_http_status_codes for ex.). Do you think the need ffor it justifies the added code complexity, and heavier constructors?
  • For now, I only added the retry wrapper to one method, the save_state one. Once we confirm this is the right approach, I'll update all methods in the class.

I think that's probably fine, but why wouldn't we instead add this as a decorator on the save_state method instead? Would that be possible? Right now you are not actually using a decorator but instead a method that takes another method as the argument (which is very similar to how decorators work). I just find decorators a lot cleaner. Keep in mind that you can always have the decorator intercept arbitrary **kwargs and then receive whatever properties you need to tweak retry behavior, without having to modify the method signature of the function.

  • When I first started work on this PR I added a timeout interceptor that would be applied to all calls. Looking at it now, I think it makes more sense to be able to specify a timeout per method. Some things are expected to take longer than others, and control should be more granular. What do you think?

How do you plan for users to be able to tweak the individual method timeouts? As optional arguments during each of the methods? That would work if you use decorators, because decorators can intercept arguments.

We should not add any new parameters to methods or the constructor - everything should be optional to not cause a breaking change and we should choose either sane defaults or retain the existing behavior, but then document how this can be set.

Let's also make sure using an env variable will not be the only way to control this behavior by the way.

elena-kolevska commented 3 months ago

Yeah, decorators were my first try, and I need to go back and remember why we decided it wasn't the best option. I'll get back to you on it. In the meantime, on the topic of timeouts: looks like we already accept a timeout parameter for the invoke_method function, which is where users would usually need to tweak per call. We could still allow users to control their global timeout for all other calls (get state, save state...) because those are going to be dependant on their network. So I think we can leave the code as it is.

elena-kolevska commented 3 months ago

I remembered why we're not using decorators: when we retry we would have to call again the whole function that prepares the request, instead of just the call, so it's better to keep the code more performant and use a single function.

I also went ahead and updated all the calls and marked the PR as ready to review 🙏