Proper retry, timeout, etc..

Deepomatic / deepomatic-client-python

Python Client

Other

14 stars 0 forks source link

Proper retry, timeout, etc.. #17

Closed thomas-riccardi closed 4 years ago

thomas-riccardi commented 6 years ago

Regarding retries and timeouts: We should have both sane default behavior and everything controllable if user wants to fine-tune.

We need:

retry with exponential backoff, maybe capped, with jitter
timeout: both on each network call, and globally on the retry: we don't want to wait indefinitely because of infinite retry.

thomas-riccardi commented 6 years ago

timeout on Task.wait() is not controllable when called from InferenceResource.inference()

vdel commented 6 years ago

You can already control the timeout by doing:

task = InferenceResource.inference(return_task=True)
task.wait(timeout=123)
data = task['data']

I wanted to avoid such behavior because I would like to have a synchronous API for inference at some point, which would return directly the data field. But maybe we should make the above behaviour the default for now to better reflect the API ?

thomas-riccardi commented 6 years ago

Reopening: we still need timeouts at the network level: http://docs.python-requests.org/en/master/user/advanced/#timeouts

maingoh commented 5 years ago

Currently retrying is mainly done using tenacity (quite flexible). Though maybe we should combine it with https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#urllib3.util.retry.Retry ? Not sure if requests does some optimization depending on the exception.

thomas-riccardi commented 5 years ago

https://github.com/App-vNext/Polly/wiki/Transient-fault-handling-and-proactive-resilience-engineering and https://github.com/App-vNext/Polly/wiki/Polly-and-HttpClientFactory describes more advanced patterns:

circuit breaker: when the service seems completely broken it is not only useless to retry in loop, but it also increase load even more (which may prevent the service to recover). In such case it's best to directly return an error. The service is then regularly tested to see if it came back up.
bulkhead isolation

thomas-riccardi commented 5 years ago

One higher level sub-issue is the inference() method currently fails as soon as a 502 error is received for any GET /task in the internal task.wait(): it should not: we just don't know the new state, let's continue trying to get an up-to-date state (until succes/error or global wait timeout).

maingoh commented 4 years ago

Let's close this issue as I think most things have been fixed in #58. Reopen if needed (or open a new issue)