conbench / conbench

Language-independent Continuous Benchmarking (CB) Framework
https://conbench.github.io/conbench/
MIT License

client HTTP tooling: plan for retryable errors, set a specific goal: how long to retry before accepting data loss (robustness in view of micro outages, maintenance, etc.) #800

Open jgehrcke opened 1 year ago

jgehrcke commented 1 year ago

When an ephemeral CI system generates benchmark results and then uses client tooling that we build to push the data into a system designed for analysis and long-term storage (the Conbench API), we should make a deliberate choice about how hard that client tooling tries to get the data there before giving up.

Giving up probably means data loss. We want to have a stance on that.

Let's set goals. We may want to retry (with considerable back-off) for up to O(1 hour); that is at least what I would like us to consider for certain CI-powered use cases.

In https://github.com/conbench/conbench/pull/788#discussion_r1122081341 I commented:

We should have a pretty clear idea about how long recommended client tooling (tooling that we build) attempts to retry, before it drops data and moves on. In similar systems I have always advocated for and also built an explicit, tested design choice: "we want to survive a micro outage / maintenance window of N minutes length; that is our promise". Such a deliberate goal makes it easy to develop/test, and it also makes it easy to communicate guarantees to users.

This ticket relates to a pattern I have observed over the years. In distributed systems, things often fail in a user-facing way (and lead to a frustrating user experience) because developers didn't carefully handle expected, retryable errors. For me, it's the number one source of bugs I have seen in the space of distributed systems. Distributed systems are inherently prone to transient problems at the RPC interface, and yet individual HTTP calls are often built as if they were function calls that never leave machine boundaries.

A key ingredient to achieving a more or less pleasant distributed system with guarantees is to have specific goals in mind from the start, aiming for deliberate (non-ignorant) error handling. It helps a lot to handle all expected code paths of how an HTTP request/response cycle can fail, making a clear distinction between retryable and non-retryable errors. That distinction is sometimes obvious, but often it requires application-specific decision-making and goal-setting. Certainly, the API implementation needs to do a decent job of communicating errors to clients, which in turn requires good internal error handling.

When retrying on the client side, the questions are "how long?" and "how often?". Data locality, general context, the value of the ephemeral data, and deliberate goal-setting for the overall user experience help answer these questions.
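To illustrate (this is a sketch, not Conbench code): a deadline-bound retry loop that distinguishes retryable from non-retryable outcomes could look roughly like the following. The status-code classification, the 30-minute deadline, and the post_with_retries name are all assumptions to be decided per use case.

import time
import requests

# Assumptions for illustration: which responses count as retryable, and how long to keep trying.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}
DEADLINE_SECONDS = 30 * 60  # "survive a micro outage / maintenance window of ~30 minutes"

def post_with_retries(url: str, payload: dict) -> requests.Response:
    """POST payload, retrying retryable failures until the deadline is exhausted."""
    deadline = time.monotonic() + DEADLINE_SECONDS
    wait = 1.0
    while True:
        try:
            resp = requests.post(url, json=payload, timeout=(3.5, 3.5))
            if resp.status_code not in RETRYABLE_STATUS_CODES:
                # Success, or a non-retryable error (e.g. 400): return and let the caller decide.
                return resp
        except requests.exceptions.RequestException:
            # Connection errors and timeouts: treat as retryable.
            pass
        if time.monotonic() + wait > deadline:
            raise RuntimeError("retry deadline exhausted, accepting data loss")
        time.sleep(wait)
        wait = min(wait * 2, 60.0)  # exponential back-off, capped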

In the client tooling we build, we should make a deliberate choice.

This goes hand-in-hand with making good choices for the involved timeout criteria and timeout constants: https://github.com/conbench/conbench/issues/801.

I want to add: CI tooling is special with respect to retrying because one does not need to optimize so much for latency (bringing feedback to a human quickly), but can instead trade time budget for robustness.

We as engineers appreciate when CI jobs complete quickly, but we appreciate it even more when they complete successfully. Then we have less debugging to do and can generally work more productively.

austin3dickey commented 1 year ago

A great first step is

retry (with considerable back-off) for up to O(1 hour)

I think we should prioritize this soon, based on recent internal CI runs in the last couple of weeks. In all failure cases, the errors were retryable and the failures may have been prevented if we had just kept trying for longer than 1 minute.

alistaire47 commented 1 year ago

The flip side of O(1 hour) of back-off is that for a more significant outage that we can't recover from, runs with a lot of results (say O(1000)) will take a looong time to fail. That's potentially fine (it's not a situation we want to happen anyway), but important to bear in mind if those runs are on expensive cloud hardware or on dedicated hardware with a busy queue.
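Back-of-the-envelope on that, assuming results are posted sequentially and (hypothetically) the client burns the full retry budget per result instead of aborting the run after the first exhausted deadline:

results = 1000           # "O(1000)" results in one run
budget_hours = 1         # "O(1 hour)" retry budget per result
print(results * budget_hours)  # -> 1000 hours in the worst case if the client never aborts
# If the client aborts the whole run once a single result exhausts its deadline,
# the worst case shrinks to ~1 hour.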

austin3dickey commented 1 year ago

runs with a lot of results (say O(1000)) will take a looong time to fail

Do you know of a system that doesn't fail immediately if one result fails to post? I can't think of one off the top of my head but could be missing something.

alistaire47 commented 1 year ago

benchadapt does have some gnarly try/except for timeouts on posting, but it won't progress to the next result. Mostly you're right, as far as I can tell, but given the recent turbulence and therefore the likelihood that users will add more safeguards, maybe we need to document that users shouldn't ignore failed results. The more obvious reason to document that is the possibility that a pre-computed result produces a malformed payload, but I think our tooling is mostly pretty good at preventing that.

austin3dickey commented 1 year ago

The more obvious reason to do so is when there's a possibility a pre-computed result may result in a malformed payload

Yes, I think you're talking about "non-retryable errors" in the context of the original comment above. Those cases should be covered by different status codes than https://github.com/conbench/conbench/blob/e1cab9b8ebffaba1c85e454b1b13de7f14bfcf02/benchclients/python/benchclients/base.py#L35

jgehrcke commented 1 year ago

In all failure cases, they were retryable errors that may have been prevented if we just kept trying for longer than 1 minute.

Yes!

(say O(1000)) will take a looong time to fail

Right, that's one downside. But a higher-level / more global timeout constraint typically addresses this. It's all a trade-off, for sure.

We made an important step towards the goal expressed in this ticket: we reworked the inner workings of ConbenchClient to do explicit retrying in https://github.com/conbench/conbench/pull/1148.

It's currently configured to retry an HTTP request for roughly 30 minutes: https://github.com/conbench/conbench/blob/20c0950e44035e81ca750509f907c01e78d9fce9/benchclients/python/benchclients/conbench.py#L44

jgehrcke commented 1 year ago

I would like to share a log excerpt that nicely shows the good work that #1148 / RetryingHTTPClient is doing for us behind the scenes, all the time:

230502-09:51:56.925 INFO: try: POST to https://cb.piffel.poffel/api/benchmarks/
230502-09:52:00.428 INFO: error during request/response cycle (treat as retryable, retry soon): HTTPSConnectionPool(host='cb.piffel.poffel', port=443): Read timed out. (read timeout=3.5)
230502-09:52:00.428 INFO: cycle 1 failed, wait for 0.7 s, deadline in 29.9 min
230502-09:52:04.735 INFO: error during request/response cycle (treat as retryable, retry soon): HTTPSConnectionPool(host='cb.piffel.poffel', port=443): Read timed out. (read timeout=3.5)
230502-09:52:04.735 INFO: cycle 2 failed, wait for 1.3 s, deadline in 29.9 min
230502-09:52:09.739 INFO: error during request/response cycle (treat as retryable, retry soon): HTTPSConnectionPool(host='cb.piffel.poffel', port=443): Read timed out. (read timeout=3.5)
230502-09:52:09.739 INFO: cycle 3 failed, wait for 2.7 s, deadline in 29.8 min
230502-09:52:16.079 INFO: error during request/response cycle (treat as retryable, retry soon): HTTPSConnectionPool(host='cb.piffel.poffel', port=443): Read timed out. (read timeout=3.5)
230502-09:52:16.079 INFO: cycle 4 failed, wait for 5.3 s, deadline in 29.7 min
230502-09:52:25.055 INFO: error during request/response cycle (treat as retryable, retry soon): HTTPSConnectionPool(host='cb.piffel.poffel', port=443): Read timed out. (read timeout=3.5)
230502-09:52:25.055 INFO: cycle 5 failed, wait for 10.7 s, deadline in 29.5 min
230502-09:52:35.907 INFO: POST request to https://cb.piffel.poffel/api/benchmarks/: took 0.1759 s, response status code: 503
230502-09:52:35.907 INFO: unexpected response (code: 503, body bytes: <<html>
...
230502-09:52:35.907 INFO: cycle 6 failed, wait for 21.3 s, deadline in 29.4 min
230502-09:52:57.297 INFO: POST request to https://cb.piffel.poffel/api/benchmarks/: took 0.0385 s, response status code: 503
230502-09:52:57.298 INFO: unexpected response (code: 503, body bytes: <<html>
...
230502-09:52:57.298 INFO: cycle 7 failed, wait for 42.7 s, deadline in 29.0 min
230502-09:53:40.042 INFO: POST request to https://cb.piffel.poffel/api/benchmarks/: took 0.0386 s, response status code: 503
230502-09:53:40.042 INFO: unexpected response (code: 503, body bytes: <<html>
...
230502-09:53:40.042 INFO: cycle 8 failed, wait for 60.0 s, deadline in 28.3 min
230502-09:54:40.350 INFO: POST request to https://cb.piffel.poffel/api/benchmarks/: took 0.2513 s, response status code: 503
230502-09:54:40.350 INFO: unexpected response (code: 503, body bytes: <<html>
...
230502-09:54:40.350 INFO: cycle 9 failed, wait for 60.0 s, deadline in 27.3 min
230502-09:55:40.713 INFO: POST request to https://cb.piffel.poffel/api/benchmarks/: took 0.3057 s, response status code: 201

The micro outage took ~four minutes.

Across the retry attempts, two error symptoms were seen: at first, read timeouts during the request/response cycle; later, HTTP 503 responses.

We see that the back-off mechanism worked well. We see that after ~four minutes a 201 response came in after all. We see that the client would have kept retrying the same request for another ~27 minutes.
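As an aside, the wait times in this excerpt (0.7, 1.3, 2.7, 5.3, 10.7, 21.3, 42.7, 60.0, 60.0 seconds) look like exponential back-off starting around two thirds of a second, doubling each cycle, and capped at 60 seconds; this is just an observation about the log, not a statement about the exact implementation:

# Reproduce the back-off schedule printed in the log above (observed, approximate).
wait = 2 / 3  # seconds
for cycle in range(1, 10):
    print(f"cycle {cycle} failed, wait for {min(wait, 60.0):.1f} s")
    wait *= 2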