You would also want to record any information that might help figure out whether the problem is within your own network. Otherwise you will indeed retry forever.
In this case we're getting 500 errors (which I've never seen before) because Honeycomb is having an incident on their ingestion pipeline.
The output in the process's logs was:
```
internal: Unexpected status returned;
[{"error":"something went wrong!","status":500},{"error":"something went wrong!","status":500},{"error":"something went wrong!","status":500}]
```
which is our code outputting the error response (for lack of anything better to do). I'm happy to have it retry in such a situation.
This message comes from:
which is clearly rubbish, but you're right @dmvianna, we need to show at least a little finesse in how we choose to retry here.
We could throw an exception which would cause the "retry exactly once" loop to run, but in a situation where the service is degraded we would still lose data.
I like to have exponential backoff in these cases. You really don't want to contribute to the mass of zombie processes all retrying at once when Honeycomb (or any other service) has a hiccup.
So, say, retry immediately, then after t ms, then t * 2, then t * 2 * 2, and so on.
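Roughly something like this, as a minimal sketch of that doubling schedule (the names are illustrative, not the actual Core.Telemetry.Honeycomb internals):

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (SomeException (..), try)

-- Retry an action after t microseconds, then 2t, then 4t, ... capped at
-- maxMicros. Catching SomeException is too broad for real code; it's just
-- to keep the sketch short.
retryWithBackoff :: Int -> Int -> IO a -> IO a
retryWithBackoff delayMicros maxMicros action = do
    outcome <- try action
    case outcome of
        Right value -> pure value
        Left (SomeException _) -> do
            -- wait, then try again with double the delay (up to the cap)
            threadDelay delayMicros
            retryWithBackoff (min (delayMicros * 2) maxMicros) maxMicros action
```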
Our experience with exponential backoff in the queues at a previous job we worked at together was ... less than ideal. I'm not a huge fan now as a result. Could it just wait a few seconds and retry?
If Core.Telemetry.Honeycomb encounters an HTTP 500 response from Honeycomb it needs to retry (possibly indefinitely) sending the spans it has queued up. Otherwise a transient error on their side may result in lost telemetry data.
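As a sketch of what that retry loop might look like, assuming a hypothetical postSpans function that performs the HTTP POST and returns the response status code (this is not the library's actual internals):

```haskell
import Control.Concurrent (threadDelay)

-- Keep resending the queued batch while the service answers 5xx, doubling
-- the pause between attempts so a degraded Honeycomb isn't hammered.
resendUntilAccepted :: (batch -> IO Int) -> batch -> IO ()
resendUntilAccepted postSpans batch = go 1000000        -- start at 1 second
  where
    go delayMicros = do
        status <- postSpans batch
        if status >= 500
            then do
                threadDelay delayMicros                 -- wait, then retry
                go (min (delayMicros * 2) 60000000)     -- cap the delay at 60 seconds
            else pure ()                                -- accepted (or a 4xx retrying won't fix)
```

Whether the pause is fixed or doubling, the important part is that a 5xx response keeps the batch queued rather than dropping it.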