aesiniath / unbeliever

Opinionated Haskell Interoperability
https://hackage.haskell.org/package/unbeliever
MIT License

Retry sending more aggressively #189

Closed istathar closed 1 year ago

istathar commented 1 year ago

If Core.Telemetry.Honeycomb encounters an HTTP 500 response from Honeycomb it needs to retry (possibly indefinitely) sending the spans it has queued up. Otherwise a transient error on their side may result in lost telemetry data.
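Roughly, the distinction we want is between responses where retrying can help and ones where it cannot. A sketch only — these names are illustrative, not the actual Core.Telemetry.Honeycomb code:

```haskell
-- Illustrative classification of Honeycomb's batch response; not the
-- module's real types.
data Outcome
    = Accepted    -- 2xx: spans were ingested, drop them from the queue
    | RetryLater  -- 5xx: transient failure on their side, keep the batch queued
    | GiveUp      -- anything else (4xx): retrying the same request will not help

classifyStatus :: Int -> Outcome
classifyStatus status
    | status >= 200 && status < 300 = Accepted
    | status >= 500 && status < 600 = RetryLater
    | otherwise                     = GiveUp
```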

dmvianna commented 1 year ago

You would also want to record any information that might help figure out whether the problem is within your own network. Otherwise you will indeed retry forever.

istathar commented 1 year ago

In this case we're getting 500 errors (which I've never seen before) because Honeycomb is having an incident on their ingestion pipeline.

The output in the process's logs was:

internal: Unexpected status returned;
[{"error":"something went wrong!","status":500},{"error":"something went wrong!","status":500},{"error":"something went wrong!","status":500}]

which is our code outputting the error response (for lack of anything better to do). I'm happy to have it retry in such a situation.

istathar commented 1 year ago

This message comes from:

https://github.com/aesiniath/unbeliever/blob/f77d2c133cb8280f25f2fd527b5c6eb7c10207ae/core-telemetry/lib/Core/Telemetry/Honeycomb.hs#L406-L408

which is clearly rubbish, but you're right @dmvianna, we need to exercise at least a little finesse about how we choose to retry here.

We could throw an exception which would cause the "retry exactly once" loop to run, but in a situation where the service is degraded we would still lose data.

dmvianna commented 1 year ago

I like to have exponential backoff in these cases. You really don't want to contribute to the mass of zombie processes retrying all at once when Honeycomb (or any other service) has a hiccup.

dmvianna commented 1 year ago

So, say: retry immediately, then in t ms, then 2t, then 4t...
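Something along these lines, as a sketch only — `sendOnce` here is a stand-in for whatever actually POSTs the queued spans, not anything that exists in the library today:

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (unless)

-- Sketch of the schedule above: first retry is immediate, then wait t,
-- 2t, 4t, ... microseconds between attempts, capped so the sleeps do not
-- grow without bound. `sendOnce` should return True once Honeycomb
-- accepts the batch.
retryWithBackoff :: Int -> Int -> IO Bool -> IO ()
retryWithBackoff baseDelay maxDelay sendOnce = go 0
  where
    go delay = do
        ok <- sendOnce
        unless ok $ do
            threadDelay delay
            go (if delay == 0 then baseDelay else min maxDelay (delay * 2))
```

So `retryWithBackoff 1000000 60000000 sendOnce` would retry immediately, then after 1 s, 2 s, 4 s, and so on, up to a minute between attempts. Adding a little random jitter to each sleep would further stop every process waking up at the same instant.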

istathar commented 1 year ago

Our experience of exponential backoff in the queues at a previous job we worked at together was ... less than ideal. I'm not a huge fan now as a result. Can it just wait a few seconds and retry?
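i.e. something as simple as this (again only a sketch, with `sendOnce` standing in for whatever does the actual POST):

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (unless)

-- Simplest possible policy: a fixed pause between attempts, retrying
-- until the batch is accepted. Easier to reason about than backoff, at
-- the cost of polling at a constant rate for the length of an outage.
retryEvery :: Int -> IO Bool -> IO ()
retryEvery pauseMicros sendOnce = do
    ok <- sendOnce
    unless ok $ do
        threadDelay pauseMicros
        retryEvery pauseMicros sendOnce
```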