beam-telemetry / telemetry

Dynamic dispatching library for metrics and instrumentations.
https://hexdocs.pm/telemetry
Apache License 2.0
872 stars 66 forks source link

Emit telemetry failure event on handler failure #98

Closed bryannaegele closed 3 years ago

bryannaegele commented 3 years ago

We've discussed the need for a mechanism by which users could react or monitor a handler failure which subsequently results in the handler being detached. Currently handler detachments are somewhat silent in that we only log an error message. For some handlers and use cases, this could be sufficient. However, if a user is highly dependent on an event for monitoring a key metric they likely need to be alerted of this failure through other monitoring infrastructure, e.g. metric events.

For example, a user is monitoring the size of open orders in their database using poller to emit periodic events with this number and they report this out using a gauge. If the handler were to fail and detach, the gauge would be "stuck" at the last reported value with no indication in monitoring that something is amiss. The only way the user could be alerted that this failed is to be using a logging service and setting an alert looking for patterns of handlers failing. This is particularly concerning with reliance on telemetry for tracing.

By emitting a failure event we are able to give users a mechanism by which they could monitor for failures using the infrastructure they are already using for monitoring/alerting. This also allows for potential error recovery attempts by advanced library authors to attempt some recovery strategy before abandoning completely, though this is strongly cautioned against.

arkgil commented 3 years ago

@bryannaegele it might be a stupid question, but should we detach handlers of the failure event? If the failure handler fails, then we're back in the situation where only the log is emitted.

bryannaegele commented 3 years ago

@arkgil They have to subscribe to the failure event, so hopefully they're more careful there πŸ˜† But it's a good question. I think the original motivation of detaching failed handlers because they'll likely fail again should probably still apply here. What do you think?

The main motivation here is to give users/library authors a secondary way to monitor handlers failing without having to only do it in logging solutions like Datadog where they also have no way to potentially handle that failure in-app.

josevalim commented 3 years ago

Yeah, I think we should detach them otherwise they can go into an infinite loop of failure too.

arkgil commented 3 years ago

That makes sense. I like it, a nice addition! πŸ‘

bryannaegele commented 3 years ago

I think we're good?

hkrutzer commented 2 years ago

Could a new version be released with this change? πŸ™

josevalim commented 2 years ago

We need to at least wait for #102 to be addressed.

hkrutzer commented 2 years ago

@josevalim cool, looks like that one is working πŸ™‚