TBD54566975 / ftl

FTL - Towards a 𝝺-calculus for large-scale systems
https://tbd54566975.github.io/ftl/
Apache License 2.0

telemetry data is being lost #3018

Open mistermoe opened 4 days ago

mistermoe commented 4 days ago

Repro

  1. Run just otel-stream
  2. Run just otel-dev
  3. Send 2-3 calls to echo.Echo
  4. Press ctrl+c before the flush interval (5s) elapses
  5. Notice that the metrics for those calls are missing

Potential Fix

AFAICT, Shutdown isn't being called for the otel exporters or any of the providers in the observability client. This can lead to collected telemetry data being dropped, because it never gets flushed before the process shuts down.

Each otel provider and exporter has a Shutdown method (e.g. see the docs for the metrics exporter's Shutdown). We just need to call Shutdown on all of them. It might make sense to surface an observability.Shutdown function that takes care of calling Shutdown on all of the underlying exporters and providers.

I'm guessing we have to call Shutdown on all of them rather than just the exporter because, from what I can tell, the providers flush to the exporter's internal cache and the exporter flushes to wherever it has been configured to export to.

wesbillman commented 3 days ago

This makes me think it might be nice to have some Shutdown functions in controller and runner so that we can clean up and shut down gracefully. I'm guessing these could be implemented as counterparts to the Start functions we have for each now. @alecthomas is there a typical way this should be handled?

safeer commented 3 days ago

Generalize to context cancellation on shutdown. Use the waitgroup in the controller?