OpenFn / kit

The bits & pieces that make OpenFn work. (diagrammer, cli, compiler, runtime, runtime manager, logger, etc.)

Epic: Worker Monitoring #608

Open josephjclark opened 7 months ago

josephjclark commented 7 months ago

An epic issue to have oversight over monitoring on the worker.

The high-level brief: we need better visibility into what's going on inside the worker, especially when things go wrong.

We should consider metrics tracking, Sentry reporting, email notifications, Grafana, etc.

Related:

#603

#402

Things we want

We need to figure out the best approach for integrating this with Prometheus. Do we expose an aggregate HTTP service (or use Lightning for that) which collects the metrics from all workers?

We probably don't want to rely on service discovery for monitoring, do we? That said, there is an advantage to each worker exposing its own /metrics server: it makes the worker better for everyone.
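To make the "each worker exposes its own /metrics server" option concrete, here is a minimal sketch using only Node's built-in http module and the Prometheus text exposition format. It is not how the worker is implemented today: `MetricsRegistry`, `serveMetrics`, and the metric names are all hypothetical, and a real implementation might well use an existing Prometheus client library instead.

```typescript
import { createServer, Server } from "node:http";

type MetricType = "counter" | "gauge";

interface Metric {
  type: MetricType;
  help: string;
  value: number;
}

// Hypothetical in-memory registry; not part of the existing worker code.
class MetricsRegistry {
  private metrics = new Map<string, Metric>();

  // Declare a metric with an initial value of 0.
  register(name: string, type: MetricType, help: string): void {
    this.metrics.set(name, { type, help, value: 0 });
  }

  // Increment a metric (typically a counter) by a positive amount.
  inc(name: string, by = 1): void {
    this.lookup(name).value += by;
  }

  // Set a gauge to an absolute value.
  set(name: string, value: number): void {
    const metric = this.lookup(name);
    if (metric.type !== "gauge") throw new Error(`${name} is not a gauge`);
    metric.value = value;
  }

  // Render all metrics in the Prometheus text exposition format.
  render(): string {
    const lines: string[] = [];
    for (const [name, metric] of this.metrics) {
      lines.push(`# HELP ${name} ${metric.help}`);
      lines.push(`# TYPE ${name} ${metric.type}`);
      lines.push(`${name} ${metric.value}`);
    }
    return lines.join("\n") + "\n";
  }

  private lookup(name: string): Metric {
    const metric = this.metrics.get(name);
    if (!metric) throw new Error(`unknown metric: ${name}`);
    return metric;
  }
}

// Serve GET /metrics on the given port; everything else returns 404.
function serveMetrics(registry: MetricsRegistry, port: number): Server {
  const server = createServer((req, res) => {
    if (req.method === "GET" && req.url === "/metrics") {
      res.writeHead(200, {
        "Content-Type": "text/plain; version=0.0.4; charset=utf-8",
      });
      res.end(registry.render());
    } else {
      res.writeHead(404);
      res.end();
    }
  });
  return server.listen(port);
}

// Example wiring (metric names are illustrative only).
const registry = new MetricsRegistry();
registry.register("worker_runs_started_total", "counter", "Runs started by this worker");
registry.register("worker_active_runs", "gauge", "Runs currently executing");
// serveMetrics(registry, 9464); // uncomment to start listening
```

With something like this, Prometheus could scrape each worker directly, and an aggregate service (or Lightning) would only be needed if the workers are not individually reachable.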

josephjclark commented 1 month ago

This keeps coming up so I think we want to spend some time on it.

I think there are two separate but related big issues right now:

1. Benchmarking: local tests of worker performance. We want to better understand our current performance and how it scales. This also lets us verify that future improvements are actually helping.
2. Transparency: we need to better understand what the worker is doing in live environments. Does this mean more eventing? More logging? Can we have a live dashboard? Can we output performance metrics?
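For the benchmarking side, a local harness could start out as small as the sketch below. It times an arbitrary async task at a fixed concurrency and reports the mean and p95 duration. The `benchmark` and `percentile` helpers are hypothetical, and the task passed in is a stand-in for whatever "execute one run on the worker" ends up meaning.

```typescript
import { performance } from "node:perf_hooks";

interface BenchmarkResult {
  runs: number;
  concurrency: number;
  totalMs: number; // wall-clock time for the whole batch
  meanMs: number;  // mean duration of a single task
  p95Ms: number;   // 95th percentile duration of a single task
}

// Nearest-rank percentile over an array sorted in ascending order.
function percentile(sorted: number[], p: number): number {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

// Execute `task` `runs` times with at most `concurrency` in flight at once.
async function benchmark(
  task: () => Promise<unknown>,
  runs: number,
  concurrency: number
): Promise<BenchmarkResult> {
  const durations: number[] = [];
  let claimed = 0;
  const started = performance.now();

  // Each lane claims the next run until none are left.
  const lane = async (): Promise<void> => {
    while (claimed < runs) {
      claimed++;
      const t0 = performance.now();
      await task();
      durations.push(performance.now() - t0);
    }
  };

  const lanes = Array.from({ length: Math.min(concurrency, runs) }, () => lane());
  await Promise.all(lanes);

  const totalMs = performance.now() - started;
  const sorted = [...durations].sort((a, b) => a - b);
  const sum = sorted.reduce((acc, d) => acc + d, 0);
  return {
    runs: durations.length,
    concurrency,
    totalMs,
    meanMs: sum / sorted.length,
    p95Ms: percentile(sorted, 95),
  };
}
```

Running the same task at concurrency 1, 2, 4, 8 and comparing `totalMs` would give a first picture of how throughput scales, and keeping the results around gives a baseline to check future improvements against.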

Some quick thoughts about possible performance bottlenecks: