Open josephjclark opened 7 months ago
This keeps coming up so I think we want to spend some time on it.
I think there are two separate but related big issues right now:
1) Benchmarking: local tests of worker performance. We want to better understand our current performance and how it scales. This also lets us verify that future improvements are helping.
2) Transparency: we need to better understand what the worker is doing in live environments. Does this mean more eventing? More logging? Can we have a live dashboard? Can we output performance metrics?
Some quick thoughts about possible performance bottlenecks:
An epic issue to have oversight over monitoring on the worker.
The high-level brief is: we need better visibility of what's going on inside the worker, especially when things go wrong.
We should consider metrics tracking, Sentry reporting, email notification, Grafana, etc.
Related:
603
402
Things we want
We need to figure out the best approach for integrating this into Prometheus. Do we expose an aggregate HTTP service (or use Lightning for that) that collects the metrics from all workers?
We probably don't want to use service discovery for monitoring, do we? There is an advantage to workers exposing their own `/metrics` server: it makes the worker better for everyone.
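To make the "each worker exposes its own `/metrics` server" option concrete, here is a minimal sketch using only Node's built-in `http` module. It is not the worker's actual implementation: the metric names (`worker_runs_completed_total`, `worker_active_runs`), the helper functions, and the port are all hypothetical, chosen just to show the shape of the Prometheus text format. In practice a client library such as `prom-client` would likely be a better fit than hand-rolling the output.

```typescript
import { createServer, Server } from "node:http";

// Hypothetical metric shape; the real set of worker metrics is still to be decided.
type Metric = {
  name: string;
  help: string;
  type: "counter" | "gauge";
  value: number;
};

// In-memory registry for this worker process
const metrics = new Map<string, Metric>();

// Set a metric to an absolute value (typical for gauges)
function setMetric(name: string, help: string, type: Metric["type"], value: number): void {
  metrics.set(name, { name, help, type, value });
}

// Add to a counter, creating it at zero if it does not exist yet
function increment(name: string, help: string, by: number = 1): void {
  const existing = metrics.get(name);
  const current = existing ? existing.value : 0;
  setMetric(name, help, "counter", current + by);
}

// Render the registry in the Prometheus text exposition format
function renderMetrics(): string {
  const lines: string[] = [];
  metrics.forEach((m) => {
    lines.push(`# HELP ${m.name} ${m.help}`);
    lines.push(`# TYPE ${m.name} ${m.type}`);
    lines.push(`${m.name} ${m.value}`);
  });
  return lines.join("\n") + "\n";
}

// Serve GET /metrics; everything else is a 404
function startMetricsServer(port: number): Server {
  const server = createServer((req, res) => {
    if (req.method === "GET" && req.url === "/metrics") {
      res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4; charset=utf-8" });
      res.end(renderMetrics());
    } else {
      res.writeHead(404);
      res.end();
    }
  });
  server.listen(port);
  return server;
}
```

A worker could call `increment("worker_runs_completed_total", "Runs completed by this worker")` whenever a run finishes, then call `startMetricsServer(9100)` once at startup (the port is an arbitrary choice here). Prometheus would then scrape each worker directly, which is where the service discovery question above comes in: something has to tell Prometheus where the workers are, whereas an aggregate endpoint gives it a single, stable target.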