Closed by drjova 5 days ago
The current dashboard already displays HTTP errors; there just aren't any at the moment. The latest problem we had, with requests timing out when using the endpoint, was related to a dead pod/node. To prevent this, I guess we could add a liveness and/or readiness probe to the pods.
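For illustration, here is a rough sketch of what the pod side of such probes could look like, using only the Python standard library. The `/healthz` and `/ready` paths, the port, and the `ready` flag are assumptions made for this example and not the classifier's actual setup. The probe definitions themselves would still have to be added to the pod spec.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def probe_response(path, ready):
    """Map a probe path to (status code, JSON payload). Paths are hypothetical."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200, {"status": "alive"}
    if path == "/ready":
        # Readiness: only report ready once startup work (e.g. model load) is done.
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "loading"}
    return 404, {"error": "not found"}


class HealthHandler(BaseHTTPRequestHandler):
    # Assumed flag; the application would set it to True after startup.
    ready = False

    def do_GET(self):
        code, payload = probe_response(self.path, HealthHandler.ready)
        body = json.dumps(payload).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep frequent probe requests out of the logs.
        pass


def make_server(host="0.0.0.0", port=8080):
    """Build (but do not start) the probe server; the port is an assumption."""
    return HTTPServer((host, port), HealthHandler)
```

Calling `make_server().serve_forever()` would start it. A liveness probe pointed at `/healthz` lets the pod be restarted when the process stops answering, while a readiness probe on `/ready` keeps traffic away until the pod reports ready.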
I've prepared an example dashboard that would at least give us some more information about response time and memory usage:
Regarding alerts, I guess we could alert when a pod exceeds a response-time or memory-usage threshold. Something like the number of API calls in the last 24h fluctuates a lot, as far as I know, so it's not really practical to alert on.
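A minimal sketch of that threshold check, just to make the idea concrete. The `PodMetrics` shape, the pod names, and both threshold values are made up for this example. In practice the numbers would come from whatever the dashboard already collects, and the resulting messages could then be forwarded to a chat channel.

```python
from dataclasses import dataclass


@dataclass
class PodMetrics:
    """One sample per pod; the field names are assumptions for this sketch."""
    pod: str
    response_time_s: float
    memory_mib: float


# Hypothetical thresholds, to be tuned against real dashboard data.
MAX_RESPONSE_TIME_S = 2.0
MAX_MEMORY_MIB = 1024.0


def find_alerts(metrics,
                max_response_time_s=MAX_RESPONSE_TIME_S,
                max_memory_mib=MAX_MEMORY_MIB):
    """Return one alert message per threshold that a pod exceeds."""
    alerts = []
    for m in metrics:
        if m.response_time_s > max_response_time_s:
            alerts.append(
                f"{m.pod}: response time {m.response_time_s:.2f}s "
                f"exceeds {max_response_time_s:.2f}s"
            )
        if m.memory_mib > max_memory_mib:
            alerts.append(
                f"{m.pod}: memory usage {m.memory_mib:.0f} MiB "
                f"exceeds {max_memory_mib:.0f} MiB"
            )
    return alerts


if __name__ == "__main__":
    samples = [
        PodMetrics("classifier-a", response_time_s=0.5, memory_mib=512),
        PodMetrics("classifier-b", response_time_s=3.2, memory_mib=2048),
    ]
    for message in find_alerts(samples):
        print(message)
```

Checking each pod separately, rather than an aggregate such as the daily call count, keeps the alert tied to the failure we actually saw: a single pod that stops responding.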
Added liveness probe to classifier pods.
We have monitoring for the classifier, but it's not very useful. We need to see all HTTP errors and memory usage, and ideally send an error message to Zulip to alert us.