Fix monitoring for classifier

drjova commented 2 weeks ago

The monitoring we have for the classifier is not very useful. We need to see all HTTP errors and memory usage, and ideally send an error message to Zulip to alert us.

PascalEgn commented 1 week ago

The current dashboard already displays HTTP errors, it's just that there are none. The latest problem we had, with requests timing out when using the endpoint, was related to a dead pod/node. To prevent this, I guess we could add a liveness and/or readiness probe to the pods.
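For illustration, a minimal sketch of the kind of endpoint a liveness or readiness probe could poll. This is not the classifier's actual code: the `/healthz` path, the port, and the names `health_response`, `HealthHandler` and `serve` are all assumptions made up for this example, and it only uses the Python standard library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical probe path; the real service may expose something different.
HEALTH_PATH = "/healthz"


def health_response(path):
    """Return (status_code, body) for a probe request on the given path."""
    if path == HEALTH_PATH:
        return 200, b"ok"
    return 404, b"not found"


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET requests so a probe can tell whether the process is alive."""

    def do_GET(self):
        status, body = health_response(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep probe traffic out of the logs.
        pass


def serve(port=8080):
    """Block forever serving probe requests (port 8080 is an assumption)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` would start the listener. A probe pointed at the assumed path would then get a 200 while the process is responsive, and the pod would be restarted or taken out of rotation once it stops answering.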

I've prepared an example dashboard that would at least give us some more information about response time and memory usage:

Regarding alerts, I guess we could alert when a pod exceeds a response time or memory usage threshold. Something like the number of API calls in the last 24h fluctuates a lot, as far as I know, so it's not really practical to alert on.
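To make the threshold idea concrete, here is a rough sketch of the check itself. The threshold values, the `PodSample` type and the `alerts_for` function are hypothetical and chosen only for this example. In practice this would more likely be configured as an alert rule in the monitoring stack than written by hand, and delivery to Zulip is left out.

```python
from dataclasses import dataclass


@dataclass
class PodSample:
    """One measurement for a pod (field names are assumptions)."""
    pod: str
    response_time_s: float
    memory_mib: float


# Hypothetical thresholds; real values would have to come from observed load.
MAX_RESPONSE_TIME_S = 2.0
MAX_MEMORY_MIB = 1024.0


def alerts_for(samples,
               max_response_time_s=MAX_RESPONSE_TIME_S,
               max_memory_mib=MAX_MEMORY_MIB):
    """Return one alert message per threshold that a pod exceeds."""
    messages = []
    for s in samples:
        if s.response_time_s > max_response_time_s:
            messages.append(
                f"{s.pod}: response time {s.response_time_s:.2f}s "
                f"exceeds {max_response_time_s:.2f}s"
            )
        if s.memory_mib > max_memory_mib:
            messages.append(
                f"{s.pod}: memory {s.memory_mib:.0f} MiB "
                f"exceeds {max_memory_mib:.0f} MiB"
            )
    return messages
```

For example, `alerts_for([PodSample("classifier-0", 2.5, 512.0)])` would yield a single message about response time, while a pod under both limits yields none.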

PascalEgn commented 6 days ago

Updated Grafana Dashboard: https://grafana.siscern.org/d/0TkuXReSzss/classifier?orgId=1&refresh=5s&var-namespace=inspire-prod&var-method=POST&var-path=%2Fapi%2Fpredict%2Fcoreness&from=now-15m&to=now

PascalEgn commented 6 days ago

Added liveness probe to classifier pods.

cern-sis / issues-inspire

Fix monitoring for classifier #560