UserOfficeProject / issue-tracker

Shared place for features and bugs from all collaborators.

Add Health Endpoint #1176

Open TCMeldrum opened 5 days ago

TCMeldrum commented 5 days ago

Recently we saw some database connection issues. We have a ping monitoring job, but it only checks that the app is live. We do have a /health endpoint: https://github.com/UserOfficeProject/user-office-core/blob/develop/apps/backend/src/middlewares/healthCheck.ts but it is not very detailed.

I think we should add checks for the database, RabbitMQ, and maybe the external auth server. We could also add a call to the factory health check endpoint.

This follows from the August/September database issues.

janosbabik commented 5 days ago

Hi @TCMeldrum, I would not change the health endpoint. Its job is to respond with a 200 status code, which means the application is running, and nothing more.

Do you collect the logs from the app somewhere, like Graylog? If so, you can define alerts based on error logs. When the application throws a database error due to a query failure, you will receive a notification. We use a Graylog server where alerts are defined based on log levels, and we receive email notifications.

However, this does not happen when no query is running. We can create a periodic database connection check by running a simple "SELECT 1" query every X seconds. If it fails, Knex will throw an error, and we can get notifications.
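A minimal sketch of such a periodic check, assuming only that the database handle exposes a `raw()` method the way a Knex instance does. The names `RawQueryable`, `pingDatabase`, and `startDatabasePing` are hypothetical and do not exist in user-office-core; the `onFailure` callback stands in for whatever error logging feeds the alerts.

```typescript
// Minimal shape of the database handle used here (an assumption).
// A Knex instance fits it, because knex.raw() returns a thenable.
interface RawQueryable {
  raw(sql: string): PromiseLike<unknown>;
}

// Runs one "SELECT 1"; rejects if the query fails.
async function pingDatabase(db: RawQueryable): Promise<void> {
  await db.raw("SELECT 1");
}

// Hypothetical helper: pings the database every intervalMs milliseconds and
// passes any error to onFailure (for example, a logger that triggers alerts).
// Returns a function that stops the periodic check.
function startDatabasePing(
  db: RawQueryable,
  intervalMs: number,
  onFailure: (error: unknown) => void
): () => void {
  const timer = setInterval(() => {
    pingDatabase(db).then(undefined, onFailure);
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Usage would look roughly like `const stop = startDatabasePing(knexInstance, 30000, (err) => logger.logError(err))`, where `knexInstance` and `logger.logError` are placeholders for the app's own objects.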

We are planning to integrate the Prometheus client, which can be used to export application metrics. We can create custom metrics. For example, we could define a metric for the database connection state (or RabbitMQ connection, etc.), which can be scraped to trigger alerts if the application is not connected to the database.
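To illustrate what such a metric looks like when scraped, here is a dependency-free sketch that hand-writes the Prometheus text exposition format. In practice the Prometheus client library would produce this output; the metric name `app_dependency_connected`, the `dependency` label, and the `ConnectionStates` type are all hypothetical.

```typescript
// Hypothetical snapshot of the connection states the app would track.
interface ConnectionStates {
  database: boolean;
  rabbitmq: boolean;
}

// Renders a gauge in the Prometheus text exposition format:
// 1 means connected, 0 means not connected.
function renderConnectionMetrics(states: ConnectionStates): string {
  const lines = [
    "# HELP app_dependency_connected 1 if the app is connected to the dependency, 0 otherwise.",
    "# TYPE app_dependency_connected gauge",
    `app_dependency_connected{dependency="database"} ${states.database ? 1 : 0}`,
    `app_dependency_connected{dependency="rabbitmq"} ${states.rabbitmq ? 1 : 0}`,
  ];
  return lines.join("\n") + "\n";
}
```

An alert could then fire on an expression such as `app_dependency_connected == 0`, assuming the metric is exposed under that name.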

https://prometheus.io/

TCMeldrum commented 4 days ago

We use Logstash to ship our logs to Elasticsearch (we are migrating to OpenSearch), and we have recently set up a Prometheus instance to monitor RabbitMQ, so maybe we could do something similar. That said, I think a more detailed health endpoint would be useful, especially with Docker health checks (https://docs.docker.com/reference/dockerfile/#healthcheck) and Kubernetes health checks (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).

Maybe a topic for discussion in the Wednesday meeting.

janosbabik commented 4 days ago

Unfortunately, I will not be able to attend the next meeting, so I will share my thoughts here. :)

For the liveness probe, I wouldn't check the database, because the kubelet kills and restarts the container when that probe fails. That is unnecessary here, since Knex reconnects to the database when possible.

For the readiness endpoint, we could have a /health/db endpoint, for example, where a SELECT 1 query runs. Based on the result, Kubernetes either routes traffic to the app or withholds it. Once the SQL query succeeds again, the app is considered ready again.
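The split between the two probes could be sketched as a framework-independent handler like the one below. This is an assumption-laden illustration, not the existing healthCheck.ts middleware: `handleProbe`, `ProbeResult`, and `RawQueryable` are hypothetical names, and the database handle only needs a Knex-like `raw()` method.

```typescript
// Minimal shape of the database handle (an assumption); a Knex instance fits it.
interface RawQueryable {
  raw(sql: string): PromiseLike<unknown>;
}

// Hypothetical result type: HTTP status code plus a short body.
interface ProbeResult {
  status: number;
  body: string;
}

async function handleProbe(path: string, db: RawQueryable): Promise<ProbeResult> {
  if (path === "/health") {
    // Liveness: no dependency checks. If the process can answer, it is alive.
    return { status: 200, body: "OK" };
  }
  if (path === "/health/db") {
    // Readiness: succeed only while a trivial query goes through.
    try {
      await db.raw("SELECT 1");
      return { status: 200, body: "OK" };
    } catch {
      // A status outside the 200-399 range makes an HTTP probe fail.
      return { status: 503, body: "database unavailable" };
    }
  }
  return { status: 404, body: "not found" };
}
```

With this shape, a database outage takes the pod out of rotation through the readiness probe without triggering a restart through the liveness probe.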