Closed sunank200 closed 10 months ago
We have the components below, and we need to know if any of them goes down. @sunank200 Any thoughts on what we want to monitor?
# Weaviate
# Firestore
# API server
# UI
# Slack
# Airflow
Let's monitor the Cloud Run APIs with https://cloud.google.com/blog/products/serverless/cloud-run-healthchecks/. Cloud Run has an option to run health checks directly.
Do we need it for Firestore? That seems like overkill for now.
We need a health check for the API server, using the health checks built into Cloud Run. We need a similar check for the UI, although an Airflow DAG may be a better option for the UI.
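As a rough illustration of the API server check, here is a minimal sketch of a probe that treats any 2xx response as healthy. The `check_health` name, the `/health` path in the usage note, and the injectable `opener` parameter are assumptions for illustration only; the thread does not say which endpoint the API server exposes, and Cloud Run's own probes would be configured on the service rather than written as code like this.

```python
import urllib.request


def check_health(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx status within `timeout` seconds.

    `opener` is a hypothetical injection point so the probe can be tested
    without real network access; by default it is urllib's `urlopen`.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), and timeouts are all OSError subclasses.
        return False
```

Something like `check_health("https://<api-host>/health")` could then be called from a scheduled job, for example an Airflow task, with the host and path replaced by whatever the service actually exposes.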
For Airflow, we have the ingestion and feedback DAGs. Do we need monitoring for them now?
SLA - if it goes down.
Add a Slack channel that gets alerted if a service is down. This includes
This DAG needs to be monitored on a daily basis: https://cloud.astronomer.io/clmkpupdk000401lpj28teo2t/deployments/clo5em1ec2106164zxof2uulcqu/overview
We should have Slack alerts like the ones we have for providers on the internal Astronomer Slack.
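A Slack alert of this kind could be sketched as follows, assuming a Slack incoming webhook is used. That is an assumption: the thread mentions a Slack bot, so the actual mechanism may differ. The function names, the message wording, and the webhook URL are all hypothetical.

```python
import json
import urllib.request


def build_slack_alert(service: str, detail: str) -> dict:
    """Build an incoming-webhook payload for a service that is down."""
    return {"text": f":red_circle: *{service}* is down: {detail}"}


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the JSON payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A failing health check or a DAG failure callback could call `post_to_slack(webhook_url, build_slack_alert("API server", "health check failed"))`, with `webhook_url` read from a secret rather than hard-coded.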
Discussed the task breakdown for the observability task:
We will have a Slack channel with status on a daily basis.
To be taken care of by @pankajastro:
To be taken care of by @sunank200:
Waiting on https://galileo.astronomer.io/support/tickets/6948 for the Slack bot.
Waiting on PR review.