astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0
192 stars 47 forks

Implement some observability/monitoring part-1 #39

Closed sunank200 closed 10 months ago

sunank200 commented 11 months ago
pankajastro commented 11 months ago

We have the components below, and any of them could go down. @sunank200 Any thoughts on what we want to monitor?

    # Weaviate
    # Firestore
    # API server
    # UI
    # Slack
    # Airflow
sunank200 commented 11 months ago

Let's monitor the Cloud Run APIs with Cloud Run health checks (https://cloud.google.com/blog/products/serverless/cloud-run-healthchecks/). Cloud Run has an option to run health checks directly.

Do we need it for Firestore? That seems like overkill for now.

We need a health check for the API server, using the Cloud Run health check. We need a similar check for the UI, although an Airflow DAG may be a better option for the UI.
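As a rough illustration of the kind of probe discussed here, the sketch below checks whether an HTTP endpoint answers with a 2xx status. It is only a sketch under assumptions: the function names `http_status` and `is_healthy` and the `/health` path are hypothetical and not taken from the ask-astro code. A function like this could be called from an Airflow task to check the UI.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx or 5xx).
        return exc.code
    except OSError:
        # Covers URLError, connection failures, and timeouts.
        return 0


def is_healthy(url: str, fetch: Callable[[str], int] = http_status) -> bool:
    """Treat any 2xx response as healthy; `fetch` is injectable for testing."""
    return 200 <= fetch(url) < 300
```

For example, `is_healthy("https://<ui-host>/health")` would return `False` both when the service responds with a 5xx status and when it cannot be reached at all. The host and path here are placeholders.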

For Airflow, we have the ingestion and feedback DAGs. Do we need monitoring for them now?

sunank200 commented 11 months ago

SLA - if it goes down.

Add a Slack channel for alerts if a service is down. This includes

sunank200 commented 11 months ago

This DAG needs to be monitored on a daily basis: https://cloud.astronomer.io/clmkpupdk000401lpj28teo2t/deployments/clo5em1ec2106164zxof2uulcqu/overview

sunank200 commented 11 months ago

We should have Slack alerts like the ones we have for providers on the internal Astronomer Slack.
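One possible shape for such an alert, assuming a Slack incoming webhook is used (the thread does not say which mechanism the existing provider alerts rely on). The helper names `build_alert` and `send_alert` and the message wording are hypothetical:

```python
import json
import urllib.request


def build_alert(service: str, detail: str) -> dict:
    """Build the JSON payload for a Slack incoming webhook message."""
    return {"text": f":red_circle: {service} is down: {detail}"}


def send_alert(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook URL and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```

Keeping the payload construction separate from the network call makes the message format easy to test without contacting Slack.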

sunank200 commented 11 months ago

Discussed the task breakdown for the observability task:

We will have a Slack channel with status on a daily basis.
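A daily status message for that channel could be assembled along these lines. This is only a sketch: the function name `format_daily_status`, the message layout, and the idea of passing in a name-to-boolean mapping are assumptions, not the format the team settled on.

```python
from datetime import date
from typing import Mapping


def format_daily_status(statuses: Mapping[str, bool], day: date) -> str:
    """Render one line per component plus a summary line for a daily status post."""
    lines = [f"Ask Astro status for {day.isoformat()}"]
    for name, ok in statuses.items():
        lines.append(f"- {name}: {'up' if ok else 'DOWN'}")
    # Collect the components that failed their check for the summary line.
    down = [name for name, ok in statuses.items() if not ok]
    if down:
        lines.append(f"{len(down)} component(s) need attention: {', '.join(down)}")
    else:
        lines.append("All components healthy.")
    return "\n".join(lines)
```

The mapping keys could be the components listed earlier in the thread (Weaviate, Firestore, API server, UI, Slack, Airflow), with each value coming from whatever check is chosen for that component.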

To be taken care of by @pankajastro :

To be taken care of by @sunank200

pankajastro commented 11 months ago

Waiting on https://galileo.astronomer.io/support/tickets/6948 for the Slack bot.

pankajastro commented 10 months ago

Waiting on PR review.