Implement some observability/monitoring part-1

astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer

https://ask.astronomer.io/

Apache License 2.0

192 stars, 47 forks

Implement some observability/monitoring part-1 #39

Closed: sunank200 closed this issue 10 months ago

sunank200 commented 11 months ago

firebase:
Weaviate:
UI
langsmith: (low priority)

pankajastro commented 11 months ago

These are the components we have that could go down. @sunank200, any thoughts on which of them we want to monitor?

    # Weaviate
    # Firestore
    # API server
    # UI
    # Slack
    # Airflow

sunank200 commented 11 months ago

Let's monitor the Cloud Run APIs with https://cloud.google.com/blog/products/serverless/cloud-run-healthchecks/. Cloud Run has an option to run health checks directly.

Do we need it for Firestore? That seems like overkill for now.

We need a health check for the API server, using the health check support in Cloud Run. We also need a similar check for the UI. An Airflow DAG may be a better option for the UI.
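To make the API-server health check concrete, here is a minimal sketch of an HTTP health endpoint that a Cloud Run HTTP probe could be pointed at. It uses only the Python standard library and is not taken from the ask-astro code; the `/healthz` path, the port, and the names `HealthHandler` and `make_server` are assumptions chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical probe path; the real API server may expose something different.
HEALTH_PATH = "/healthz"


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET requests on HEALTH_PATH with 200 and a small JSON body."""

    def do_GET(self):
        if self.path == HEALTH_PATH:
            status, payload = 200, {"status": "ok"}
        else:
            status, payload = 404, {"error": "not found"}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging so frequent probes do not flood the logs.
        pass


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Build (but do not start) a server that serves the health endpoint."""
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts the server. In practice the endpoint would more likely be a route on the existing API server than a separate process, so that a passing probe actually reflects the API's state.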

For Airflow, we have the ingestion and feedback DAGs. Do we need monitoring for them right now?

sunank200 commented 11 months ago

SLA - if it goes down.

Add a Slack channel that gets an alert if a service is down. This includes:

APIs
UI
Response time is very high (more than 30 seconds)
Slack bot
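The alert conditions above (service down, or response time over 30 seconds) could be checked along these lines. This is a rough sketch, not the project's implementation: `check_service`, `http_fetch`, and the 60-second request timeout are hypothetical, and only the 30-second threshold comes from the list above.

```python
import time
import urllib.request
from typing import Callable

# Threshold from the list above: responses slower than this count as "slow".
SLOW_THRESHOLD_SECONDS = 30.0


def http_fetch(url: str, timeout: float = 60.0) -> int:
    """GET the URL and return the HTTP status code.

    urlopen raises for connection failures and for 4xx/5xx responses,
    which check_service treats as "down".
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def check_service(
    url: str,
    fetch: Callable[[str], int] = http_fetch,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Classify a service as "ok", "slow", or "down".

    fetch and clock are injectable so the logic can be tested offline.
    """
    start = clock()
    try:
        status = fetch(url)
    except Exception:
        return "down"
    elapsed = clock() - start
    if status >= 400:
        return "down"
    if elapsed > SLOW_THRESHOLD_SECONDS:
        return "slow"
    return "ok"
```

For example, `check_service("https://ask.astronomer.io/")` would poll the UI once. Any result other than `"ok"` is what would trigger a message to the alert channel.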

sunank200 commented 11 months ago

This DAG needs to be monitored on a daily basis: https://cloud.astronomer.io/clmkpupdk000401lpj28teo2t/deployments/clo5em1ec2106164zxof2uulcqu/overview

sunank200 commented 11 months ago

We should have Slack alerts like the ones we have for providers on the internal Astronomer Slack.

sunank200 commented 11 months ago

Discussed the task breakdown for the observability task:

We will have a Slack channel with status on a daily basis.

To be taken care of by @pankajastro :

Firebase: check if the DB exists
Weaviate: a Python script to check the number of records
UI: poll ask.astronomer.io
Airflow => Astro self
We need a Slack app and a Slack channel, with the app added to the channel
One DAG that will collect the status and post it in the Slack channel
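The "collect status and post in the Slack channel" step could look roughly like the sketch below, which such a DAG task might call. It is an illustration under assumptions: the function names are hypothetical, each component check is modeled as a callable returning `True` or `False`, and posting goes through a Slack incoming webhook (a JSON body with a `text` field) rather than whatever Slack app integration is eventually set up.

```python
import json
import urllib.request
from typing import Callable, Dict


def collect_statuses(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every check; a check that raises is recorded as unhealthy."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def format_report(statuses: Dict[str, bool]) -> str:
    """Render one line per component for the daily Slack message."""
    lines = ["Daily status report"]
    for name, healthy in statuses.items():
        lines.append(f"- {name}: {'OK' if healthy else 'DOWN'}")
    return "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> int:
    """POST the report to a Slack incoming webhook and return the HTTP status.

    Assumes an incoming-webhook URL is available; that is not stated in
    the thread.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

The checks for Firebase, Weaviate, and the UI listed above would each be passed in as one entry of `checks`, so adding a component later only means adding another callable.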

To be taken care of by @sunank200

API: monitoring for each API (explore Postman)
Slack bot: [1. live, 2. user 3. question] (Needs more discussion)
diff channel => [Test case]

pankajastro commented 11 months ago

Waiting on https://galileo.astronomer.io/support/tickets/6948 for the Slack bot.

pankajastro commented 10 months ago

waiting on PR review