astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0
192 stars 47 forks

Alert only on relevant GCP errors in Slack #243

Closed pankajastro closed 6 months ago

pankajastro commented 8 months ago

Recently, we integrated GCP error log alerts into our monitoring channel, and since then, the channel has been flooded with alert messages. Upon reviewing the logs, it seems that the errors may not be relevant. I suggest we investigate the possibility of configuring alerts to notify only for errors that are pertinent to our operations.

Slack event: https://astronomer.slack.com/archives/C063391HTGA/p1704172035215369

Log Explorer link, in case you want to check the errors that triggered the alerts (update the project in the link):

https://console.cloud.google.com/logs/query;query=severity%3DERROR%20severity%3E%3DDEFAULT;pinnedLogId=2023-12-30T11:29:25.071199Z%2F-sa93vce1q57e;cursorTimestamp=2023-12-30T11:29:25.825394Z;aroundTime=2023-12-30T11:29:25.071199Z;duration=PT1H?project=-

pankajkoti commented 8 months ago

This was configured on 30th December, and there have been 4 alerts in the Slack channel in the 4 days since. It sends an alert whenever there is an ERROR log in the application.

The current filter for the alerts is log severity=ERROR. If we can identify a more fine-grained filter, we can apply it. If we can't, I think we will be limited to what GCP offers until we forward application logs to Chronosphere, where we would perhaps have more flexibility and control.

But @pankajastro, why do you think the alerts are not relevant? When I click on the incident from the Slack messages, it takes me to application failure logs, such as status 500 responses.
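For reference, here is a rough sketch of what a more fine-grained filter than severity=ERROR could look like. It only builds a filter string in the Cloud Logging query language; `build_alert_filter` and its parameters are hypothetical names, and whether `httpRequest.status` or `textPayload` is actually populated depends on how the application emits its logs, which this thread does not confirm.

```python
def build_alert_filter(min_severity="ERROR", min_http_status=None, exclude_substrings=()):
    """Compose a Cloud Logging query-language filter string.

    Hypothetical helper: the field choices are assumptions about how the
    application logs, not something confirmed in this issue.
    """
    # Start from the severity condition the alert already uses.
    clauses = [f"severity>={min_severity}"]

    # Optionally restrict to server-side failures such as status 500.
    if min_http_status is not None:
        clauses.append(f"httpRequest.status>={min_http_status}")

    # Optionally drop entries whose text payload contains known noise.
    for text in exclude_substrings:
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'NOT textPayload:"{escaped}"')

    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_alert_filter(min_http_status=500,
                             exclude_substrings=["question too long"]))
```

The resulting string could be pasted into Log Explorer to check which entries it matches before changing the alert condition.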

pankajkoti commented 8 months ago

This is what the error incident shows when I click on it, and these seem to be legitimate application error logs:

[Screenshot 2024-01-02 at 1 12 19 PM]

pankajastro commented 8 months ago

Hmm, I checked the logs further, and it looks like the question asked by the user was very long (in characters). So maybe we should restrict the question length?
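A minimal sketch of such a restriction, assuming we validate the question before it reaches the LLM. The names `MAX_QUESTION_CHARS`, `QuestionTooLongError`, and `validate_question` are hypothetical, and the limit of 2000 characters is a placeholder rather than a value taken from the logs.

```python
MAX_QUESTION_CHARS = 2000  # placeholder limit, not derived from the actual logs


class QuestionTooLongError(ValueError):
    """Raised when a user question exceeds the allowed length."""


def validate_question(question, max_chars=MAX_QUESTION_CHARS):
    """Return the stripped question, or raise if it is empty or too long."""
    cleaned = question.strip()
    if not cleaned:
        raise ValueError("Question must not be empty.")
    if len(cleaned) > max_chars:
        raise QuestionTooLongError(
            f"Question has {len(cleaned)} characters; the limit is {max_chars}."
        )
    return cleaned


if __name__ == "__main__":
    print(validate_question("  How do I backfill a DAG?  "))
```

Rejecting the request up front like this would let the API answer with a client error (for example a 4xx status) instead of failing later with a status 500, which should also keep these cases out of the ERROR alerts.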

davidgxue commented 6 months ago

Update

Summary of Investigation

Action Items

davidgxue commented 6 months ago

Closing this issue now that the error alerting is no longer spammy