Earth-Information-System / fireatlas

https://earth-information-system.github.io/fireatlas/docs/

FEDS needs a monitoring system #140

Open mccabete opened 2 months ago

mccabete commented 2 months ago

FEDS needs a monitoring system for when the data and github actions go down.

zebbecker commented 2 months ago

One concept I have for a quick start is to create a dashboard or report that monitors how many pixels/detections we are getting from each satellite at each timestep. If this drops to zero for any satellite, our ingest is obviously broken, and we can hang an automated alert on that. Doing it this way might also let us catch trends that indicate a problem short of the ingest being down entirely. I'm imagining some sort of unknown issue with shifting orbits or ingest timing that results in getting some pixels, but not the full range we want. If SNPP suddenly trended down 50% while NOAA-20 didn't, for example, that would be concerning.

@jsignell @ranchodeluxe is this a good/bad direction to start out in? And, if I were to do this, where do you think it would make the most sense to log this information? @mccabete where might you want to be able to check on this?
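
For concreteness, here is a minimal sketch of the count check I have in mind, assuming the detections for a run are loaded into a pandas DataFrame with hypothetical `satellite` and `timestep` columns (the real fireatlas column names may differ):

```python
import pandas as pd

# Hypothetical column names ("satellite", "timestep"); the real fireatlas
# pixel tables may use different names.
SATELLITES = ["SNPP", "NOAA20"]

def detection_counts(pixels: pd.DataFrame) -> pd.DataFrame:
    """Count detections per satellite per timestep for a dashboard/report."""
    return (
        pixels.groupby(["timestep", "satellite"])
        .size()
        .unstack(fill_value=0)
        .reindex(columns=SATELLITES, fill_value=0)
    )

def zero_count_alerts(counts: pd.DataFrame) -> list:
    """Flag any satellite/timestep combination with zero detections."""
    alerts = []
    for timestep, row in counts.iterrows():
        for sat in SATELLITES:
            if row[sat] == 0:
                alerts.append(f"{timestep}: no {sat} detections ingested")
    return alerts
```

The counts table could back the dashboard/report, and the zero check is where an automated alert would hang off.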

ranchodeluxe commented 2 months ago

> Doing it this way might also allow us to catch trends that indicate a problem short of the ingest being down entirely- I'm imagining some sort of unknown issue with shifting orbits or ingest timing that results in getting some pixels, but not the full range

Yeah, let's think about "monitoring" in different levels/situations:

  1. I think what you are outlining above is that we need better monitoring for when the SNPP, NOAA20, and NOAA21 (in the future) satellites are giving us bad data (some of what this ticket talks about in the "ReachGoal": https://github.com/Earth-Information-System/fireatlas/issues/136). I'm fine with whatever we decide to come up with here. It can just bail and throw an exception that bubbles up and is never caught. Zeb and I talked about checking for zero and also doing a statistical average with some kind of standard deviation check; a rough sketch of that idea is below, after this list. Ownership: definitely EIS Fire team

  2. We also need to alert when any DPS job fails. We basically already have this, without DPS creating anything new for us, by running a scheduled check for jobs that ran within the last hour: https://github.com/Earth-Information-System/fireatlas/actions/workflows/schedule-alert-failed-dps-jobs.yaml. Ownership: definitely EIS Fire team, based on the status DPS gives us back

  3. We also need to alert when ingestions fail. We kinda already have this covered through CloudWatch dashboards (like the existing one we have in the AWS UAH account) and CloudWatch alarms (see the screenshots below). Ownership: veda-data-services team, because they control where the Airflow ingest runs. But we have options for how to interact with that. For example, we could request a token to get status information about the EIS ingest from Airflow and alert on that, just like we did above for DPS. Or we could request that the CloudWatch dashboards be exported, exposed, or alarmed on for our needs

    [Screenshots: CloudWatch dashboard and alarm examples, 2024-09-10]

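To make the zero/standard-deviation idea from item 1 concrete, here is a minimal sketch, assuming we keep a short history of recent per-satellite detection counts (the function name and the 3-sigma threshold are just placeholders):

```python
import statistics
from typing import Optional

def count_anomaly(history: list, latest: int, n_sigma: float = 3.0) -> Optional[str]:
    """Check the latest per-satellite detection count against recent history.

    Returns an alert message if the count is zero or falls more than
    n_sigma standard deviations below the recent mean, else None.
    """
    if latest == 0:
        return "no detections ingested"
    if len(history) < 5:
        return None  # not enough history for a meaningful baseline
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev > 0 and latest < mean - n_sigma * stdev:
        return f"count {latest} is more than {n_sigma} sigma below the recent mean of {mean:.0f}"
    return None

# Example: SNPP drops ~50% while recent days hovered around 10k detections.
alert = count_anomaly([9800, 10200, 9900, 10100, 10000], 5000)
if alert:
    raise RuntimeError(f"SNPP check failed: {alert}")
```

If we just let that exception bubble up from the scheduled job, we get the "bail and never catch it" behavior described above.
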
ranchodeluxe commented 2 months ago

Zeb and I are talking, so here are some notes about AWS, where things live, and the codebases involved. The repos below show how workflows are provisioned in AWS for the EIS Fire team, along with AWS VPC best practices.

  1. Legacy/current API code and IaC (in Terraform) for setting up the API at https://firenrt.delta-backend.com/. This API uses Elastic Container Service (ECS) and is a little more expensive, which is why we are moving to the new API mentioned in item 2 below. Repo: https://github.com/NASA-IMPACT/veda-features-api

  2. New API code and IaC (in CDK) for setting up the API using Lambdas. This will be at a different domain. Repo: https://github.com/NASA-IMPACT/veda-features-api-cdk/

  3. Ingest from the VEDA S3 bucket, via VEDA Airflow, into the database that backs the API at https://firenrt.delta-backend.com/. Repo: https://github.com/NASA-IMPACT/veda-data-airflow/blob/dev/docker_tasks/vector_ingest/handler.py

mccabete commented 2 months ago

I second Greg's points about watching out for timezone issues, and that the monitoring is really about data quality, not just data there vs. not there. One thing we haven't discussed: we might be able to just tie an alert to the places that put out bulletins on when VIIRS data quality takes a hit. Although we have definitely run into issues faster than the bulletins, or when they haven't been posted at all. I think some of that comes from the specific files we are using.

ranchodeluxe commented 2 months ago

@zebbecker: I'm pretty sure anyone who subscribes to all notifications from this repository will get the emails about failed jobs too? Do I have this wrong? I mean, Slack would work great too!

zebbecker commented 2 months ago

Unfortunately, yeah, I think that is wrong. I only get emails for "my" workflows: ones that I manually triggered with a workflow dispatch, or the pytest workflow when it is triggered by an action that I take. So, since you set up the scheduled runs, you would get all the emails that are needed, but there is no way to automatically send those emails to other users (e.g. me). It is a known issue/feature request to let GitHub be configured so that more than just the action owner can be notified, but that request has been open for years with no updates.

mccabete commented 2 months ago

Ooof, I see that you are correct, Zeb. That does explain why I was only seeing alerts for boreal runs.

mccabete commented 2 months ago

A quick alarm system could just check whether the API has data from yesterday. That's more or less exactly what I do whenever I want to double-check how our data looks.
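
A minimal sketch of that check, assuming the OGC API Features interface at https://firenrt.delta-backend.com/ supports a datetime filter and using a hypothetical collection ID (swap in the real NRT perimeter collection name):

```python
from datetime import date, timedelta

import requests

API = "https://firenrt.delta-backend.com"
# Hypothetical collection ID - swap in the real NRT perimeter collection name.
COLLECTION = "public.eis_fire_snapshot_perimeter_nrt"

def has_data_from_yesterday() -> bool:
    """Return True if the features API has at least one item dated yesterday."""
    today = date.today()
    yesterday = today - timedelta(days=1)
    params = {
        "datetime": f"{yesterday}T00:00:00Z/{today}T00:00:00Z",
        "limit": 1,
    }
    resp = requests.get(
        f"{API}/collections/{COLLECTION}/items", params=params, timeout=30
    )
    resp.raise_for_status()
    return len(resp.json().get("features", [])) > 0

if not has_data_from_yesterday():
    raise RuntimeError("FEDS API has no data from yesterday - check ingest/DPS jobs")
```

Run on a schedule, this would catch the whole chain (DPS runs, ingest, and the API) going stale, even if each individual piece reports itself as healthy.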