WikiWatershed / model-my-watershed

The web application front end for Model My Watershed.
https://modelmywatershed.org
Apache License 2.0

Add Job Performance Metrics #3596

Open rajadain opened 1 year ago

rajadain commented 1 year ago

Add a CloudWatch (or other) dashboard to monitor app-specific metrics. We run these queries manually from time to time, but the metrics should be available on demand so we can monitor the ongoing health of the application.

Once such metrics are in place, we can also consider adding alarms for when they exceed some bounds.
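
As a rough illustration of the alarm idea, the sketch below flags a metric once it stays above an upper bound for several consecutive data points. The function name `exceeds_bound`, the 60-second bound, the three-point window, and the sample numbers are all hypothetical, chosen only for the example. A real setup would more likely rely on the alarm features of CloudWatch or whichever dashboard is chosen.

```python
from typing import Sequence


def exceeds_bound(values: Sequence[float], upper_bound: float, consecutive: int = 3) -> bool:
    """Return True if the last `consecutive` values are all above `upper_bound`."""
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    if len(values) < consecutive:
        return False  # not enough data points to decide
    return all(v > upper_bound for v in values[-consecutive:])


# Hypothetical daily p95 job durations in seconds (made-up numbers).
daily_p95 = [41.0, 44.5, 52.0, 63.2, 67.8, 71.4]
print(exceeds_bound(daily_p95, upper_bound=60.0))
```

Requiring several consecutive breaches is one way to avoid alarming on a single anomalous day.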

Examples of current metrics:

rajadain commented 1 year ago

Noticing that the average job runtime keeps going up, and that it seems to reset with every production deployment:

SELECT DATE(created_at), EXTRACT(EPOCH FROM AVG(delivered_at - created_at)) AS avg_seconds
FROM core_job
WHERE created_at >= '2023-01-01'
GROUP BY DATE(created_at)
HAVING EXTRACT(EPOCH FROM AVG(delivered_at - created_at)) < 700 -- exclude anomalies
ORDER BY DATE(created_at);

image
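
For readers less familiar with `HAVING`, here is a small Python sketch of what the query above computes. It is only an approximation under stated assumptions: the `daily_avg_seconds` helper and the sample jobs are made up for illustration, and the `created_at >= '2023-01-01'` filter is left out. The point it shows is that the 700-second cutoff drops whole days based on their average, not individual slow jobs.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Optional, Tuple

Job = Tuple[datetime, Optional[datetime]]  # (created_at, delivered_at)


def daily_avg_seconds(jobs: Iterable[Job], max_avg: float = 700.0) -> Dict[str, float]:
    """Average job duration per creation date, keeping only days with average < max_avg."""
    durations = defaultdict(list)
    for created_at, delivered_at in jobs:
        if delivered_at is None:
            continue  # AVG ignores NULLs, so undelivered jobs do not count
        durations[created_at.date().isoformat()].append(
            (delivered_at - created_at).total_seconds()
        )
    averages = {day: sum(vals) / len(vals) for day, vals in durations.items()}
    # Mirrors HAVING: the filter applies to the daily average, not to single jobs.
    return {day: avg for day, avg in sorted(averages.items()) if avg < max_avg}


# Made-up sample data: two quick jobs, one 30-minute job, one undelivered job.
jobs = [
    (datetime(2023, 1, 2, 9, 0, 0), datetime(2023, 1, 2, 9, 0, 10)),
    (datetime(2023, 1, 2, 10, 0, 0), datetime(2023, 1, 2, 10, 0, 30)),
    (datetime(2023, 1, 3, 9, 0, 0), datetime(2023, 1, 3, 9, 30, 0)),
    (datetime(2023, 1, 4, 9, 0, 0), None),
]
print(daily_avg_seconds(jobs))
```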

rajadain commented 1 year ago

Another version of the query above, but looking at p75 and p95 metrics instead of the average:

WITH query_durations AS (
    SELECT DATE(created_at) AS date, EXTRACT(EPOCH FROM delivered_at - created_at) AS duration
    FROM core_job
    WHERE created_at >= '2023-01-01'
)

SELECT date,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration) AS p75,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY duration) AS p95
FROM query_durations
WHERE date NOT IN (
    -- Exclude dates with exceptional values to better see trend
    '2023-01-16',
    '2023-02-16',
    '2023-02-20',
    '2023-02-22')
GROUP BY date
ORDER BY date;

image
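
To make the p75 and p95 columns concrete, the following sketch reimplements a continuous percentile in plain Python, assuming the usual linear interpolation between the two nearest sorted values that `PERCENTILE_CONT` uses. The helper name and the sample durations are hypothetical, and the sketch ignores grouping by date and NULL handling.

```python
import math
from typing import Sequence


def percentile_cont(values: Sequence[float], fraction: float) -> float:
    """Continuous percentile with linear interpolation between neighbouring values."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional index
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


durations = [10.0, 20.0, 30.0, 40.0]  # made-up durations in seconds
print(percentile_cont(durations, 0.75))
print(percentile_cont(durations, 0.95))
```

Because the result is interpolated, p95 can be a value that no single job actually took, which is worth keeping in mind when reading the chart on days with few jobs.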

rajadain commented 1 year ago

Another version of the above:

image

rajadain commented 1 month ago

Latest version of the above:

image

Interesting that after the last production deployment, the line stayed flat for a while before it started going up again. It is now higher than it has ever been on record. We should probably do another production deployment to bring it back down.