ccao-data / data-architecture

Codebase for CCAO data infrastructure construction and management
https://ccao-data.github.io/data-architecture/
6 stars 4 forks source link

Create a script to compare row counts on the Open Data portal to Athena #115

Closed wrridgeway closed 1 year ago

wrridgeway commented 1 year ago

Our historical values asset completed its automated refresh on Sept 1, 2023, but for some reason only had ~9 million rather than ~ 40 million rows. We were not notified of a failure by email. This needs investigating.

UPDATE 10/5/2023 (from Dan)

We're hearing from some folks that the Socrata agent may be failing silently in some cases, leading to the Open Data portal being incomplete or not up-to-date. Since the Open Data portal is a critical asset, we should build checks to guard against such failures.

I propose that we create a short Python script that pulls the row count by year from each Open Data asset and compares it to the count of the source Athena view. Row counts can be gathered directly from the Socrata SoQL API, e.g.:

curl 'https://datacatalog.cookcountyil.gov/resource/vgzx-68gb.json?$query=SELECT%20COUNT(*),year%20GROUP%20BY%20year' | jq

We should run the script on a schedule using GitHub Actions, and use the same notification setup used in test-dbt-models.

Since current year row counts may not match the Open Data Portal exactly, we should use a "fuzzy" match for the most recent 2 years (+/- 2%) and an exact match for all years prior to that.

wrridgeway commented 1 year ago

uploaded w/o issue on oct 1st, just FYI

dfsnow commented 1 year ago

@jeancochrane @wrridgeway See updated issue description and title.