cal-itp / reports

GTFS data quality reports for California transit providers
https://reports.calitp.org
GNU Affero General Public License v3.0

Refactor idea: Move queries of separate data artifacts into a single dbt model #277

Open atvaccaro opened 1 year ago

atvaccaro commented 1 year ago

Currently, the generate_reports_data.py script queries several different tables (e.g. fct_monthly_reports_site_organization_gtfs_vendors and fct_daily_reports_site_organization_scheduled_service_summary), whose results are processed and "joined" together only by being written into the same output folders. Rather than trying to combine these artifacts and/or add validation with something like Pydantic on top of the existing queries, it should be possible to create a single dbt model whose grain is year-month-itp_id, so that rows are 1:1 with final report pages. BigQuery rows can contain JSON and arrays to represent the nested nature of some of this data.
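To make the proposed grain concrete, here is a rough sketch of what rows from such a model might look like once fetched into Python, along with a check that each year-month-itp_id combination appears only once. Only the grain itself comes from the proposal above; the nested field names (`vendors`, `service_summary`), the sample values, and the helper functions are hypothetical.

```python
from collections import Counter

# Hypothetical rows as they might come back from the single dbt model.
# Nested fields stand in for BigQuery ARRAY / JSON columns.
rows = [
    {
        "year": 2023,
        "month": 1,
        "itp_id": 4,
        "vendors": [{"name": "Vendor A"}],
        "service_summary": {"weekday_service_hours": 120.5},
    },
    {
        "year": 2023,
        "month": 1,
        "itp_id": 10,
        "vendors": [],
        "service_summary": {"weekday_service_hours": 42.0},
    },
]


def grain_key(row):
    """Return the (year, month, itp_id) tuple that identifies one report page."""
    return (row["year"], row["month"], row["itp_id"])


def duplicate_grain_keys(rows):
    """Return the sorted grain keys that occur more than once."""
    counts = Counter(grain_key(row) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


# An empty list means rows are 1:1 with report pages.
duplicates = duplicate_grain_keys(rows)
```

In dbt itself, the same guarantee would more naturally be expressed as a uniqueness test on the grain columns; the Python check is only meant to show the intended shape.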

If this model is implemented, the "data generation" script could consist of just querying this single model and writing a single artifact per row (with some additional fields that are more difficult to produce in BigQuery, such as RT feed URLs, added post-query).
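A minimal sketch of what that simplified script could look like, assuming the query result has already been loaded as a list of dicts. The function name, the `rt_feed_urls` lookup, the output path layout, and the sample data are all hypothetical, and the BigQuery call is left out so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def write_report_artifacts(rows, rt_feed_urls, out_dir):
    """Write one JSON artifact per model row and return the paths written.

    rt_feed_urls maps itp_id -> list of URLs and is merged in post-query.
    """
    written = []
    for row in rows:
        # Copy the row so the caller's data is not mutated.
        record = dict(row)
        record["rt_feed_urls"] = rt_feed_urls.get(row["itp_id"], [])

        # Hypothetical layout: <out_dir>/<year>/<month>/<itp_id>/data.json
        target = (
            Path(out_dir)
            / str(row["year"])
            / f"{row['month']:02d}"
            / str(row["itp_id"])
            / "data.json"
        )
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(record, indent=2, sort_keys=True))
        written.append(target)
    return written


# Stand-in for the result of querying the single dbt model.
rows = [{"year": 2023, "month": 1, "itp_id": 4, "vendors": ["Vendor A"]}]
# Stand-in for fields that are easier to add outside BigQuery.
rt_feed_urls = {4: ["https://example.com/rt/vehicle_positions"]}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_report_artifacts(rows, rt_feed_urls, tmp)
    print(paths[0].relative_to(tmp).as_posix())  # prints 2023/01/4/data.json
```

Because each row maps to exactly one file, validation (for example with Pydantic) could be applied to `record` in one place just before it is written.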