cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
48 stars 13 forks source link

Optimize fct_daily_rt_feed_validation_notices #3506

Closed erikamov closed 1 month ago

erikamov commented 1 month ago

Description

This PR optimizes fct_daily_rt_feed_validation_notices for issue #3481.

The structure of the query: 1) Get all daily feeds (a record of retrievals per URL and day) 2) For each URL/day, create a row for each error code (there are 57) 3) For each URL/day/code, aggregate the validation runs, aggregate the warning/error messages, and calculate some metrics After pairing with @ohrite we believe the issue is the location of the CROSS JOIN, and are investigating moving the cross join out to a CTE with the daily feeds table, resulting in a query that finishes at about the 50 minute mark.

After removing the join to notices, the query finishes at the 20 minute mark. Re-adding the join to notices bumps up to just under 40 minutes. Our ongoing investigation will be into speeding up the notice occurrence list aggregation.

Type of change

How has this been tested?

It was tested locally generating the table on staging cal-itp-data-infra-staging.erika_mart_gtfs_quality.fct_daily_rt_feed_validation_notices

❯ poetry run dbt run -s "models/mart/gtfs_quality/fct_daily_rt_feed_validation_notices.sql"
22:26:36  Running with dbt=1.5.1
22:26:38  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
22:26:39  Found 420 models, 949 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
22:26:39
22:26:41  Concurrency: 8 threads (target='dev')
22:26:41
22:26:41  1 of 1 START sql table model erika_mart_gtfs_quality.fct_daily_rt_feed_validation_notices  [RUN]
22:27:27  1 of 1 OK created sql table model erika_mart_gtfs_quality.fct_daily_rt_feed_validation_notices  [CREATE TABLE (183.3k rows, 33.5 GiB processed) in 46.58s]
22:27:27
22:27:27  Finished running 1 table model in 0 hours 0 minutes and 48.83 seconds (48.83s).
Optimize fct_daily_rt_feed_validation_notices
22:27:28
22:27:28  Completed successfully
22:27:28
22:27:28  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Post-merge follow-ups

Monitoring DAGs transform_warehouse and transform_warehouse_full_refresh_sunday to see if there are no more errors happening when building fct_daily_rt_feed_validation_notices.

github-actions[bot] commented 1 month ago

Warehouse report 📦

DAG

Legend (in order of precedence)

Resource type Indicator Resolution
Large table-materialized model Orange Make the model incremental
Large model without partitioning or clustering Orange Add partitioning and/or clustering
View with more than one child Yellow Materialize as a table or incremental
Incremental Light green
Table Green
View White