cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
48 stars 13 forks source link

Add CTEs and optimize fct_daily_rt_feed_validation_notices #3509

Closed erikamov closed 1 month ago

erikamov commented 1 month ago

Description

This PR is a second pass on optimizing fct_daily_rt_feed_validation_notices to fix issue #3481.

@ohrite and I split the query into 4 CTEs

This reduced the runtime to < 1m. Since the restructure represents a fundamental change in how the data is aggregated, we are working on validating the output.

Type of change

How has this been tested?

We've queried both production and staging tables, compared row values for 2023-12-28, and generated csv files for 2024-10-16, comparing both files through DIFF and had zero differences.

Used SQLFluff to check the query:

sqlfluff lint warehouse/models/mart/gtfs_quality/fct_daily_rt_feed_validation_notices.sql
All Finished 📜 🎉!

The table was successfully created on staging:

❯ poetry run dbt run -s +"models/mart/gtfs_quality/fct_daily_rt_feed_validation_notices.sql"
21:48:05  Running with dbt=1.5.1
21:48:07  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
21:48:07  Found 420 models, 949 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
21:48:07
21:48:10  Concurrency: 8 threads (target='dev')
21:48:10
21:48:10  1 of 28 START sql view model erika_staging.stg_gtfs_quality__rt_validation_code_descriptions  [RUN]
21:48:10  2 of 28 START sql view model erika_staging.stg_gtfs_rt__service_alerts_outcomes  [RUN]
21:48:10  3 of 28 START sql view model erika_staging.stg_gtfs_rt__service_alerts_validation_notices  [RUN]
21:48:10  4 of 28 START sql view model erika_staging.stg_gtfs_rt__service_alerts_validation_outcomes  [RUN]
21:48:10  5 of 28 START sql view model erika_staging.stg_gtfs_rt__trip_updates_outcomes .. [RUN]
21:48:10  6 of 28 START sql view model erika_staging.stg_gtfs_rt__trip_updates_validation_notices  [RUN]
21:48:10  7 of 28 START sql view model erika_staging.stg_gtfs_rt__trip_updates_validation_outcomes  [RUN]
21:48:10  8 of 28 START sql view model erika_staging.stg_gtfs_rt__vehicle_positions_outcomes  [RUN]
21:48:11  7 of 28 OK created sql view model erika_staging.stg_gtfs_rt__trip_updates_validation_outcomes  [CREATE VIEW (0 processed) in 0.99s]
21:48:11  9 of 28 START sql view model erika_staging.stg_gtfs_rt__vehicle_positions_validation_notices  [RUN]
21:48:11  2 of 28 OK created sql view model erika_staging.stg_gtfs_rt__service_alerts_outcomes  [CREATE VIEW (0 processed) in 1.02s]
21:48:11  6 of 28 OK created sql view model erika_staging.stg_gtfs_rt__trip_updates_validation_notices  [CREATE VIEW (0 processed) in 1.02s]
21:48:11  10 of 28 START sql view model erika_staging.stg_gtfs_rt__vehicle_positions_validation_outcomes  [RUN]
21:48:11  11 of 28 START sql view model erika_staging.stg_gtfs_schedule__agency .......... [RUN]
21:48:11  8 of 28 OK created sql view model erika_staging.stg_gtfs_rt__vehicle_positions_outcomes  [CREATE VIEW (0 processed) in 1.05s]
21:48:11  4 of 28 OK created sql view model erika_staging.stg_gtfs_rt__service_alerts_validation_outcomes  [CREATE VIEW (0 processed) in 1.05s]
21:48:11  3 of 28 OK created sql view model erika_staging.stg_gtfs_rt__service_alerts_validation_notices  [CREATE VIEW (0 processed) in 1.06s]
21:48:11  12 of 28 START sql view model erika_staging.stg_gtfs_schedule__download_outcomes  [RUN]
21:48:11  13 of 28 START sql view model erika_staging.stg_gtfs_schedule__file_parse_outcomes  [RUN]
21:48:11  14 of 28 START sql view model erika_staging.stg_gtfs_schedule__unzip_outcomes .. [RUN]
21:48:11  1 of 28 OK created sql view model erika_staging.stg_gtfs_quality__rt_validation_code_descriptions  [CREATE VIEW (0 processed) in 1.10s]
21:48:11  15 of 28 START sql view model erika_staging.stg_transit_database__gtfs_datasets  [RUN]
21:48:11  5 of 28 OK created sql view model erika_staging.stg_gtfs_rt__trip_updates_outcomes  [CREATE VIEW (0 processed) in 1.18s]
21:48:12  16 of 28 START sql incremental model erika_staging.int_gtfs_rt__unioned_parse_outcomes  [RUN]
21:48:12  14 of 28 OK created sql view model erika_staging.stg_gtfs_schedule__unzip_outcomes  [CREATE VIEW (0 processed) in 0.72s]
21:48:12  10 of 28 OK created sql view model erika_staging.stg_gtfs_rt__vehicle_positions_validation_outcomes  [CREATE VIEW (0 processed) in 0.78s]
21:48:12  17 of 28 START sql incremental model erika_staging.int_gtfs_quality__rt_validation_outcomes  [RUN]
21:48:12  9 of 28 OK created sql view model erika_staging.stg_gtfs_rt__vehicle_positions_validation_notices  [CREATE VIEW (0 processed) in 0.95s]
21:48:12  18 of 28 START sql incremental model erika_staging.int_gtfs_quality__rt_validation_notices  [RUN]
21:48:12  11 of 28 OK created sql view model erika_staging.stg_gtfs_schedule__agency ..... [CREATE VIEW (0 processed) in 0.93s]
21:48:12  12 of 28 OK created sql view model erika_staging.stg_gtfs_schedule__download_outcomes  [CREATE VIEW (0 processed) in 0.99s]
21:48:12  13 of 28 OK created sql view model erika_staging.stg_gtfs_schedule__file_parse_outcomes  [CREATE VIEW (0 processed) in 1.02s]
21:48:12  19 of 28 START sql view model erika_staging.int_gtfs_schedule__grouped_feed_file_parse_outcomes  [RUN]
21:48:12  15 of 28 OK created sql view model erika_staging.stg_transit_database__gtfs_datasets  [CREATE VIEW (0 processed) in 1.04s]
21:48:12  20 of 28 START sql table model erika_staging.int_transit_database__gtfs_datasets_dim  [RUN]
21:48:14  19 of 28 OK created sql view model erika_staging.int_gtfs_schedule__grouped_feed_file_parse_outcomes  [CREATE VIEW (0 processed) in 1.19s]
21:48:14  21 of 28 START sql view model erika_staging.int_gtfs_schedule__joined_feed_outcomes  [RUN]
21:48:15  21 of 28 OK created sql view model erika_staging.int_gtfs_schedule__joined_feed_outcomes  [CREATE VIEW (0 processed) in 1.22s]
21:48:15  22 of 28 START sql table model erika_mart_gtfs.dim_schedule_feeds .............. [RUN]
21:48:17  20 of 28 OK created sql table model erika_staging.int_transit_database__gtfs_datasets_dim  [CREATE TABLE (4.9k rows, 4.9 GiB processed) in 4.30s]
21:48:17  23 of 28 START sql table model erika_mart_transit_database.bridge_schedule_dataset_for_validation  [RUN]
21:48:17  24 of 28 START sql table model erika_mart_transit_database.dim_gtfs_datasets ... [RUN]
21:48:19  23 of 28 OK created sql table model erika_mart_transit_database.bridge_schedule_dataset_for_validation  [CREATE TABLE (2.8k rows, 509.1 KiB processed) in 2.19s]
21:48:19  24 of 28 OK created sql table model erika_mart_transit_database.dim_gtfs_datasets  [CREATE TABLE (4.9k rows, 1.6 MiB processed) in 2.76s]
21:48:19  25 of 28 START sql table model erika_staging.int_transit_database__urls_to_gtfs_datasets  [RUN]
21:48:22  25 of 28 OK created sql table model erika_staging.int_transit_database__urls_to_gtfs_datasets  [CREATE TABLE (4.9k rows, 865.1 KiB processed) in 2.21s]
21:48:36  17 of 28 OK created sql incremental model erika_staging.int_gtfs_quality__rt_validation_outcomes  [SCRIPT (4.6 GiB processed) in 24.07s]
21:48:37  16 of 28 OK created sql incremental model erika_staging.int_gtfs_rt__unioned_parse_outcomes  [SCRIPT (3.8 GiB processed) in 24.60s]
21:49:27  22 of 28 OK created sql table model erika_mart_gtfs.dim_schedule_feeds ......... [CREATE TABLE (13.9k rows, 11.3 GiB processed) in 72.60s]
21:49:27  26 of 28 START sql table model erika_mart_gtfs.fct_daily_schedule_feeds ........ [RUN]
21:49:32  26 of 28 OK created sql table model erika_mart_gtfs.fct_daily_schedule_feeds ... [CREATE TABLE (277.4k rows, 2.8 MiB processed) in 4.14s]
21:49:32  27 of 28 START sql incremental model erika_mart_gtfs.fct_daily_rt_feed_files ... [RUN]
21:49:35  27 of 28 OK created sql incremental model erika_mart_gtfs.fct_daily_rt_feed_files  [MERGE (402.0 rows, 175.3 MiB processed) in 3.44s]
21:52:07  18 of 28 OK created sql incremental model erika_staging.int_gtfs_quality__rt_validation_notices  [SCRIPT (15.3 GiB processed) in 234.48s]
21:52:07  28 of 28 START sql table model erika_mart_gtfs_quality.fct_daily_rt_feed_validation_notices  [RUN]
21:52:20  28 of 28 OK created sql table model erika_mart_gtfs_quality.fct_daily_rt_feed_validation_notices  [CREATE TABLE (206.2k rows, 39.2 GiB processed) in 13.07s]
21:52:20
21:52:20  Finished running 17 view models, 7 table models, 4 incremental models in 0 hours 4 minutes and 12.53 seconds (252.53s).
21:52:20
21:52:20  Completed successfully
21:52:20
21:52:20  Done. PASS=28 WARN=0 ERROR=0 SKIP=0 TOTAL=28\

Post-merge follow-ups

Watch the next time the DAG runs to see the results and no more errors.

github-actions[bot] commented 1 month ago

Warehouse report 📦

DAG

Legend (in order of precedence)

Resource type Indicator Resolution
Large table-materialized model Orange Make the model incremental
Large model without partitioning or clustering Orange Add partitioning and/or clustering
View with more than one child Yellow Materialize as a table or incremental
Incremental Light green
Table Green
View White