cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Fix: filter Benefits bot events #3547

Closed thekaveman closed 1 week ago

thekaveman commented 1 week ago

Description

TL;DR: about 26.5 million of the roughly 27 million records in the raw Amplitude data are bot events that we don't need in the final warehouse fact table / model, so we can filter them out.

Full details in Slack thread: https://cal-itp.slack.com/archives/C037Y3UE71P/p1731533304019569

Type of change

How has this been tested?

Before this change

$ poetry run dbt run -s +fct_benefits_events
22:04:44  Running with dbt=1.5.1
22:04:46  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
22:04:46  Found 422 models, 963 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 174 sources, 4 exposures, 0 metrics, 0 groups
22:04:46  
22:04:49  Concurrency: 8 threads (target='dev')
22:04:49  
22:04:49  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
22:04:51  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.14s]
22:04:51  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
22:05:05  2 of 2 OK created sql table model kegan_mart_benefits.fct_benefits_events ...... [CREATE TABLE (26.9m rows, 73.1 GiB processed) in 14.09s]
22:05:05  
22:05:05  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 18.59 seconds (18.59s).
22:05:05  
22:05:05  Completed successfully
22:05:05  
22:05:05  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Note: CREATE TABLE (26.9m rows, 73.1 GiB processed) in 14.09s

With this change

$ poetry run dbt run -s +fct_benefits_events
22:06:50  Running with dbt=1.5.1
22:06:52  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
22:06:52  Found 422 models, 963 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 174 sources, 4 exposures, 0 metrics, 0 groups
22:06:52  
22:06:56  Concurrency: 8 threads (target='dev')
22:06:56  
22:06:56  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
22:06:57  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.13s]
22:06:57  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
22:07:13  2 of 2 OK created sql table model kegan_mart_benefits.fct_benefits_events ...... [CREATE TABLE (402.1k rows, 73.1 GiB processed) in 15.43s]
22:07:13  
22:07:13  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 20.37 seconds (20.37s).
22:07:13  
22:07:13  Completed successfully
22:07:13  
22:07:13  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Note: CREATE TABLE (402.1k rows, 73.1 GiB processed) in 15.43s
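
The two notes above can be compared directly. Below is a minimal sketch that pulls the row counts out of the two dbt status strings and computes the reduction. The helper name `parse_rows` is hypothetical, and the suffix handling is an assumption that covers only the `k` and `m` forms seen in these logs, not dbt's full output format.

```python
import re

# Status strings copied from the two dbt runs above.
BEFORE = "CREATE TABLE (26.9m rows, 73.1 GiB processed) in 14.09s"
AFTER = "CREATE TABLE (402.1k rows, 73.1 GiB processed) in 15.43s"

# Assumption: only the suffixes that appear in these logs are handled.
SUFFIXES = {"k": 1_000, "m": 1_000_000}


def parse_rows(status: str) -> int:
    """Return the approximate row count from a dbt status string (hypothetical helper)."""
    match = re.search(r"\(([\d.]+)([km]?) rows", status)
    if match is None:
        raise ValueError(f"no row count found in: {status!r}")
    number, suffix = match.groups()
    # The log values are rounded, so the result is approximate.
    return round(float(number) * SUFFIXES.get(suffix, 1))


rows_before = parse_rows(BEFORE)
rows_after = parse_rows(AFTER)
rows_removed = rows_before - rows_after
share_removed = rows_removed / rows_before

print(f"removed {rows_removed:,} of {rows_before:,} rows ({share_removed:.1%})")
```

This works out to roughly 26.5 million rows removed, in line with the estimate in the description. Both runs report 73.1 GiB processed, so the change reduces the rows written to the table, not the amount of data scanned.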

github-actions[bot] commented 1 week ago

Warehouse report 📦

DAG

Legend (in order of precedence)

| Resource type | Indicator | Resolution |
| --- | --- | --- |
| Large table-materialized model | Orange | Make the model incremental |
| Large model without partitioning or clustering | Orange | Add partitioning and/or clustering |
| View with more than one child | Yellow | Materialize as a table or incremental |
| Incremental | Light green | |
| Table | Green | |
| View | White | |

angela-tran commented 1 week ago

@thekaveman Just curious, is it expected that the storage size of fct_benefits_events is still 7.8 GB? (at least according to https://github.com/cal-itp/data-infra/pull/3547#issuecomment-2474936523...)

thekaveman commented 1 week ago

> @thekaveman Just curious, is it expected that the storage size of fct_benefits_events is still 7.8 GB? (at least according to #3547 (comment)...)

Yeah I have no idea what that means / represents. I guess I thought it would go down too... but :shrug: