cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Refactor: Benefits Amplitude events #3468

Closed thekaveman closed 1 month ago

thekaveman commented 2 months ago

Description

We recently completed a big refactor of the models in Benefits; see cal-itp/benefits#1666 for more background.

The last piece of this refactor is updating our new and historic analytics events. The following PRs update the logic for generating new events:

And this PR is for the warehouse side, to handle the new fields and adjust historical data already captured in GCS.

We don't want to merge this PR until all of the above PRs are merged and released to our prod environment.

Closes cal-itp/benefits#2247 Closes cal-itp/benefits#2248 Closes cal-itp/benefits#2249 Closes cal-itp/benefits#2390

Type of change

How has this been tested?

$ poetry run dbt run -s +fct_benefits_events
19:38:34  Running with dbt=1.5.1
19:38:35  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
19:38:35  Found 420 models, 950 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
19:38:35  
19:39:53  Concurrency: 8 threads (target='dev')
19:39:53  
19:39:53  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
19:39:54  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.22s]
19:39:54  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
19:40:15  2 of 2 OK created sql table model kegan_mart_benefits.fct_benefits_events ...... [CREATE TABLE (26.9m rows, 73.1 GiB processed) in 20.37s]
19:40:15  
19:40:15  Finished running 1 view model, 1 table model in 0 hours 1 minutes and 39.82 seconds (99.82s).
19:40:15  
19:40:15  Completed successfully
19:40:15  
19:40:15  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

Post-merge follow-ups

Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.

github-actions[bot] commented 2 months ago

Warehouse report 📦

DAG

Legend (in order of precedence)

| Resource type | Indicator | Resolution |
| --- | --- | --- |
| Large table-materialized model | Orange | Make the model incremental |
| Large model without partitioning or clustering | Orange | Add partitioning and/or clustering |
| View with more than one child | Yellow | Materialize as a table or incremental |
| Incremental | Light green | |
| Table | Green | |
| View | White | |

thekaveman commented 1 month ago

> Can you output the logs of dbt run to ensure this works properly? See #3502 for an example of how this is done.

@evansiroky @vevetron I'm following these instructions: https://github.com/cal-itp/data-infra/blob/main/warehouse/README.md

And I have to say, this is just a brutal developer experience...

Does everyone run this on a Mac? I've tried to update the devcontainer to be able to get all this running locally. I got as far as:

But I still get an error when running poetry install at the pygraphviz step:

/workspaces/data-infra/warehouse$ echo $CFLAGS
-I /home/linuxbrew/.linuxbrew/opt/graphviz/include

/workspaces/data-infra/warehouse$ echo $LDFLAGS
-L /home/linuxbrew/.linuxbrew/opt/graphviz/lib

/workspaces/data-infra/warehouse$ poetry install
The currently activated Python version 3.8.17 is not supported by the project (~3.9).
Trying to find and use a compatible version. 
Using python3.9 (3.9.2)
Installing dependencies from lock file

Package operations: 1 install, 0 updates, 0 removals

  - Installing pygraphviz (1.11): Failed

...

creating build/temp.linux-x86_64-cpython-39/pygraphviz
  x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-RNBry6/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -I /home/linuxbrew/.linuxbrew/opt/graphviz/include -fPIC -DSWIG_PYTHON_STRICT_BYTE_CHAR -I/tmp/tmpuiqe4_ep/.venv/include -I/usr/include/python3.9 -c pygraphviz/graphviz_wrap.c -o build/temp.linux-x86_64-cpython-39/pygraphviz/graphviz_wrap.o
  pygraphviz/graphviz_wrap.c:168:11: fatal error: Python.h: No such file or directory
    168 | # include <Python.h>
        |           ^~~~~~~~~~

Any idea how to get this working?
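A `fatal error: Python.h: No such file or directory` during a C-extension build usually means the CPython development headers are not installed for the interpreter doing the build. The sketch below is one way to check that, assuming it is run with the same interpreter Poetry selected (python3.9 in the log above). The Debian package name in the comment is an assumption about the devcontainer's base image, not something confirmed in this thread.

```python
import os
import sysconfig


def python_header_path() -> str:
    """Return the path where this interpreter expects Python.h to live."""
    include_dir = sysconfig.get_paths()["include"]
    return os.path.join(include_dir, "Python.h")


def has_python_headers() -> bool:
    """True if the C development headers for this interpreter are present."""
    return os.path.isfile(python_header_path())


if __name__ == "__main__":
    path = python_header_path()
    if has_python_headers():
        print(f"Found {path}")
    else:
        # Assumption: on Debian/Ubuntu images the headers ship separately,
        # typically in a package such as python3.9-dev.
        print(f"Missing {path}; install the Python development headers")
```

If the header is reported missing, installing the matching development package and re-running poetry install would be the next thing to try.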

thekaveman commented 1 month ago

Alternatively, if you all are already set up to run these dbt commands for verification, that would be really helpful.

vevetron commented 1 month ago

I think everyone who works with dbt right now uses either a local Mac or JupyterHub to run and test changes. Linux should work as well, but I don't think anyone is using devcontainers.

thekaveman commented 1 month ago

Thanks @vevetron. I got ahold of a MacBook and got as far as running poetry run dbt debug, but it gave me this output:

(.venv) kegans-MBP:warehouse kegan$ poetry run dbt debug
20:17:52  Running with dbt=1.5.1
20:17:52  dbt version: 1.5.1
20:17:52  python version: 3.9.6
20:17:52  python path: /Users/kegan/git/data-infra/warehouse/.venv/bin/python
20:17:52  os info: macOS-14.2-arm64-arm-64bit
20:17:52  Using profiles.yml file at /Users/kegan/.dbt/profiles.yml
20:17:52  Using dbt_project.yml file at /Users/kegan/git/data-infra/warehouse/dbt_project.yml
20:17:52  Configuration:
20:17:52  Error importing adapter: No module named 'dbt.adapters.bigquery'
20:17:52    profiles.yml file [ERROR invalid]
20:17:52    dbt_project.yml file [OK found and valid]
20:17:52  Required dependencies:
20:17:52   - git [OK found]

20:17:52  1 check failed:
20:17:52  Profile loading failed for the following reason:
Runtime Error
  Credentials in profile "calitp_warehouse", target "dev" invalid: Runtime Error
    Could not find adapter type bigquery!

My ~/.dbt/profiles.yml file looks like:

calitp_warehouse:
  outputs:
    dev:
      dataproc_batch:
        runtime_config:
          container_image: gcr.io/cal-itp-data-infra/dbt-spark:2023.3.28
          properties:
            spark.dynamicAllocation.maxExecutors: '16'
            spark.executor.cores: '4'
            spark.executor.instances: '4'
            spark.executor.memory: 4g
      dataproc_region: us-west2
      fixed_retries: 1
      gcs_bucket: test-calitp-dbt-python-models
      location: us-west2
      maximum_bytes_billed: 2000000000000
      method: oauth
      priority: interactive
      project: cal-itp-data-infra-staging
      schema: kegan
      submission_method: serverless
      threads: 8
      timeout_seconds: 3000
      type: bigquery
  target: dev

And bq ls has output that seems like I have a connection:

                datasetId                 
 ---------------------------------------- 
  airtable                                
  amplitude                               
  audit                                   
  calitp_py                               
  charlie                                 
  charlie_dbt_test__audit                 
  charlie_gtfs_schedule                   
  charlie_gtfs_views_staging              
  charlie_intermediate                    
  charlie_mart_ad_hoc                     
  charlie_mart_agency_service             
  charlie_mart_feed_aggregator_checks     
  charlie_mart_gtfs                       
  charlie_mart_gtfs_guidelines            
  charlie_mart_gtfs_quality               
  charlie_mart_ntd                        
  charlie_mart_payments                   
  charlie_mart_transit_database           
  charlie_payments                        
  charlie_staging                         
  charlie_views                           
  christian                               
  christian_mart_ad_hoc                   
  christian_mart_audit                    
  christian_mart_benefits                 
  christian_mart_gtfs                     
  christian_mart_gtfs_quality             
  christian_mart_gtfs_schedule_latest     
  christian_mart_ntd                      
  christian_mart_payments                 
  christian_mart_transit_database         
  christian_mart_transit_database_latest  
  christian_staging                       
  ci_staging                              
  eric                                    
  eric_mart_ad_hoc                        
  eric_mart_audit                         
  eric_mart_benefits                      
  eric_mart_gtfs                          
  eric_mart_gtfs_quality                  
  eric_mart_gtfs_schedule_latest          
  eric_mart_ntd                           
  eric_mart_payments                      
  eric_mart_transit_database              
  eric_mart_transit_database_latest       
  eric_payments                           
  eric_staging                            
  eric_views                              
  erika                                   
  erika_dbt_test__audit

Will come back to this a little later and look into it more.

vevetron commented 1 month ago

Your profiles.yml looks exactly the same as mine. My debug statement is almost the same as well.

Maybe retry poetry install, or pip install dbt-bigquery? Or maybe it's running in the wrong environment.
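To tell a missing package apart from a wrong environment, a small check like the following can help. It is a minimal sketch: it prints which interpreter is actually running and whether `dbt.adapters.bigquery` (the module named in the error above) is importable from it. The script name used in the usage note is hypothetical.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the current interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # find_spec imports parent packages first, so a missing parent
        # (for example `dbt` itself) raises instead of returning None.
        return False


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print(
        "dbt.adapters.bigquery importable:",
        module_available("dbt.adapters.bigquery"),
    )
```

Saved as, say, check_adapter.py and run with poetry run python check_adapter.py, the interpreter path should point inside the project's .venv. If it does and the adapter is still not importable, the package is probably missing from that environment rather than the wrong environment being active.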

thekaveman commented 1 month ago

Finally got it running!

I am seeing the same error output that you showed:

$ poetry run dbt run -s +fct_benefits_events
19:32:16  Running with dbt=1.5.1
19:32:16  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
19:32:17  Found 420 models, 950 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
19:32:17  
19:32:20  Concurrency: 8 threads (target='dev')
19:32:20  
19:32:20  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
19:32:21  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.26s]
19:32:21  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
19:32:23  BigQuery adapter: https://console.cloud.google.com/bigquery?project=cal-itp-data-infra-staging&j=bq:us-west2:ee6d3a66-62ef-49c5-818c-709b8d75e98a&page=queryresults
19:32:23  2 of 2 ERROR creating sql table model kegan_mart_benefits.fct_benefits_events .. [ERROR in 2.17s]
19:32:23  
19:32:23  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 6.64 seconds (6.64s).
19:32:23  
19:32:23  Completed with 1 error and 0 warnings:
19:32:23  
19:32:23  Database Error in model fct_benefits_events (models/mart/benefits/fct_benefits_events.sql)
19:32:23    Unrecognized name: event_properties_claims_provider at [158:9]
19:32:23    compiled Code at target/run/calitp_warehouse/models/mart/benefits/fct_benefits_events.sql
19:32:23  
19:32:23  Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2

Will work on getting these corrected.
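When comparing runs like the ones pasted in this thread, the final `Done.` line is the quickest signal. The sketch below parses that line and reports whether a run was clean. It assumes the summary format shown in the dbt 1.5.1 logs here; other dbt versions may print it differently, and the function names are illustrative only.

```python
import re

# Matches the summary line as it appears in the dbt 1.5.1 output in this thread.
SUMMARY_RE = re.compile(
    r"Done\. PASS=(\d+) WARN=(\d+) ERROR=(\d+) SKIP=(\d+) TOTAL=(\d+)"
)


def parse_dbt_summary(log_text: str) -> dict:
    """Extract the PASS/WARN/ERROR/SKIP/TOTAL counts from dbt run output."""
    match = SUMMARY_RE.search(log_text)
    if match is None:
        raise ValueError("no dbt summary line found")
    keys = ("PASS", "WARN", "ERROR", "SKIP", "TOTAL")
    return dict(zip(keys, map(int, match.groups())))


def run_succeeded(log_text: str) -> bool:
    """True when the run reported no errors and skipped nothing."""
    summary = parse_dbt_summary(log_text)
    return summary["ERROR"] == 0 and summary["SKIP"] == 0


if __name__ == "__main__":
    failing = "19:32:23  Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2"
    passing = "19:40:15  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2"
    print(run_succeeded(failing), run_succeeded(passing))
```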

thekaveman commented 1 month ago

@vevetron I updated the PR description with the results of running locally, which is now passing.