Closed · thekaveman closed this 1 month ago
Warehouse report 📦
Legend (in order of precedence)
| Resource type | Indicator | Resolution |
|---|---|---|
| Large table-materialized model | Orange | Make the model incremental |
| Large model without partitioning or clustering | Orange | Add partitioning and/or clustering |
| View with more than one child | Yellow | Materialize as a table or incremental |
| Incremental | Light green | |
| Table | Green | |
| View | White | |
Can you output the logs of `dbt run` to ensure this works properly? See #3502 for an example of how this is done.
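A sketch of one way to capture those logs, assuming the command runs from the `warehouse/` directory inside the poetry environment (the model selector is the one used later in this thread):

```bash
# Run dbt through poetry and save the full console output to a file
# that can be pasted into the PR; `tee` prints and saves at once.
poetry run dbt run -s +fct_benefits_events 2>&1 | tee dbt_run.log
```

dbt also writes a more detailed log to `logs/dbt.log` under the project directory.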
@evansiroky @vevetron I'm following these instructions: https://github.com/cal-itp/data-infra/blob/main/warehouse/README.md
And I have to say, this is just a brutal developer experience...

- `poetry` installed to be able to run `poetry install`
- `graphviz` installed to be able to install `pygraphviz` (from `poetry install`)
- `brew`, which is a MacOS tool. I'm on Linux.
- `devcontainer` config

Does everyone run this on a Mac? I've tried to update the `devcontainer` to be able to get all this running locally. I got as far as:

- `poetry` and `brew`
- `brew install graphviz`
- the `export CFLAGS...` and `export LDFLAGS` mentioned in the above README

But I still get an error when running `poetry install` at the `pygraphviz` step:
```
/workspaces/data-infra/warehouse$ echo $CFLAGS
-I /home/linuxbrew/.linuxbrew/opt/graphviz/include
/workspaces/data-infra/warehouse$ echo $LDFLAGS
-L /home/linuxbrew/.linuxbrew/opt/graphviz/lib
/workspaces/data-infra/warehouse$ poetry install
The currently activated Python version 3.8.17 is not supported by the project (~3.9).
Trying to find and use a compatible version.
Using python3.9 (3.9.2)
Installing dependencies from lock file

Package operations: 1 install, 0 updates, 0 removals

  - Installing pygraphviz (1.11): Failed
...
  creating build/temp.linux-x86_64-cpython-39/pygraphviz
  x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -ffile-prefix-map=/build/python3.9-RNBry6/python3.9-3.9.2=. -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -I /home/linuxbrew/.linuxbrew/opt/graphviz/include -fPIC -DSWIG_PYTHON_STRICT_BYTE_CHAR -I/tmp/tmpuiqe4_ep/.venv/include -I/usr/include/python3.9 -c pygraphviz/graphviz_wrap.c -o build/temp.linux-x86_64-cpython-39/pygraphviz/graphviz_wrap.o
  pygraphviz/graphviz_wrap.c:168:11: fatal error: Python.h: No such file or directory
    168 | # include <Python.h>
        |           ^~~~~~~~~~
```
Any idea how to get this working?
Alternatively, if you all are already set up to run these dbt commands for verification, that would be really helpful.
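For reference, `fatal error: Python.h: No such file or directory` usually means the CPython development headers are missing, so C extensions like `pygraphviz` cannot compile. A minimal sketch of a possible fix, assuming the devcontainer is Debian-based and uses the Python 3.9.2 shown above:

```bash
# Install the header files (Python.h) for the interpreter poetry selected;
# on Debian/Ubuntu they ship in the python3.x-dev package.
sudo apt-get update
sudo apt-get install -y python3.9-dev

# Then retry the failing step.
poetry install
```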
I think everyone who works with dbt right now either uses a local Mac or JupyterHub to run and test changes. Linux should work as well, but I don't think anyone is using devcontainers.
Thanks @vevetron. I got a hold of a MacBook and got as far as running `poetry run dbt debug`, but it gave me this output:
```
(.venv) kegans-MBP:warehouse kegan$ poetry run dbt debug
20:17:52  Running with dbt=1.5.1
20:17:52  dbt version: 1.5.1
20:17:52  python version: 3.9.6
20:17:52  python path: /Users/kegan/git/data-infra/warehouse/.venv/bin/python
20:17:52  os info: macOS-14.2-arm64-arm-64bit
20:17:52  Using profiles.yml file at /Users/kegan/.dbt/profiles.yml
20:17:52  Using dbt_project.yml file at /Users/kegan/git/data-infra/warehouse/dbt_project.yml
20:17:52  Configuration:
20:17:52    Error importing adapter: No module named 'dbt.adapters.bigquery'
20:17:52    profiles.yml file [ERROR invalid]
20:17:52    dbt_project.yml file [OK found and valid]
20:17:52  Required dependencies:
20:17:52   - git [OK found]
20:17:52  1 check failed:
20:17:52  Profile loading failed for the following reason:
Runtime Error
  Credentials in profile "calitp_warehouse", target "dev" invalid: Runtime Error
    Could not find adapter type bigquery!
```
My `~/.dbt/profiles.yml` file looks like:
```yaml
calitp_warehouse:
  outputs:
    dev:
      dataproc_batch:
        runtime_config:
          container_image: gcr.io/cal-itp-data-infra/dbt-spark:2023.3.28
          properties:
            spark.dynamicAllocation.maxExecutors: '16'
            spark.executor.cores: '4'
            spark.executor.instances: '4'
            spark.executor.memory: 4g
      dataproc_region: us-west2
      fixed_retries: 1
      gcs_bucket: test-calitp-dbt-python-models
      location: us-west2
      maximum_bytes_billed: 2000000000000
      method: oauth
      priority: interactive
      project: cal-itp-data-infra-staging
      schema: kegan
      submission_method: serverless
      threads: 8
      timeout_seconds: 3000
      type: bigquery
  target: dev
```
And `bq ls` has output that seems like I have a connection:
```
                datasetId
 ----------------------------------------
  airtable
  amplitude
  audit
  calitp_py
  charlie
  charlie_dbt_test__audit
  charlie_gtfs_schedule
  charlie_gtfs_views_staging
  charlie_intermediate
  charlie_mart_ad_hoc
  charlie_mart_agency_service
  charlie_mart_feed_aggregator_checks
  charlie_mart_gtfs
  charlie_mart_gtfs_guidelines
  charlie_mart_gtfs_quality
  charlie_mart_ntd
  charlie_mart_payments
  charlie_mart_transit_database
  charlie_payments
  charlie_staging
  charlie_views
  christian
  christian_mart_ad_hoc
  christian_mart_audit
  christian_mart_benefits
  christian_mart_gtfs
  christian_mart_gtfs_quality
  christian_mart_gtfs_schedule_latest
  christian_mart_ntd
  christian_mart_payments
  christian_mart_transit_database
  christian_mart_transit_database_latest
  christian_staging
  ci_staging
  eric
  eric_mart_ad_hoc
  eric_mart_audit
  eric_mart_benefits
  eric_mart_gtfs
  eric_mart_gtfs_quality
  eric_mart_gtfs_schedule_latest
  eric_mart_ntd
  eric_mart_payments
  eric_mart_transit_database
  eric_mart_transit_database_latest
  eric_payments
  eric_staging
  eric_views
  erika
  erika_dbt_test__audit
```
Will come back to this a little later and look into it more.
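A small aside: `bq ls` only shows that the local gcloud credentials can list datasets. A sketch of a slightly stronger connectivity check, assuming the same staging project as in the profile above:

```bash
# Run a trivial query to confirm the credentials can actually execute
# BigQuery jobs, not just list datasets.
bq query --project_id=cal-itp-data-infra-staging --use_legacy_sql=false 'SELECT 1 AS ok'
```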
Your `profiles.yml` looks exactly the same as mine. My debug statement is almost the same as well.

Maybe retry `poetry install`? Or `pip install dbt-bigquery`? Or maybe it's running the wrong environment.
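On the "wrong environment" theory, a sketch of how one might check which environment dbt is actually coming from (assuming the commands run from `warehouse/`):

```bash
# Show the virtualenv poetry manages for this project
poetry env info --path

# Confirm the dbt executable and Python interpreter come from that venv
poetry run which dbt
poetry run which python

# Check whether the BigQuery adapter is importable in that environment
poetry run python -c "import dbt.adapters.bigquery; print('adapter found')"

# If the import fails, install the adapter matching the pinned dbt version
poetry run pip install 'dbt-bigquery==1.5.*'
```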
Finally got it running!
I am seeing the same error output that you showed:
```
$ poetry run dbt run -s +fct_benefits_events
19:32:16  Running with dbt=1.5.1
19:32:16  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.calitp_warehouse.mart.ad_hoc
19:32:17  Found 420 models, 950 tests, 0 snapshots, 0 analyses, 852 macros, 0 operations, 12 seed files, 175 sources, 4 exposures, 0 metrics, 0 groups
19:32:17
19:32:20  Concurrency: 8 threads (target='dev')
19:32:20
19:32:20  1 of 2 START sql view model kegan_staging.stg_amplitude__benefits_events ....... [RUN]
19:32:21  1 of 2 OK created sql view model kegan_staging.stg_amplitude__benefits_events .. [CREATE VIEW (0 processed) in 1.26s]
19:32:21  2 of 2 START sql table model kegan_mart_benefits.fct_benefits_events ........... [RUN]
19:32:23  BigQuery adapter: https://console.cloud.google.com/bigquery?project=cal-itp-data-infra-staging&j=bq:us-west2:ee6d3a66-62ef-49c5-818c-709b8d75e98a&page=queryresults
19:32:23  2 of 2 ERROR creating sql table model kegan_mart_benefits.fct_benefits_events .. [ERROR in 2.17s]
19:32:23
19:32:23  Finished running 1 view model, 1 table model in 0 hours 0 minutes and 6.64 seconds (6.64s).
19:32:23
19:32:23  Completed with 1 error and 0 warnings:
19:32:23
19:32:23  Database Error in model fct_benefits_events (models/mart/benefits/fct_benefits_events.sql)
19:32:23    Unrecognized name: event_properties_claims_provider at [158:9]
19:32:23    compiled Code at target/run/calitp_warehouse/models/mart/benefits/fct_benefits_events.sql
19:32:23
19:32:23  Done. PASS=1 WARN=0 ERROR=1 SKIP=0 TOTAL=2
```
Will work on getting these corrected.
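When chasing an `Unrecognized name` error like this, one option is to read the SQL dbt actually executed, since the `[158:9]` position in the error refers to the compiled file named in the log:

```bash
# The failed run already wrote the compiled SQL it sent to BigQuery;
# inspect the lines around the position the error points at.
sed -n '150,165p' target/run/calitp_warehouse/models/mart/benefits/fct_benefits_events.sql

# After editing the model, re-compile without running against BigQuery
poetry run dbt compile -s fct_benefits_events
```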
@vevetron I updated the PR description with the results of running locally, which is now passing.
Description
We recently completed a big refactor of the models in Benefits; see cal-itp/benefits#1666 for more background.
The last piece of this refactor is updating our new and historic analytics events. The following PRs update the logic for generating new events:
And this PR is for the warehouse side, to handle the new fields and adjust historical data already captured in GCS.
We don't want to merge this PR until all of the above PRs are merged and released to our `prod` environment.

Closes cal-itp/benefits#2247
Closes cal-itp/benefits#2248
Closes cal-itp/benefits#2249
Closes cal-itp/benefits#2390
Type of change
How has this been tested?
Post-merge follow-ups
Document any actions that must be taken post-merge to deploy or otherwise implement the changes in this PR (for example, running a full refresh of some incremental model in dbt; a sketch of such a refresh follows below). If these actions will take more than a few hours after the merge or if they will be completed by someone other than the PR author, please create a dedicated follow-up issue and link it here to track resolution.
`eligibility_verifier`, and update to the new values
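Since the follow-ups mention a possible full refresh of an incremental model, a sketch of what that typically looks like (the model selector here is a placeholder, not a model named in this PR):

```bash
# --full-refresh drops and rebuilds an incremental model from scratch,
# so historical rows are reprocessed with the new logic.
poetry run dbt run -s <some_incremental_model> --full-refresh
```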