dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[ADAP-903] [Feature] Convert bigquery python library method calls to pure SQL DDL/DML #930

Closed · jeremyyeo closed this issue 6 months ago

jeremyyeo commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Unlike other adapters, the dbt-bigquery adapter is special in that some operations are not performed via a SQL statement but rather by calling a method from the google-cloud-bigquery library.

For example, dropping an existing relation in this adapter calls the delete_table method:

https://github.com/dbt-labs/dbt-bigquery/blob/27ade3df84b040d4140c1870a033b20c2495b5cd/dbt/include/bigquery/macros/relations/drop.sql#L1-L3

https://github.com/dbt-labs/dbt-bigquery/blob/27ade3df84b040d4140c1870a033b20c2495b5cd/dbt/adapters/bigquery/impl.py#L232-L240
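In other words, the drop boils down to a client method call rather than a query job - roughly like this minimal sketch against the public google-cloud-bigquery API (not the adapter's actual code; the project/dataset/table names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")  # placeholder project id

    # The relation is dropped via an API call, not a SQL statement, so no
    # query job is created and nothing surfaces in dbt's debug logs.
    client.delete_table("my-project.my_dataset.my_table", not_found_ok=True)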

Snowflake (as well as Redshift and Postgres) issues straightforward DDL:

drop table ...

https://github.com/dbt-labs/dbt-snowflake/blob/82b2acbf36099ea28a22032e98e95dd888e90bbd/dbt/include/snowflake/macros/relations/drop.sql

https://github.com/dbt-labs/dbt-core/blob/d597b80486ebb409685fafb23d972f5361ddf7fc/core/dbt/include/global_project/macros/relations/table/drop.sql

https://github.com/dbt-labs/dbt-redshift/blob/08625fe99db3bcaac94702719bda42ce168ed0ec/dbt/include/redshift/macros/relations/table/drop.sql

The outcome of this is that it gets quite a bit harder to debug. Recently, I was helping a user debug a potential duplicate in their seeds - i.e. seeds were simply being "inserted into" without first being dropped.

Unless there's a bug, the typical seed operation involves dropping/truncating the existing seed table if it exists and then creating it again with whatever is in the CSV file.

The exercises below use a simple seed (seeds/people.csv) like so:

id,name
1,alice

Let's look at the debug logs for a seed that already exists:

$ dbt --debug seed --target bq

04:30:32  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x110b2ff70>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1127b28e0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1127eb040>]}
04:30:32  Running with dbt=1.6.3
04:30:32  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'log_cache_events': 'False', 'write_json': 'True', 'partial_parse': 'False', 'cache_selected_only': 'False', 'warn_error': 'None', 'version_check': 'True', 'fail_fast': 'False', 'log_path': '/Users/jeremy/src/dbt-basic/logs', 'debug': 'True', 'profiles_dir': '/Users/jeremy/.dbt', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'introspect': 'True', 'static_parser': 'True', 'invocation_command': 'dbt --debug seed --target bq', 'target_path': 'None', 'log_format': 'default', 'send_anonymous_usage_stats': 'True'}
04:30:33  Sending event: {'category': 'dbt', 'action': 'project_id', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x127e84a30>]}
04:30:33  Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x127ebe640>]}
04:30:33  Registered adapter: bigquery=1.6.5
04:30:34  checksum: 8c18be9970938b790a6784a5d9527920af98bf421e0f22a920a12e8d16d2bc2c, vars: {}, profile: , target: bq, version: 1.6.3
04:30:34  Partial parsing not enabled
04:30:35  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.my_dbt_project
04:30:35  Sending event: {'category': 'dbt', 'action': 'load_project', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x128fc00d0>]}
04:30:35  Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x128fc0970>]}
04:30:35  Found 1 seed, 0 sources, 0 exposures, 0 metrics, 390 macros, 0 groups, 0 semantic models
04:30:35  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x128f642e0>]}
04:30:35  
04:30:35  Acquiring new bigquery connection 'master'
04:30:35  Acquiring new bigquery connection 'list_cse-sandbox-319708'
04:30:35  Opening a new connection, currently in state init
04:30:38  Re-using an available connection from the pool (formerly list_cse-sandbox-319708, now list_cse-sandbox-319708_dbt_jyeo)
04:30:38  Opening a new connection, currently in state closed
04:31:50  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x128fc08e0>]}
04:31:50  Concurrency: 1 threads (target='bq')
04:31:50  
04:31:50  Began running node seed.my_dbt_project.people
04:31:50  1 of 1 START seed file dbt_jyeo.people ......................................... [RUN]
04:31:50  Re-using an available connection from the pool (formerly list_cse-sandbox-319708_dbt_jyeo, now seed.my_dbt_project.people)
04:31:50  Began compiling node seed.my_dbt_project.people
04:31:50  Timing info for seed.my_dbt_project.people (compile): 16:31:50.640168 => 16:31:50.640174
04:31:50  Began executing node seed.my_dbt_project.people
04:31:50  Opening a new connection, currently in state closed
04:31:57  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "bq", "node_id": "seed.my_dbt_project.people"} */

    alter table `cse-sandbox-319708`.`dbt_jyeo`.`people` set OPTIONS()

04:31:58  BigQuery adapter: https://console.cloud.google.com/bigquery?project=cse-sandbox-319708&j=bq:US:5d2c4568-1bc9-423d-aa14-a038477e6c96&page=queryresults
04:31:58  Writing runtime SQL for node "seed.my_dbt_project.people"
04:31:58  Timing info for seed.my_dbt_project.people (execute): 16:31:50.641426 => 16:31:58.737573
04:31:58  Sending event: {'category': 'dbt', 'action': 'run_model', 'label': 'fd612852-a7f4-4299-8994-fe4d396a40db', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x128fc08e0>]}
04:31:58  1 of 1 OK loaded seed file dbt_jyeo.people ..................................... [INSERT 1 in 8.10s]
04:31:58  Finished running node seed.my_dbt_project.people
04:31:58  Connection 'master' was properly closed.
04:31:58  Connection 'seed.my_dbt_project.people' was properly closed.
04:31:58  
04:31:58  Finished running 1 seed in 0 hours 1 minutes and 23.55 seconds (83.55s).
04:31:58  Command end result
04:31:58  
04:31:58  Completed successfully
04:31:58  
04:31:58  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
04:31:58  Command `dbt seed` succeeded at 16:31:58.760304 after 86.39 seconds
04:31:58  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x110b2ff70>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x127edba90>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x127ec55b0>]}
04:31:58  Flushing usage events

^ There's basically no indicator as to whether the pre-existing seed table was dropped prior to being recreated or not.

Let's look at the project history from the BQ UI:

[Screenshot: BigQuery project history showing a LOAD job followed by a QUERY job]

There's a LOAD operation - which is presumably the insertion of the data into the table - and then the subsequent QUERY operation, which is the alter table ... set OPTIONS() statement.

It turns out if we go to GCP Logging:

[Screenshot: GCP Logging entry showing the table delete operation]

We do see the delete operation there.
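Piecing this together, the BigQuery seed path presumably amounts to something like the following (a hedged sketch using the google-cloud-bigquery client directly - the exact load call the adapter uses may differ; the project/dataset/table names are taken from the logs above):

    from google.cloud import bigquery
    import pandas as pd

    client = bigquery.Client(project="cse-sandbox-319708")
    table_id = "cse-sandbox-319708.dbt_jyeo.people"

    # 1. API call: invisible in query history and in dbt's debug logs;
    #    it only surfaces in GCP Logging (the audit logs).
    client.delete_table(table_id, not_found_ok=True)

    # 2. Load job: shows up as the LOAD entry in project history,
    #    but again issues no SQL statement.
    df = pd.DataFrame({"id": [1], "name": ["alice"]})
    client.load_table_from_dataframe(df, table_id).result()

    # 3. Query job: the only step that runs actual SQL, and hence the
    #    only one that appears in dbt's debug logs.
    client.query(
        "alter table `cse-sandbox-319708`.`dbt_jyeo`.`people` set OPTIONS()"
    ).result()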

I do think this is a bit cumbersome to debug, though, when we compare it to doing the same thing in Snowflake.

$ dbt --debug seed --target sf

04:40:45  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x103cc2820>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x102d16b50>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1068c4e80>]}
04:40:45  Running with dbt=1.6.3
04:40:45  running dbt with arguments {'printer_width': '80', 'indirect_selection': 'eager', 'write_json': 'True', 'log_cache_events': 'False', 'partial_parse': 'False', 'cache_selected_only': 'False', 'warn_error': 'None', 'fail_fast': 'False', 'debug': 'True', 'log_path': '/Users/jeremy/src/dbt-basic/logs', 'version_check': 'True', 'profiles_dir': '/Users/jeremy/.dbt', 'use_colors': 'True', 'use_experimental_parser': 'False', 'no_print': 'None', 'quiet': 'False', 'warn_error_options': 'WarnErrorOptions(include=[], exclude=[])', 'introspect': 'True', 'static_parser': 'True', 'invocation_command': 'dbt --debug seed --target sf', 'target_path': 'None', 'log_format': 'default', 'send_anonymous_usage_stats': 'True'}
04:40:46  Sending event: {'category': 'dbt', 'action': 'project_id', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1068c4250>]}
04:40:46  Sending event: {'category': 'dbt', 'action': 'adapter_info', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x1189f45b0>]}
04:40:46  Registered adapter: snowflake=1.6.2
04:40:46  checksum: c587c0ffcf737c4305eff99791570bd223af97d11319c6afcfc27af854f5ace9, vars: {}, profile: , target: sf, version: 1.6.3
04:40:46  Partial parsing not enabled
04:40:47  [WARNING]: Configuration paths exist in your dbt_project.yml file which do not apply to any resources.
There are 1 unused configuration paths:
- models.my_dbt_project
04:40:47  Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118a61f10>]}
04:40:47  Sending event: {'category': 'dbt', 'action': 'resource_counts', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118a61f10>]}
04:40:47  Found 1 seed, 0 sources, 0 exposures, 0 metrics, 372 macros, 0 groups, 0 semantic models
04:40:47  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118a613a0>]}
04:40:47  
04:40:47  Acquiring new snowflake connection 'master'
04:40:47  Acquiring new snowflake connection 'list_development'
04:40:47  Using snowflake connection "list_development"
04:40:47  On list_development: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "connection_name": "list_development"} */
show terse schemas in database development
    limit 10000
04:40:47  Opening a new connection, currently in state init
04:40:49  SQL status: SUCCESS 284 in 2.0 seconds
04:40:49  On list_development: Close
04:40:49  Re-using an available connection from the pool (formerly list_development, now list_development_dbt_jyeo)
04:40:49  Using snowflake connection "list_development_dbt_jyeo"
04:40:49  On list_development_dbt_jyeo: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "connection_name": "list_development_dbt_jyeo"} */
show terse objects in development.dbt_jyeo limit 10000
04:40:49  Opening a new connection, currently in state closed
04:40:51  SQL status: SUCCESS 7 in 2.0 seconds
04:40:51  On list_development_dbt_jyeo: Close
04:40:52  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118b9bc10>]}
04:40:52  Concurrency: 1 threads (target='sf')
04:40:52  
04:40:52  Began running node seed.my_dbt_project.people
04:40:52  1 of 1 START seed file dbt_jyeo.people ......................................... [RUN]
04:40:52  Re-using an available connection from the pool (formerly list_development_dbt_jyeo, now seed.my_dbt_project.people)
04:40:52  Began compiling node seed.my_dbt_project.people
04:40:52  Timing info for seed.my_dbt_project.people (compile): 16:40:52.167514 => 16:40:52.167519
04:40:52  Began executing node seed.my_dbt_project.people
04:40:52  Using snowflake connection "seed.my_dbt_project.people"
04:40:52  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "node_id": "seed.my_dbt_project.people"} */
BEGIN
04:40:52  Opening a new connection, currently in state closed
04:40:53  SQL status: SUCCESS 1 in 2.0 seconds
04:40:53  Using snowflake connection "seed.my_dbt_project.people"
04:40:53  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "node_id": "seed.my_dbt_project.people"} */
truncate table "DEVELOPMENT"."DBT_JYEO"."PEOPLE"
  ;
04:40:54  SQL status: SUCCESS 1 in 1.0 seconds
04:40:54  Using snowflake connection "seed.my_dbt_project.people"
04:40:54  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "node_id": "seed.my_dbt_project.people"} */
COMMIT
04:40:54  SQL status: SUCCESS 1 in 0.0 seconds
04:40:54  Using snowflake connection "seed.my_dbt_project.people"
04:40:54  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "node_id": "seed.my_dbt_project.people"} */
BEGIN
04:40:55  SQL status: SUCCESS 1 in 0.0 seconds
04:40:55  Using snowflake connection "seed.my_dbt_project.people"
04:40:55  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "node_id": "seed.my_dbt_project.people"} */
insert into development.dbt_jyeo.people (id, name) values
            (%s,%s)...
04:40:55  SQL status: SUCCESS 1 in 1.0 seconds
04:40:55  Using snowflake connection "seed.my_dbt_project.people"
04:40:55  On seed.my_dbt_project.people: /* {"app": "dbt", "dbt_version": "1.6.3", "profile_name": "all", "target_name": "sf", "node_id": "seed.my_dbt_project.people"} */
COMMIT
04:40:56  SQL status: SUCCESS 1 in 0.0 seconds
04:40:56  Writing runtime SQL for node "seed.my_dbt_project.people"
04:40:56  Timing info for seed.my_dbt_project.people (execute): 16:40:52.169780 => 16:40:56.457809
04:40:56  On seed.my_dbt_project.people: Close
04:40:57  Sending event: {'category': 'dbt', 'action': 'run_model', 'label': '3c83b04d-d95a-4945-9f78-5ab689f60c73', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x119a7c160>]}
04:40:57  1 of 1 OK loaded seed file dbt_jyeo.people ..................................... [INSERT 1 in 4.88s]
04:40:57  Finished running node seed.my_dbt_project.people
04:40:57  Connection 'master' was properly closed.
04:40:57  Connection 'seed.my_dbt_project.people' was properly closed.
04:40:57  
04:40:57  Finished running 1 seed in 0 hours 0 minutes and 9.77 seconds (9.77s).
04:40:57  Command end result
04:40:57  
04:40:57  Completed successfully
04:40:57  
04:40:57  Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1
04:40:57  Command `dbt seed` succeeded at 16:40:57.075582 after 11.58 seconds
04:40:57  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x103cc2820>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118afb1f0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x118ab1e80>]}
04:40:57  Flushing usage events

^ From the one spot (the debug logs) we can see exactly all the DDL involved in the seed.

Describe alternatives you've considered

I suppose a user could write their own override:

{% macro bigquery__drop_relation(relation) -%}
    {# issue real SQL so the drop is visible in the debug logs #}
    {% do run_query("drop table if exists " ~ relation) %}
{% endmacro %}
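With an override like that in the project's macros/ folder, the drop would presumably run as a regular query job and therefore show up both in dbt's debug logs and in the BigQuery query history.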

Who will this benefit?

Folks who want to debug straight from the debug logs. By executing plain SQL (drop table ... / truncate table ...), it becomes more obvious what operations dbt is performing for any type of node, consistently across all adapters (at least the "major" ones), rather than needing the specialised knowledge to go digging into separate tools like GCP Logging.

Are you interested in contributing this feature?

Sure

Anything else?

This is kind of the second time I've thought about this - the first time was in https://github.com/dbt-labs/dbt-bigquery/issues/886, where I wasn't actually seeing the drop table ... statements when looking at a customer's debug log - but that's because those drop operations were not actually invoked via SQL statements.

Top of mind is just this "relation dropping" operation, but there may be others - I haven't looked too deeply yet.

github-actions[bot] commented 7 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 6 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.