dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0

create_external_schema macro doesn't explicitly specify target database #205

Closed Thootje closed 1 year ago

Thootje commented 1 year ago

Describe the bug

We are using DBT Cloud (Team plan) to build a DBT project in Google BigQuery, and we use the dbt-external-tables package to generate external tables for several Google Sheets. On the Team plan you are allowed only 1 project and 1 connection, so to control the target database (the target GCP project, in BigQuery terms) we rely on some Jinja to route the external tables to our dev and prod projects in Google Cloud.

Specifically, we set the database property in the source .yml files to control where the external tables are output when running dbt run-operation stage_external_sources:

database: "{% if target.name == 'dev' %}dbt-project-dev{% elif target.name == 'prod' %}dbt-project-prod{% endif %}"

However, when we execute this command in a job with a target.name of dev, the external tables are still output to the production project, even though our logging clearly shows that target.name is dev. Furthermore, the schema name is generated correctly for each target.name, using exactly the same kind of Jinja conditional as the database property above.

Steps to reproduce

To reproduce, define a source along the following lines and run dbt run-operation stage_external_sources against a dev target. (The original YAML was trimmed from this issue; the definition below is reconstructed from the DDL in the logs, so the source name, sheet URI, and exact option keys are illustrative.)

sources:
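  - name: staging
    database: "{% if target.name == 'dev' %}dbt-project-dev{% elif target.name == 'prod' %}dbt-project-prod{% endif %}"
    schema: "{% if target.name == 'dev' %}dev_staging{% elif target.name == 'prod' %}staging{% endif %}"
    tables:
      - name: customers
        external:
          options:
            format: GOOGLE_SHEETS
            uris:
              - "https://docs.google.com/spreadsheets/d/<redacted>"
            sheet_range: customer
            skip_leading_rows: 1
        columns:
          - name: customer_id
            data_type: STRING
          - name: name
            data_type: STRING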

Expected results

I expect the external table to be generated in the dbt-project-dev project, with a schema name of dev_staging and a table name of customers. Because the dev_staging schema does not exist there yet, the package first has to create that schema before it can create the external table.
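In DDL terms, a dev run should issue statements along these lines (sketched from the logs below; the project-qualified create schema is the part that is currently missing):

    create schema if not exists `dbt-project-dev`.dev_staging;

    create or replace external table `dbt-project-dev`.`dev_staging`.`customers` ( ... );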

Actual results

The dbt run-operation stage_external_sources command fails, reporting that the dbt-project-dev.dev_staging dataset can't be found. Yet when you check the DDL statement it uses to create the table, both the database name and the schema name are correctly generated by the Jinja in the source .yml file; the dataset simply doesn't exist in that project.

create or replace external table dbt-project-dev.dev_staging.customers

Also, a dev_staging dataset is created in the dbt-project-prod project, which is the wrong GCP project.

Screenshots and log output

Complete log after running a DBT job with target.name dev and the command dbt run-operation stage_external_sources (redacted to remove any real project references):

2023-05-03 06:59:18.249084 (MainThread): 06:59:18  Running with dbt=1.4.6
2023-05-03 06:59:18.250069 (MainThread): 06:59:18  running dbt with arguments {'write_json': True, 'use_colors': True, 'printer_width': 80, 'version_check': True, 'partial_parse': True, 'static_parser': True, 'profiles_dir': '/tmp/jobs/146510925/.dbt', 'send_anonymous_usage_stats': True, 'quiet': False, 'no_print': False, 'cache_selected_only': False, 'profile': 'user', 'target': 'dev', 'macro': 'stage_external_sources', 'args': '{}', 'which': 'run-operation', 'rpc_method': 'run-operation', 'indirect_selection': 'eager'}
2023-05-03 06:59:18.250390 (MainThread): 06:59:18  Tracking: tracking
2023-05-03 06:59:18.250935 (MainThread): 06:59:18  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae15bac0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae1611c0>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae1610a0>]}
2023-05-03 06:59:18.280578 (MainThread): 06:59:18  checksum: aba61efde97930ba02b80b791b54ff02a8bbe2330ef80fbc7edfe5ea7aa041b2, vars: {}, profile: user, target: dev, version: 1.4.6
2023-05-03 06:59:18.282337 (MainThread): 06:59:18  Unable to do partial parsing because saved manifest not found. Starting full parse.
2023-05-03 06:59:18.282687 (MainThread): 06:59:18  Sending event: {'category': 'dbt', 'action': 'partial_parser', 'label': '3bc21310-2c67-44e4-802f-6c75a94b4f1d', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae197370>]}
2023-05-03 06:59:19.642954 (MainThread): 06:59:19  1699: static parser successfully parsed staging/stg_customers.sql
2023-05-03 06:59:20.774045 (MainThread): 06:59:20  Sending event: {'category': 'dbt', 'action': 'load_project', 'label': '3bc21310-2c67-44e4-802f-6c75a94b4f1d', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdade23a90>]}
2023-05-03 06:59:20.793298 (MainThread): 06:59:20  Sending event: {'category': 'dbt', 'action': 'runnable_timing', 'label': '3bc21310-2c67-44e4-802f-6c75a94b4f1d', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdade5e0d0>]}
2023-05-03 06:59:20.794257 (MainThread): 06:59:20  Acquiring new bigquery connection 'macro_stage_external_sources'
2023-05-03 06:59:20.812945 (MainThread): 06:59:20  1 of 5 START external source dev_staging.customers
2023-05-03 06:59:20.819318 (MainThread): 06:59:20  On "macro_stage_external_sources": cache miss for schema "dbt-project-dev.dev_staging", this is inefficient
2023-05-03 06:59:20.819708 (MainThread): 06:59:20  Opening a new connection, currently in state init
2023-05-03 06:59:21.068162 (MainThread): 06:59:21  1 of 5 (1) create schema if not exists dev_staging
2023-05-03 06:59:21.076403 (MainThread): 06:59:21  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "user", "target_name": "dev", "connection_name": "macro_stage_external_sources"} */

        create schema if not exists dev_staging

2023-05-03 06:59:23.726538 (MainThread): 06:59:23  BigQuery adapter: https://console.cloud.google.com/bigquery?project=*******&j=bq:**:********&page=queryresults
2023-05-03 06:59:23.727188 (MainThread): 06:59:23  1 of 5 (1) None (0 processed)
2023-05-03 06:59:23.727726 (MainThread): 06:59:23  1 of 5 (2) create or replace external table `dbt-project-dev`.`dev_staging`.`customers`
2023-05-03 06:59:23.728751 (MainThread): 06:59:23  On macro_stage_external_sources: /* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "user", "target_name": "dev", "connection_name": "macro_stage_external_sources"} */

    create or replace external table `dbt-project-dev`.`dev_staging`.`customers`(

                customer_id STRING,
                name STRING)

        options (
            uris = ['https://docs.google.com/spreadsheets/d/*********************'], format = 'GOOGLE_SHEETS', sheet_range = 'customer', skip_leading_rows = 1)

2023-05-03 06:59:24.017728 (MainThread): 06:59:24  BigQuery adapter: Unhandled error while running:
/* {"app": "dbt", "dbt_version": "1.4.6", "profile_name": "user", "target_name": "dev", "connection_name": "macro_stage_external_sources"} */

    create or replace external table `dbt-project-dev`.`dev_staging`.`customers`(

                customer_id STRING,
                name STRING)

        options (
            uris = ['https://docs.google.com/spreadsheets/d/*********************'], format = 'GOOGLE_SHEETS', sheet_range = 'customer', skip_leading_rows = 1)

2023-05-03 06:59:24.018634 (MainThread): 06:59:24  BigQuery adapter: 404 Not found: Dataset dbt-project-dev:dev_staging was not found in location EU

Location: EU
Job ID: ***************

2023-05-03 06:59:24.019046 (MainThread): 06:59:24  BigQuery adapter: Unhandled error while running:
macro stage_external_sources
2023-05-03 06:59:24.019318 (MainThread): 06:59:24  BigQuery adapter: Runtime Error
  404 Not found: Dataset dbt-project-dev:dev_staging was not found in location EU

  Location: EU
  Job ID: ***************

2023-05-03 06:59:24.020140 (MainThread): 06:59:24  Encountered an error while running operation: Runtime Error
  404 Not found: Dataset dbt-project-dev:dev_staging was not found in location EU

  Location: EU
  Job ID: ***************

2023-05-03 06:59:24.023904 (MainThread): 06:59:24  Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 222, in exception_handler
    yield
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 662, in _retry_and_handle
    return retry.retry_target(
  File "/usr/local/lib/python3.8/dist-packages/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 445, in fn
    return self._query_and_results(
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 647, in _query_and_results
    iterator = query_job.result(timeout=job_execution_timeout)
  File "/usr/local/lib/python3.8/dist-packages/google/cloud/bigquery/job/query.py", line 1520, in result
    do_get_result()
  File "/usr/local/lib/python3.8/dist-packages/google/api_core/retry.py", line 349, in retry_wrapped_func
    return retry_target(
  File "/usr/local/lib/python3.8/dist-packages/google/api_core/retry.py", line 191, in retry_target
    return target()
  File "/usr/local/lib/python3.8/dist-packages/google/cloud/bigquery/job/query.py", line 1510, in do_get_result
    super(QueryJob, self).result(retry=retry, timeout=timeout)
  File "/usr/local/lib/python3.8/dist-packages/google/cloud/bigquery/job/base.py", line 911, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/google/api_core/future/polling.py", line 261, in result
    raise self._exception
google.api_core.exceptions.NotFound: 404 Not found: Dataset dbt-project-dev:dev_staging was not found in location EU

Location: EU
Job ID: ***************

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/dbt/task/run_operation.py", line 57, in run
    self._run_unsafe()
  File "/usr/local/lib/python3.8/dist-packages/dbt/task/run_operation.py", line 47, in _run_unsafe
    res = adapter.execute_macro(
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/base/impl.py", line 999, in execute_macro
    result = macro_function(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dbt/clients/jinja.py", line 329, in __call__
    return self.call_macro(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dbt/clients/jinja.py", line 256, in call_macro
    return macro(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 220, in macro
  File "/usr/local/lib/python3.8/dist-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dbt/clients/jinja.py", line 329, in __call__
    return self.call_macro(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dbt/clients/jinja.py", line 256, in call_macro
    return macro(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 763, in __call__
    return self._invoke(arguments, autoescape)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 777, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 52, in macro
  File "/usr/local/lib/python3.8/dist-packages/jinja2/sandbox.py", line 393, in call
    return __context.call(__obj, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/jinja2/runtime.py", line 298, in call
    return __obj(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/base/impl.py", line 270, in execute
    return self.connections.execute(sql=sql, auto_begin=auto_begin, fetch=fetch)
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 462, in execute
    query_job, iterator = self.raw_execute(sql, fetch=fetch)
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 453, in raw_execute
    query_job, iterator = self._retry_and_handle(msg=sql, conn=conn, fn=fn)
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 662, in _retry_and_handle
    return retry.retry_target(
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.8/dist-packages/dbt/adapters/bigquery/connections.py", line 256, in exception_handler
    raise DbtRuntimeError(exc_message)
dbt.exceptions.DbtRuntimeError: Runtime Error
  404 Not found: Dataset dbt-project-dev:dev_staging was not found in location EU

  Location: EU
  Job ID: ***************

2023-05-03 06:59:24.024303 (MainThread): 06:59:24  Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'end', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae1c3370>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae4e6b20>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7efdae1c8220>]}
2023-05-03 06:59:24.024673 (MainThread): 06:59:24  Flushing usage events
2023-05-03 06:59:24.047226 (MainThread): 06:59:24  Connection 'macro_stage_external_sources' was properly closed.

System information

The contents of your packages.yml file:

Which database are you using dbt with? Google BigQuery.

The output of dbt --version: In DBT Cloud we are using version 1.4 (latest) for all environments. (We can't run dbt --version in DBT Cloud).

In DBT Core, where we develop most of our models, the output is:

Core:
  - installed: 1.4.5
  - latest:    1.5.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - bigquery: 1.4.3 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: macOS 13.3.1

The output of python --version: Python 3.8.16 (remember the bug occurs in DBT Cloud, so this is from my local setup)

Additional context

This problem seems specific to DBT Cloud single-project setups, where one connection has to serve several GCP projects.

Thootje commented 1 year ago

The above issue probably arises because the create_external_schema macro used for BigQuery projects only specifies the dataset name, and doesn't explicitly state in which GCP project the schema should be created.

This is only an issue for DBT Cloud projects where a single connection is used to output to several GCP projects. When a DBT job runs, the create schema DDL has to infer the GCP project from the DBT Cloud connection, which in our case is the dbt-project-prod GCP project. So even though we specified which database our external table should go to, the create_external_schema macro doesn't take that into account and ignores the database property. The solution might be something along the lines of changing the DDL statement from:

       create schema if not exists {{ source_node.schema }}

to something that carries the source's own database through (note that target.database would still resolve to the connection's default project in DBT Cloud, so it needs to be the database configured on the source):

       create schema if not exists `{{ source_node.database }}`.{{ source_node.schema }}
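Until something like that lands upstream, a project-level macro override might work as a stopgap. This is only a sketch: it assumes create_external_schema is dispatched through the dbt_external_tables namespace, 'my_project' stands in for your own project name, and the fqn handling is illustrative rather than the package's actual implementation.

    # dbt_project.yml — let your project's macros shadow the package's:
    dispatch:
      - macro_namespace: dbt_external_tables
        search_order: ['my_project', 'dbt_external_tables']

    # macros/create_external_schema.sql:
    {% macro bigquery__create_external_schema(source_node) %}
        {# Qualify the dataset with the source's own project so it is not
           created in the connection's default project #}
        {% if source_node.database %}
            {% set fqn = '`' ~ source_node.database ~ '`.' ~ source_node.schema %}
        {% else %}
            {% set fqn = source_node.schema %}
        {% endif %}
        {% set ddl %}
            create schema if not exists {{ fqn }}
        {% endset %}
        {{ return(ddl) }}
    {% endmacro %}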
jeremyyeo commented 1 year ago

Hey @Thootje - thanks for opening this issue. You did correctly identify why this is happening (#195) and we're going to try and fix it via #196.

Thootje commented 1 year ago

@jeremyyeo thanks! Unfortunately I only saw the PR after I posted my issue, but it's great to see it's being worked on!

jeremyyeo commented 1 year ago

Closing in favour of #195