dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-1391] [Feature] Add support for BigQuery emulator #358

Open bendiktv2 opened 1 year ago

bendiktv2 commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Add support for configuring a custom BigQuery endpoint in the profile/connection, to allow emulating BigQuery.

https://docs.getdbt.com/reference/warehouse-setups/bigquery-setup

In addition to "project", "dataset", etc. in profiles.yml, one could specify an "endpoint" config.

One could then run a BigQuery emulator (such as https://github.com/goccy/bigquery-emulator) instead of having to run against the real BigQuery.

Note that this is already supported by the bq CLI tool.
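A profile for this might look like the sketch below. The `api_endpoint` key is hypothetical (the exact name would depend on the implementation); the remaining fields follow the documented BigQuery profile setup:

```yaml
# ~/.dbt/profiles.yml -- sketch only; "api_endpoint" is a hypothetical key,
# not a documented dbt-bigquery setting at the time of writing.
my_project:
  target: emulator
  outputs:
    emulator:
      type: bigquery
      method: oauth
      project: local        # matches the --project passed to the emulator
      dataset: dataset1
      threads: 1
      api_endpoint: http://0.0.0.0:9050   # point the adapter at the emulator
```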

Describe alternatives you've considered

No response

Who will this benefit?

This will benefit teams that want to run dbt unit tests in CI using https://github.com/mjirv/dbt-datamocktool, allowing multiple branches to run at the same time without conflicting.

Are you interested in contributing this feature?

No response

Anything else?

No response

jtcohen6 commented 1 year ago

@bendiktv2 Thanks for opening!

https://github.com/goccy/bigquery-emulator

This is cool!! I didn't know about GCP emulators at all. Have you been using this for some time? Any gotchas worth calling out?

Note that this is already supported by the bq CLI tool.

I tried looking quickly for docs on this; is it https://cloud.google.com/sdk/gcloud/reference/alpha/emulators?

I have no strong opposition to supporting this — it makes a ton of sense for unit testing, and it's possible because ZetaSQL is published and open source — I'd just want to see us do some light diligence and internal education before documenting this as an officially supported plugin feature. I expect we'll get questions on it from lots of interested folks!

bendiktv2 commented 1 year ago

We have not been using it yet, but we have been considering it for other purposes (like integration-testing a program that loads data into BigQuery), and wanted to check the feasibility of using it with dbt to reduce the time and complexity of our build pipeline.

As for the bq tool, you can see it in use in the goccy/bigquery-emulator README: https://github.com/goccy/bigquery-emulator#2-call-endpoint-from-bq-client:

bq --api http://0.0.0.0:9050 query --project_id=test "SELECT * FROM dataset1.table_a WHERE id = 1"

ustato commented 1 year ago

I'm working on it! https://github.com/ustato/dbt-bigquery

You can confirm that the command dbt run works well: https://github.com/ustato/dbt-bigquery-template

Don’t forget to run this command beforehand:

cd dbt-bigquery-template
poetry install

docker run --rm -p 9050:9050 -v "$(pwd)/emulator.yml:/data/emulator.yml" -it ghcr.io/goccy/bigquery-emulator:latest --project=local --data-from-yaml=/data/emulator.yml

Then you can run this:

cd dbt
poetry run dbt --debug run --project-dir . --profiles-dir .

However, I am now wondering how to address this error... I think it can be reproduced by deleting all the parts about anonymous in emulator.yml.

404 GET http://0.0.0.0:9050/bigquery/v2/projects/local/datasets/anonymous/tables/anonymous?prettyPrint=false: dataset anonymous is not found
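For reference, a minimal `emulator.yml` seeding one project and dataset looks roughly like this; the field names are my best reading of the goccy/bigquery-emulator README, so treat it as a sketch:

```yaml
# emulator.yml -- sketch based on the goccy/bigquery-emulator README,
# loaded via --data-from-yaml as in the docker command above
projects:
- id: local
  datasets:
  - id: dataset1
    tables:
    - id: table_a
      columns:
      - name: id
        type: INTEGER
      data:
      - id: 1
```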

andrewegel commented 1 year ago

Hi there - We too are looking at leveraging dbt-bigquery with the bigquery-emulator. Merely applying the patch https://github.com/ustato/dbt-bigquery/commit/e9a65da67284e39de00e8b3188632427c7a612b6 on top of 1.4.1 and updating our dbt profiles essentially made this work.

The only issues we ran into were missing projects, missing datasets, and not having sufficient seed data in the emulator. @ustato My guess is you haven't told the bigquery-emulator about the anonymous dataset in emulator.yml that your dbt model is referencing. If you share your dbt model code & emulator.yml, I may be able to tell.

Dileep17 commented 1 year ago

Hi, could you please let me know if the BigQuery emulator is supported already? As @jtcohen6 mentioned, this would enable unit testing without connecting to BigQuery.

Dileep17 commented 1 year ago

@andrewegel Glad to hear you got dbt-bigquery working with bigquery-emulator. I'm trying to get the integration working. I applied the patch https://github.com/ustato/dbt-bigquery/commit/e9a65da67284e39de00e8b3188632427c7a612b6 on dbt-bigquery 1.5.0b3. I'm facing the issues below.

  1. When the materialization is view, a create or replace view .... query is sent from dbt-bigquery to bigquery-emulator and the process hangs; bigquery-emulator becomes unresponsive!

  2. When the materialization is table, dbt run fails with a 'NoneType' object has no attribute 'path' error. Though the expected table is created with the correct data, dbt run still fails!

Could you please let me know if you faced the above issues and how you resolved them? If you have a public implementation of this integration, please share the link. Thank you in advance.

mgguo commented 1 year ago

@Dileep17 I ran into the same error you mentioned - 'NoneType' object has no attribute 'path'. Would you mind sharing how you solved it, if you've figured it out? Thanks in advance!

Dileep17 commented 1 year ago

@mgguo I couldn't get that working!

mgguo commented 1 year ago

@Dileep17 Yeah, I think it's because the emulator API doesn't return the same response as the regular GCP BigQuery API, i.e. the destination property is not set here => https://github.com/ustato/dbt-bigquery/blob/9dca9ebcd99f199b685bb35bac7eb32ff8e6624b/dbt/adapters/bigquery/connections.py#L497, which seems to have caused the NoneType error.

@ustato have you had a chance to look back at this patch? Were you able to get it working? Thanks!

andrewegel commented 1 year ago

I did a little sleuthing through the code, and from what I can see it comes from here:

https://github.com/googleapis/python-bigquery/blob/main/google/cloud/bigquery/job/base.py#L693-L702

The (part of) response back from BQ-the-real-service shows:

{
        'configuration': {
                'jobType': 'QUERY',
                'query': {
                        'query': 'create or replace table local.foo.bar AS (...)',
                        'destinationTable': {
                                'projectId': 'local',
                                'datasetId': 'foo',
                                'tableId': 'bar'
                        },
                        'priority': 'INTERACTIVE',
                        'useLegacySql': False
                }
        },

Where-as with the emulator the entire configuration.query.destinationTable block is absent:

{
        'configuration': {
                'jobType': 'QUERY',
                'query': {
                        'query': 'create or replace table local.foo.bar AS (...)',
                        'priority': 'INTERACTIVE',
                        'useLegacySql': False
                },
                'labels': {
                        'dbt_invocation_id': 'aa33cc6a-b2ee-4f07-b0b5-0851a473e30f'
                },
        },

This is likely because BQ-the-real-service supports the create or replace table local.foo.bar AS (...) syntax and populates the destinationTable fields in its response to clients. dbt-bigquery uses the create or replace table local.foo.bar AS (...) syntax to create new models, as opposed to sending destinationTable with a plain select (e.g. the bq command: bq query --destination_table local.foo.bar 'select * from baz.biz.buzz').
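The gap can be illustrated with plain dicts standing in for the two jobs.get responses. `destination_table` below is a hypothetical helper, not the library's actual code, but it mirrors how QueryJob.destination is derived from the job resource:

```python
def destination_table(job_resource):
    """Return the destinationTable dict from a job resource, or None.

    Illustrative helper: mimics how google-cloud-bigquery derives
    QueryJob.destination from configuration.query.destinationTable.
    """
    query_config = job_resource.get("configuration", {}).get("query", {})
    return query_config.get("destinationTable")


# Shape of the real service's response (abridged from above)
real_service = {
    "configuration": {
        "jobType": "QUERY",
        "query": {
            "query": "create or replace table local.foo.bar AS (...)",
            "destinationTable": {
                "projectId": "local",
                "datasetId": "foo",
                "tableId": "bar",
            },
        },
    }
}

# Shape of the emulator's response: destinationTable is absent
emulator = {
    "configuration": {
        "jobType": "QUERY",
        "query": {"query": "create or replace table local.foo.bar AS (...)"},
    }
}

print(destination_table(real_service))  # {'projectId': 'local', 'datasetId': 'foo', 'tableId': 'bar'}
print(destination_table(emulator))      # None -> later attribute access raises the NoneType error
```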

This could either be a feature request in dbt-bigquery to use that pattern (specifying destinationTable in the API call) instead of the create or replace table local.foo.bar AS (...) syntax, or a bigquery-emulator feature request to have the emulator populate destinationTable in the response when it detects a create or replace table statement.

A short-term workaround could be to patch https://github.com/ustato/dbt-bigquery/blob/9dca9ebcd99f199b685bb35bac7eb32ff8e6624b/dbt/adapters/bigquery/connections.py#L497:

            # query_job.destination is None when the emulator omits
            # destinationTable, so fall back to reporting 0 rows
            if query_job.destination:
                num_rows = client.get_table(query_job.destination).num_rows
            else:
                num_rows = 0

This value seems to be used purely for reporting row counts on the console, as the sample dbt project works with this "patch":

17:34:26.439698 [info ] [Thread-1  ]: 1 of 1 OK created sql table model test_interface.sample ........................ [SELECT (0.0 rows, None processed) in 0.17s]

I opened an issue in that project about the behavior difference: https://github.com/goccy/bigquery-emulator/issues/197

OTooleMichael commented 11 months ago

I made a PR for this which is working fine, and is less of a change than the PR above: https://github.com/dbt-labs/dbt-bigquery/pull/1017

Allowing api_endpoint to be set is also useful for more than just the emulator.