dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0
220 stars 155 forks source link

[Bug] BQ quota_project defaults to the account/user's environment default #1347

Closed jcarpenter12 closed 2 weeks ago

jcarpenter12 commented 2 months ago

Is this a new bug in dbt-bigquery?

Current Behavior

When a user or service account's default quota_project_id is set to a project other than the execution_project specified in the profiles.yml DBT will use this as the quota_project_id in the underlying request opposed to anything set in the profiles.yml. This is a side effect and will mean that a DBT will fail to connect to BigQuery if the user does not have BigQuery API enabled in the project their credentials default to. It will also mean that certain quota limits are being used up in the default project.

You can see the default quota project_id from the contents of your application_default_credentials file. It should look something like this. A similar structure applies to the contents of service account keys I believe


{
  "account": "",
  "client_id": "xyz.apps.googleusercontent.com",
  "client_secret": "xyz",
  "quota_project_id": $MY_QUOTA_PROJECT,
  "refresh_token": "xyz",
  "type": "authorized_user",
  "universe_domain": "googleapis.com"
}

This issue arose when running a CI/CD pipeline that is defined in a project separate to either the project or execution_project and did not have the BQ API enabled, so it failed to connect.

Expected Behavior

What I think should be happening according to the docs

As this is a side effect the expected behaviour is a bit difficult to define but DBT should either set execution_project as the quota_project_id in the BQ client or allow the user to override it themselves in the profiles.yml.

In my opinion the user should be able to specify a separate optional quota_project to use separate from project and execution project. By not setting it, it will default to what it does now and not change the behaviour. Feature Request here #1343

Steps To Reproduce

  1. Create a GCP quota Project and do not enable the BQ API in it. This is to ensure there is no quota to use for it
  2. Create a GCP data project, enable BQ API and create a dataset called foo
  3. Create a GCP execution project, enable BQ API
  4. Make sure you user has access to all three of the above with the appropriate BQ permissions
  5. Running your local gcloud run the following gcloud config set project $MY_QUOTA_PROJECT (replace with the name of step 1)
  6. Generate your local oauth credentials gcloud auth application-default login
  7. You should then be able to cat the contents of that file using cat ~/.config/gcloud/application_default_credentials. It should look something like the following
    {
    "account": "",
    "client_id": "xyz.apps.googleusercontent.com",
    "client_secret": "xyz",
    "quota_project_id": $MY_QUOTA_PROJECT,
    "refresh_token": "xyz",
    "type": "authorized_user",
    "universe_domain": "googleapis.com"
    }
  8. Setup a python virtualenv and pip install dbt-core==1.8.6 dbt-bigquery==1.8.2
  9. Create a profiles.yml configuration that contains the following and substitute the names of your GCP projects with those you have created
    default:
    target: dev
    outputs:
    dev:
      type: bigquery
      method: oauth
      schema: foo
      location: europe-west2
      threads: 10
      timeout_seconds: 5000
      priority: interactive
      retries: 1
      project: $MY_DATA_PROJECT
      execution_project: $MY_EXECUTION_PROJECT
  10. Run the following dbt command to run the debug of the connection (update profiles-dir to wherever your profile is located dbt debug --connection --profiles-dir .

You should then see a log output like the below. Here you can see the quota project is causing this connection to fail even though it is not specified anywhere in the dbt profile itself and is instead being picked up from the environment.

Relevant log output

13:24:42  dbt was unable to connect to the specified database.
The database returned the following error:

  >Database Error
  BigQuery API has not been used in project $MY_QUOTA_PROJECT before or it is disabled. Enable it by visiting https://console.developers.google.com/apis/api/bigquery.googleapis.com/overview?project=$MY_QUOTA_PROJECT then retry. If you enabled this API recently, wait a few minutes for the action to propagate to our systems and retry.

Environment

- OS: macOS Ventura 13.2.1
- Python: 3.12.1
- dbt-core: 1.8.6
- dbt-bigquery: 1.8.2

Additional Context

This relates to #1343 a feature request to add quota_project and a potential PR for it here #1345

Docs PR : dbt-labs/docs.getdbt.com#6054

As noted on the #1343 I have not raised a change to a dbt adapter before so please do let me know if I am misunderstanding anything

dataders commented 2 months ago

@jcarpenter12 amazing and thorough write up!

What I still need help on is:

  1. what the expected behavior of your reproducible example should be. Like what should be happening?

    Here you can see the quota project is causing this connection to fail even though it is not specified anywhere in the dbt profile itself and is instead being picked up from the environment

  2. disambiguating:
    1. your reproducible example that highlights that something funny is happening when multiple projects are set up
    2. the behavior discovered that workloads aren't properly attributed to quota projects
jcarpenter12 commented 2 months ago

@jcarpenter12 amazing and thorough write up!

What I still need help on is:

  1. what the expected behavior of your reproducible example should be. Like what should be happening?

    Here you can see the quota project is causing this connection to fail even though it is not specified anywhere in the dbt profile itself and is instead being picked up from the environment

  2. disambiguating:

    1. your reproducible example that highlights that something funny is happening when multiple projects are set up
    2. the behavior discovered that workloads aren't properly attributed to quota projects

Thanks for getting back to me so quickly @dataders.

As for 1. I have updated the expected behaviour section with what I believe should be happening according to the docs.

As for 2. you're right it's not very clear. In my example to prove that it is picking up the quota project from the environment I created a new project that does not have the BQ api enabled in it. This way dbt debug would fail the connection because there is no quota it can access but prove that it is incorrectly trying to use the quota project from the environment rather than from the profiles.yml.

If i had set a project in my gcloud using gcloud config set project $MY_QUOTA_PROJECT that did have the BQ api enabled in it the debug command would pass as it would be able to access the quota project. However, this would only disguise the issue as it would appear to be running okay but actually it's attributing the quota usage to the one set in the user's environment.

Please let me know if these aren't suitable and I'll come up with another way to demonstrate the issue as I see it.

jcarpenter12 commented 2 months ago

Updated the title and contents of the issue. I have been speaking to the maintainers on their slack channel about this and we have managed to narrow down the issue, and the original issue was somewhat misleading

colin-rogers-dbt commented 3 weeks ago

@jcarpenter12 can we close as duplicate of #1343 or will additional work be required for this?

jcarpenter12 commented 3 weeks ago

@jcarpenter12 can we close as duplicate of #1343 or will additional work be required for this?

Hi @colin-rogers-dbt no problem with any of that my side. Raised it originally for what I thought was the bug and then a feature for a potential fix. But largely the feature details it the same way anyway and the bug does just clutter things a bit