dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[ADAP-887] [Feature] Support directly specifying the GCP project for python models #925

Closed jeremyyeo closed 1 month ago

jeremyyeo commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Some customers have Dataproc clusters in GCP projects that are neither the output project (project) nor the execution project (execution_project).

This means that if we have a dbt-bigquery setup like:

# ~/.dbt/profiles.yml
default:
  target: bq
  outputs:
    bq:
      type: bigquery
      method: service-account
      project: project-analytics
      dataset: project-analytics
      execution_project: project-billing
      dataproc_region: us-central1
      dataproc_cluster_name: my-dataproc-cluster
      gcs_bucket: my-gcs-bucket
...

Then our Python model:

# models/foo.py
def model(dbt, session):
    dbt.config(submission_method="cluster")
    data = [{"id": 1}]
    return session.createDataFrame(data)

Will be submitted to the dataproc cluster my-dataproc-cluster in GCP project project-billing (which is not where the cluster actually is) due to:

https://github.com/dbt-labs/dbt-bigquery/blob/ffe2175440e263e71f122a0b707b51ba2dfeeb54/dbt/adapters/bigquery/python_submissions.py#L98-L104
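
The linked lines are not reproduced here. As a rough sketch of the behaviour described above, assume the adapter builds a Dataproc job submission request whose project_id comes from execution_project. The helper name build_submit_request, the plain-dict credentials, and the gs:// path are hypothetical and only for illustration; the project_id, region, and job fields follow the shape of a Dataproc SubmitJobRequest.

```python
# Hypothetical helper: mirrors the behaviour described in this issue,
# not the adapter's actual code.
def build_submit_request(credentials: dict, main_python_file_uri: str) -> dict:
    job = {
        "placement": {"cluster_name": credentials["dataproc_cluster_name"]},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    return {
        # The cluster is looked up in the billing project, which is the
        # problem when the cluster lives in a different GCP project.
        "project_id": credentials["execution_project"],
        "region": credentials["dataproc_region"],
        "job": job,
    }


credentials = {
    "project": "project-analytics",
    "execution_project": "project-billing",
    "dataproc_region": "us-central1",
    "dataproc_cluster_name": "my-dataproc-cluster",
}

# The GCS path below is a made-up example value.
request = build_submit_request(credentials, "gs://my-gcs-bucket/foo.py")
print(request["project_id"])  # project-billing
```

With the profile from this issue, the request targets project-billing even though my-dataproc-cluster lives elsewhere.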

It would be ideal to be able to set, via a config, the GCP project where the Dataproc cluster actually is - perhaps:

# models/foo.py
def model(dbt, session):
    dbt.config(submission_method="cluster", dataproc_cluster_project="project-dataproc-clusters")
    data = [{"id": 1}]
    return session.createDataFrame(data)

Describe alternatives you've considered

I don't think there are any, beyond changing execution_project from project-billing to project-dataproc-clusters, which of course changes the billing for SQL models - something we do not want.

Who will this benefit?

Users who have dataproc clusters in projects different to the output project / execution project.

Are you interested in contributing this feature?

No response

Anything else?

No response

dbeatty10 commented 1 year ago

Thanks for explaining this so well @jeremyyeo 🏆

I'm going to mark this as help_wanted for an interested community member to pick up.

dbeatty10 commented 1 year ago

Acceptance criteria

With the following project files, the python model will run in the project-dataproc-clusters project (from dataproc_cluster_project).

# ~/.dbt/profiles.yml
default:
  target: bq
  outputs:
    bq:
      type: bigquery
      method: service-account
      dataset: project-analytics

      # Note three different project names here:
      project: project-analytics
      execution_project: project-billing
      dataproc_cluster_project: project-dataproc-clusters

      dataproc_region: us-central1
      dataproc_cluster_name: my-dataproc-cluster
      gcs_bucket: my-gcs-bucket
...

# models/foo.py
def model(dbt, session):
    dbt.config(submission_method="cluster")
    data = [{"id": 1}]
    return session.createDataFrame(data)

And if models/foo.py is modified like the following, then it will run in the project-dataproc-clusters-2 project:

# models/foo.py
def model(dbt, session):
    dbt.config(submission_method="cluster", dataproc_cluster_project="project-dataproc-clusters-2")
    data = [{"id": 1}]
    return session.createDataFrame(data)

If dataproc_cluster_project is not defined in profiles.yml or within the .py config, then its value should default to project-billing (the execution_project).
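
The precedence in these acceptance criteria could be sketched roughly as follows. This is only an illustration, assuming the model config and the profile target are available as plain dicts; the function name resolve_dataproc_project is hypothetical and does not exist in dbt-bigquery.

```python
from typing import Any, Mapping


def resolve_dataproc_project(
    model_config: Mapping[str, Any], profile: Mapping[str, Any]
) -> str:
    """Pick the GCP project used for Dataproc submission (hypothetical helper).

    Precedence, per the acceptance criteria:
    1. dataproc_cluster_project set in the model's dbt.config(...)
    2. dataproc_cluster_project set in profiles.yml
    3. execution_project from profiles.yml
    """
    return (
        model_config.get("dataproc_cluster_project")
        or profile.get("dataproc_cluster_project")
        or profile["execution_project"]
    )


profile = {
    "project": "project-analytics",
    "execution_project": "project-billing",
    "dataproc_cluster_project": "project-dataproc-clusters",
}

# No model-level override: the profile value is used.
from_profile = resolve_dataproc_project({"submission_method": "cluster"}, profile)

# Model-level override takes priority over the profile.
from_model = resolve_dataproc_project(
    {"dataproc_cluster_project": "project-dataproc-clusters-2"}, profile
)

# Neither is set: fall back to execution_project.
profile_without = {k: v for k, v in profile.items() if k != "dataproc_cluster_project"}
from_default = resolve_dataproc_project({}, profile_without)
```

Using `or` means an empty string is treated the same as an unset value, which seems reasonable for a project ID but is a design choice rather than something the issue specifies.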

github-actions[bot] commented 1 month ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 month ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.