Closed jeremyyeo closed 1 month ago
Thanks for explaining this so well @jeremyyeo 🏆
I'm going to mark this as help_wanted for an interested community member to pick up.
With the following project files, the Python model will run in the project-dataproc-clusters project (from dataproc_cluster_project).
# ~/.dbt/profiles.yml
default:
  target: bq
  outputs:
    bq:
      type: bigquery
      method: service-account
      dataset: project-analytics
      # Note three different project names here:
      project: project-analytics
      execution_project: project-billing
      dataproc_cluster_project: project-dataproc-clusters
      dataproc_region: us-central1
      dataproc_cluster_name: my-dataproc-cluster
      gcs_bucket: my-gcs-bucket
      ...
# models/foo.py
def model(dbt, session):
    dbt.config(submission_method="cluster")
    data = [{"id": 1}]
    return session.createDataFrame(data)
And if models/foo.py is modified like the following, then it will run in the project-dataproc-clusters-2 project:
# models/foo.py
def model(dbt, session):
    dbt.config(submission_method="cluster", dataproc_cluster_project="project-dataproc-clusters-2")
    data = [{"id": 1}]
    return session.createDataFrame(data)
If dataproc_cluster_project is not defined in profiles.yml or within the .py config, then its value should default to project-billing (the execution_project).
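The precedence described above (model config first, then profiles.yml, then execution_project) could be sketched roughly like this. To be clear, resolve_dataproc_project is a hypothetical helper written for illustration, not an existing dbt-bigquery function, and the plain dicts stand in for whatever config objects the adapter actually uses.

```python
def resolve_dataproc_project(model_config: dict, profile: dict) -> str:
    """Pick the GCP project to submit the Dataproc job to.

    Hypothetical helper: the name and the dict-based inputs are
    assumptions for illustration only.
    """
    candidates = (
        # 1. dbt.config(dataproc_cluster_project=...) in the .py model
        model_config.get("dataproc_cluster_project"),
        # 2. dataproc_cluster_project in profiles.yml
        profile.get("dataproc_cluster_project"),
        # 3. fall back to the execution project
        profile.get("execution_project"),
    )
    for candidate in candidates:
        if candidate:
            return candidate
    raise ValueError("no project available for the Dataproc cluster")


profile = {
    "project": "project-analytics",
    "execution_project": "project-billing",
    "dataproc_cluster_project": "project-dataproc-clusters",
}

# Profile-level setting is used when the model does not override it.
from_profile = resolve_dataproc_project({}, profile)

# Model-level config takes priority over the profile.
from_model = resolve_dataproc_project(
    {"dataproc_cluster_project": "project-dataproc-clusters-2"}, profile
)

# With neither set, the execution project is used.
fallback = resolve_dataproc_project(
    {}, {"project": "project-analytics", "execution_project": "project-billing"}
)
```

With the example values from this thread, the three calls resolve to project-dataproc-clusters, project-dataproc-clusters-2 and project-billing respectively.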
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Is this your first time submitting a feature request?
Describe the feature
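Some customers have Dataproc clusters in GCP projects that are neither:
- The output project (the project/database config).
- The execution project (the execution_project config).
This means that if we have something like:
- An output project, project-analytics.
- An execution (billing) project, project-billing.
- A third project, project-dataproc-clusters, which is where my-dataproc-cluster is.
And we have a dbt-bigquery setup like: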
Then our Python model:
Will be submitted to the Dataproc cluster my-dataproc-cluster in GCP project project-billing (which is not where the cluster actually is) due to: https://github.com/dbt-labs/dbt-bigquery/blob/ffe2175440e263e71f122a0b707b51ba2dfeeb54/dbt/adapters/bigquery/python_submissions.py#L98-L104
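To illustrate the behaviour described here without reproducing the linked lines: the sketch below assumes the job submission request carries a project_id taken from execution_project. build_submit_request and the dict-based credentials are hypothetical stand-ins, and the GCS path is a made-up example; only the field names (project_id, region, job, placement.cluster_name, pyspark_job.main_python_file_uri) follow the Dataproc job submission request shape.

```python
def build_submit_request(credentials: dict, gcs_location: str) -> dict:
    """Build a Dataproc submit-job request as a plain dict.

    Hypothetical helper for illustration; not the adapter's actual code.
    """
    return {
        # The project comes from execution_project, so the job is sent
        # to the billing project rather than the cluster's own project.
        "project_id": credentials["execution_project"],
        "region": credentials["dataproc_region"],
        "job": {
            "placement": {"cluster_name": credentials["dataproc_cluster_name"]},
            "pyspark_job": {"main_python_file_uri": gcs_location},
        },
    }


credentials = {
    "project": "project-analytics",
    "execution_project": "project-billing",
    "dataproc_region": "us-central1",
    "dataproc_cluster_name": "my-dataproc-cluster",
}

# The GCS path is a placeholder, not a real location.
request = build_submit_request(credentials, "gs://my-gcs-bucket/foo.py")
```

Under that assumption, request["project_id"] ends up as project-billing even though my-dataproc-cluster lives in project-dataproc-clusters.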
It would be ideal to be able to set, via a config, the GCP project where the Dataproc cluster actually is - perhaps:
Describe alternatives you've considered
I don't think there is any beyond changing execution_project from project-billing to project-dataproc-clusters, which of course changes the billing for SQL models - which we do not want.
Who will this benefit?
Users who have dataproc clusters in projects different to the output project / execution project.
Are you interested in contributing this feature?
No response
Anything else?
No response