dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[Feature] Support Python Stored procedure as a way to submit python models #1086

Open ChenyuLInx opened 6 months ago

ChenyuLInx commented 6 months ago

Is this your first time submitting a feature request?

Describe the feature

BigQuery supports running Spark as a stored procedure (link). We can add a new submission method in dbt-bigquery to run Python models leveraging it.

Basically, dbt would still do Jinja templating as it does today, but when submitting the job it would create/replace a stored procedure and then call it.

There are also some ways to get output from the stored procedure that we might be able to integrate with.
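
For illustration, here is a minimal sketch of what such a submission could look like, based on the BigQuery documentation for Spark stored procedures. Every name (project, dataset, connection, procedure) is a placeholder, and the PySpark body is only illustrative, not what the adapter would actually generate:

-- Sketch only: placeholder names throughout
CREATE OR REPLACE PROCEDURE `my-project.my_dataset.my_python_model__dbt_sp`()
WITH CONNECTION `my-project.us-central1.my-spark-connection`
OPTIONS (engine = 'SPARK', runtime_version = '1.1')
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dbt-python-model").getOrCreate()

# The compiled model body would go here; the result is written back to
# BigQuery so downstream models can ref() it.
df = spark.read.format("bigquery").option("table", "my-project.my_dataset.my_second_dbt_model").load()
df.write.format("bigquery").option("writeMethod", "direct").mode("overwrite").save("my_dataset.my_python_model")
""";

CALL `my-project.my_dataset.my_python_model__dbt_sp`();

dbt would render the CREATE OR REPLACE PROCEDURE statement from the compiled model and then issue the CALL against the Spark connection.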

Describe alternatives you've considered

No response

Who will this benefit?

No response

Are you interested in contributing this feature?

No response

Anything else?

No response

ChenyuLInx commented 6 months ago

I am trying to follow the tutorial here to create a Python stored procedure, and I got stuck trying to create a connection. On the page https://cloud.google.com/bigquery/docs/working-with-connections#bq_1, I am not sure how to create the connection I need to use in the stored procedure.

UPDATE:

Based on this experience, I feel like it is even more opaque than Dataproc Serverless, which took less time to set up.

ChenyuLInx commented 4 months ago

Talked to the BQ team yesterday (thanks!) and got a walk-through of how to give access to the connection service account created by BQ. Will pick this up soon!

velascoluis commented 2 months ago

Hi @ChenyuLInx, I drafted some code here: https://github.com/dbt-labs/dbt-bigquery/compare/main...velascoluis:dbt-bigquery:spark-stored-procedure Can you have a look?

End-to-end setup:

Spark stored procedures as a dbt Python submission method: example run

Setup

Execute the following from a terminal (Cloud Shell, for instance):

Set up environment variables, for example:

export GOOGLE_CLOUD_PROJECT=spark-dbt-test #Change with your own project
export REGION=us-central1 #Change with your own region

Create a bucket to stage code and serve as the Spark staging bucket:

gcloud storage buckets create gs://${GOOGLE_CLOUD_PROJECT}-bucket --location=${REGION}

Create a dataset on BigQuery:

bq mk --location=${REGION} `echo "${GOOGLE_CLOUD_PROJECT}" | tr '-' '_'`_dataset

Create a new service account; it will be used as the main dbt identity:

gcloud iam service-accounts create ${GOOGLE_CLOUD_PROJECT}-sa

Grant permissions to the service account:

#Execute BQ jobs
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/bigquery.user
#Use bigquery.connections.delegate
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/bigquery.connectionAdmin
#Access staging bucket
gcloud storage buckets add-iam-policy-binding gs://${GOOGLE_CLOUD_PROJECT}-bucket --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin
#Create tables on the dataset (dataset-level GRANT, executed as BigQuery SQL)
bq query --use_legacy_sql=false "
GRANT \`roles/bigquery.dataEditor\`
ON SCHEMA \`${GOOGLE_CLOUD_PROJECT}\`.$(echo "${GOOGLE_CLOUD_PROJECT}" | tr '-' '_')_dataset
TO 'serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com'"
#Avoid 'Requesting user cannot act as service account' / 'cross-project service account usage is disabled' errors
gcloud projects add-iam-policy-binding ${GOOGLE_CLOUD_PROJECT} --member=serviceAccount:${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com --role=roles/iam.serviceAccountUser

Create a Spark external connection:

bq mk --connection --location=${REGION} --project_id=${GOOGLE_CLOUD_PROJECT} --connection_type=SPARK ${GOOGLE_CLOUD_PROJECT}-conn

Get the connection service account and grant it permissions:

CONNECTION_SA=`bq show --format=prettyjson --connection ${GOOGLE_CLOUD_PROJECT}.${REGION}.${GOOGLE_CLOUD_PROJECT}-conn | jq -r ."spark"."serviceAccountId"`
#Access pyspark code in bucket
gcloud storage buckets add-iam-policy-binding gs://${GOOGLE_CLOUD_PROJECT}-bucket --member=serviceAccount:${CONNECTION_SA} --role=roles/storage.admin

Create a JSON key for the service account:

gcloud iam service-accounts keys create ${HOME}/${GOOGLE_CLOUD_PROJECT}-sa.json --iam-account=${GOOGLE_CLOUD_PROJECT}-sa@${GOOGLE_CLOUD_PROJECT}.iam.gserviceaccount.com

dbt core installation

Install dbt from source:

git clone https://github.com/velascoluis/dbt-bigquery
cd dbt-bigquery
git checkout spark-stored-procedure
pip3 install .

Init a new dbt project:

cd
dbt init

Enter the requested values in the interactive shell.

Then edit the dbt profiles.yml under $HOME/.dbt and add gcs_bucket and spark_external_connection_name; it should look like this:

spark_dbt_test:
  outputs:
    dev:
      dataset: spark_dbt_test_dataset
      job_execution_timeout_seconds: 300
      job_retries: 1
      keyfile: /home/admin_/spark-dbt-test-sa.json
      location: us-central1
      method: service-account
      priority: interactive
      project: spark-dbt-test
      threads: 1
      type: bigquery
      gcs_bucket: spark-dbt-test-bucket
      spark_external_connection_name: spark-dbt-test-conn
  target: dev

NOTE: additionally, you can specify submission_method: stored_procedure, which is now the default.
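
A minimal sketch of that, assuming the key sits alongside the other new profile fields in the dev target above:

      gcs_bucket: spark-dbt-test-bucket
      spark_external_connection_name: spark-dbt-test-conn
      submission_method: stored_procedure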

Run dbt debug to make sure everything is working OK:

cd dbt_spark_stored_procedure
dbt debug

Run a sample pipeline with Python code

Add a new Python file under project_folder/models/example, for example my_third_python_model.py:

import pandas as pd

def model(dbt, session):
    # ref() the upstream dbt model; on the Spark engine this returns a Spark DataFrame
    second_model = dbt.ref("my_second_dbt_model")
    # Convert it to pandas and return it as this model's result
    df = second_model.toPandas()
    return df

Run the pipeline from the project root:

cd dbt_spark_stored_procedure
dbt run

ChenyuLInx commented 2 months ago

@velascoluis the changeset looks great! I will test it locally today. Do you want to add some tests to get a PR going, or do you want us to take it over?