dataform-co / dataform

Dataform is a framework for managing SQL based data operations in BigQuery
https://cloud.google.com/dataform/docs
Apache License 2.0

Dataform Authentication Scopes #1869

Open benjaminwestern opened 1 week ago

The problem: say I have the following SQLX file:

config {
    type: "table",
    schema: "intermediate",
    name: "scv",
    tags: ["scv"],
    description: "Single customer view",
}

SELECT
    ID,
    Name,
    Email
FROM
    ${ref("google_sheet_table")}

Running this query in Dataform fails with: bigquery error: Access Denied: BigQuery BigQuery: Permission denied while getting Drive credentials. at [3:1]

This happens because the Application Default Credentials used for the request do not carry the scopes needed to read Google Sheets, even when the service account running the request has the correct level of access to the Sheet.

If I were to perform this operation with a service account that had the correct domain-wide delegation or API access in Google Workspace, and with this scope: 'https://www.googleapis.com/auth/spreadsheets' included in the authentication request, the query would run as intended.
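For comparison, outside Dataform the BigQuery Python client lets the caller ask for extra scopes when loading Application Default Credentials. The following is a minimal sketch, assuming `google-auth` and `google-cloud-bigquery` are installed; `with_drive_scope` and `drive_enabled_client` are hypothetical helper names, and the Drive plus BigQuery scope pair follows Google's documented sample for querying Drive-backed tables rather than anything Dataform exposes today.

```python
DRIVE_SCOPE = "https://www.googleapis.com/auth/drive"
BIGQUERY_SCOPE = "https://www.googleapis.com/auth/bigquery"


def with_drive_scope(scopes):
    """Return `scopes` as a list, with the Drive scope appended if it is absent."""
    result = list(scopes)
    if DRIVE_SCOPE not in result:
        result.append(DRIVE_SCOPE)
    return result


def drive_enabled_client(project=None):
    """Build a BigQuery client whose ADC token is requested with the Drive scope.

    Hypothetical helper: the Google libraries are imported lazily so that
    `with_drive_scope` stays usable without them.
    """
    import google.auth
    from google.cloud import bigquery

    credentials, default_project = google.auth.default(
        scopes=with_drive_scope([BIGQUERY_SCOPE])
    )
    return bigquery.Client(
        credentials=credentials, project=project or default_project
    )
```

A client built this way can query a Sheets-backed table only if the Sheet is also shared with the account behind the credentials; the scope and the sharing are separate requirements.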

In Python I can create the correct credential object without using service account keys as follows:

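# Note: a plain `import requests` here would be shadowed by the google.auth
# transport module of the same name, so only the transport module is imported.
from google.auth import default, iam, impersonated_credentials
from google.auth.transport import requests
from google.oauth2 import service_account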

TOKEN_URI = 'https://accounts.google.com/o/oauth2/token'
SCOPES = ['https://www.googleapis.com/auth/drive','https://www.googleapis.com/auth/admin.directory.user.readonly','https://www.googleapis.com/auth/cloud-platform']

def delegated_credentials(credentials, subject, scopes):
    try:
        # If we are using service account credentials from json file
        # this will work
        updated_credentials = credentials.with_subject(subject).with_scopes(scopes)
    except AttributeError:
        # This exception is raised if we are using default credentials
        request = requests.Request()

        # Refresh the default credentials. This ensures that the information
        # about this account, notably the email, is populated.
        credentials.refresh(request)

        # Create an IAM signer using the default credentials.
        signer = iam.Signer(
            request,
            credentials,
            credentials.service_account_email
        )

        # Create OAuth 2.0 Service Account credentials using the IAM-based
        # signer and the bootstrap_credential's service account email.
        updated_credentials = service_account.Credentials(
            signer,
            credentials.service_account_email,
            TOKEN_URI,
            scopes=scopes,
            subject=subject
        )

    return updated_credentials

def main():
    # The default user or service account credentials need the Service Account
    # Token Creator and Service Account User roles on the target service account.
    credentials, _ = default()
    # The target service account also needs Service Account Token Creator
    # granted on itself, so that it can sign its own requests.
    service_account_with_delegated_admin = '<SERVICE_ACCOUNT_EMAIL>'
    # This user must have permission to perform the required tasks. It is not
    # strictly required here if the Sheet is shared with the service account.
    user_to_impersonate_in_workspaces = '<WORKSPACE_EMAIL>'

    try:
        # Impersonates the service account using the default user/sa
        target_credentials = impersonated_credentials.Credentials(
            source_credentials=credentials,
            target_principal=service_account_with_delegated_admin,
            target_scopes=SCOPES
        )

        # Creates delegated credentials for the Workspace user, using the impersonated SA
        workspaces_credentials = delegated_credentials(target_credentials, user_to_impersonate_in_workspaces, SCOPES)

    except Exception as e:
        print('Failed', e)

if __name__ == "__main__":
    # User / Default > Impersonated SA > Delegated User in Workspaces
    main()
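To check which scopes a resulting access token actually carries, one option is Google's OAuth 2.0 tokeninfo endpoint. This is a rough sketch using only the standard library; `scopes_from_tokeninfo` and `fetch_token_scopes` are hypothetical helper names, and the sketch assumes the endpoint returns the granted scopes as a space-separated `scope` field.

```python
import json
import urllib.parse
import urllib.request

TOKENINFO_URL = "https://oauth2.googleapis.com/tokeninfo"


def scopes_from_tokeninfo(payload):
    """Extract the granted scopes from a decoded tokeninfo response (a dict)."""
    return sorted(payload.get("scope", "").split())


def fetch_token_scopes(access_token, timeout=10):
    """Ask the tokeninfo endpoint which scopes `access_token` was issued with."""
    query = urllib.parse.urlencode({"access_token": access_token})
    url = f"{TOKENINFO_URL}?{query}"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return scopes_from_tokeninfo(json.load(response))
```

With a refreshed credentials object such as `workspaces_credentials`, calling `fetch_token_scopes(workspaces_credentials.token)` should show whether the Drive scope made it into the token.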

I am wondering whether the Dataform Core library will support dynamic authentication scopes, as more customers and teams begin to rely on external tables such as Google Sheets in their workflows.

Currently the simplest workaround is to create a scheduled query that overwrites (WRITE_TRUNCATE) a standard BigQuery table with the content of the Sheet, and then to reference that table in Dataform as a declaration for further operations. Because Dataform does not control when that table is refreshed, this can lead to out-of-order processing, so more assertions and tests are required.
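The same snapshot step could also be driven from Python instead of a scheduled query. The sketch below is one possible shape, assuming `google-cloud-bigquery` is installed and that the `client` passed in already has credentials with Drive access; `snapshot_sql` and `snapshot_sheet_table` are hypothetical names, and the table IDs are supplied by the caller.

```python
def snapshot_sql(source_table):
    """Build the SELECT that reads every row from the Sheets-backed table."""
    return f"SELECT * FROM `{source_table}`"


def snapshot_sheet_table(client, source_table, destination_table):
    """Overwrite `destination_table` with the current rows of `source_table`.

    `client` is expected to be a google.cloud.bigquery.Client; the library is
    imported lazily so `snapshot_sql` can be used on its own.
    """
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(
        destination=destination_table,
        # Replace the previous snapshot instead of appending to it.
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    job = client.query(snapshot_sql(source_table), job_config=job_config)
    return job.result()  # Block until the snapshot has been written.
```

Running this immediately before the Dataform workflow starts narrows the out-of-order window, but it does not remove the need for assertions on the declared table.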