dbt-labs / dbt-bigquery

dbt-bigquery contains all of the code required to make dbt operate on a BigQuery database.
https://github.com/dbt-labs/dbt-bigquery
Apache License 2.0

[CT-2158] [Feature] Support Workload Identity Federation for Headless Authentication into BigQuery #549

Open ernestoongaro opened 1 year ago

ernestoongaro commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Traditionally, applications running outside Google Cloud can use service account keys to access Google Cloud resources. However, service account keys are powerful credentials, and can present a security risk if they are not managed correctly.

With identity federation, you can use Identity and Access Management (IAM) to grant external identities IAM roles, including the ability to impersonate service accounts. This approach eliminates the maintenance and security burden associated with service account keys.
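
For background on how this looks in practice: Google's workload identity federation is driven by an "external account" credential-configuration JSON that points at a workload identity pool provider and, optionally, a service account to impersonate. Below is a minimal Python sketch that assembles such a config; every identifier is a placeholder, and real configs are normally generated with `gcloud iam workload-identity-pools create-cred-config` rather than written by hand.

```python
def build_wif_config(project_number: str, pool_id: str, provider_id: str,
                     sa_email: str, token_file: str) -> dict:
    """Assemble an external_account credential config in the shape GCP documents.

    All identifiers here are placeholders for illustration only.
    """
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "type": "external_account",
        "audience": audience,
        # The external token (e.g. from Azure AD) is an OIDC JWT.
        "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
        "token_url": "https://sts.googleapis.com/v1/token",
        # After the STS exchange, impersonate the target service account.
        "service_account_impersonation_url": (
            "https://iamcredentials.googleapis.com/v1/projects/-/"
            f"serviceAccounts/{sa_email}:generateAccessToken"
        ),
        # Where the ambient OIDC token is read from on disk.
        "credential_source": {"file": token_file},
    }
```

Client libraries such as google-auth can load a config like this (for example via `GOOGLE_APPLICATION_CREDENTIALS`) and perform the token exchange transparently, so no long-lived key ever exists.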

Describe alternatives you've considered

Oauth is fine for developer authentication, but not great for something that will be scheduling the runs (like dbt Cloud)

Who will this benefit?

Any security-conscious GCP users

Are you interested in contributing this feature?

No response

Anything else?

Specifically, this request is for use with Azure AD (which is OIDC compliant), but other schemes are supported as well.

ernestoongaro commented 1 year ago

As a very good workaround to this, it is recommended that you upload service account keys to GCP and dbt Cloud, and aggressively rotate them: https://youtu.be/PAOb2hl__08?t=703

dbeatty10 commented 1 year ago

Thanks for opening this and also supplying a "very good workaround" @ernestoongaro !

What would you imagine the profiles.yml entry to look like for this new connection method?

my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: azure-ad

      # Identity federation for Azure AD auth
      some_key: [some_value]
      some_other_key: [some_other_value]
      ...

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

ernestoongaro commented 10 months ago

This is being tracked internally at dbt Labs (this link is behind a login but can be used to reference in the future: https://dbt-labs.productboard.com/entity-detail/features/18692136)

dbeatty10 commented 10 months ago

We want to have a delightful experience around warehouse authentication that is friction-free, so re-opening this public-facing issue.

As @ernestoongaro mentioned, we have a separate internal ticket here that we've kept open as well since the most significant pieces of the implementation will take place within dbt Cloud.

b-per commented 10 months ago

Thanks for opening this and also supplying a "very good workaround" @ernestoongaro !

What would you imagine the profiles.yml entry to look like for this new connection method?

my-bigquery-db:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: azure-ad

      # Identity federation for Azure AD auth
      some_key: [some_value]
      some_other_key: [some_other_value]
      ...

@dbeatty10 , Workload Identity Federation is not for end-user connections, but for headless/service access to resources, so Azure AD login wouldn't fit this use case.

Based on your comment above and some demonstrations of how WIF works with GitHub Actions, the profile would most likely look more like:

my-bigquery-db:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: workload-identity-federation
      # Identity federation info
      workload_identity_provider: [some_value]
      service_account: [name_of_the_service_account]
      ...

On the GCP side, people would need to set up a Workload Identity Pool and configure an OpenID Connect (OIDC) provider with its issuer URL.

This means that we need to have a service in place that can communicate with GCP using the OIDC protocol. This would happen on the dbt Cloud side and would require some development there.
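
To sketch what that OIDC-speaking service would have to do: the core of the flow is a single token-exchange POST against Google's STS endpoint, trading the external OIDC token for a federated Google access token. The stdlib-only sketch below builds (but does not send) that request; the audience value is a placeholder, and the field names follow Google's documented STS ExchangeToken API.

```python
import json
import urllib.request

STS_URL = "https://sts.googleapis.com/v1/token"

def build_sts_exchange_request(audience: str, oidc_token: str) -> urllib.request.Request:
    """Build (but do not send) the STS token-exchange request that trades an
    external OIDC token for a federated Google access token."""
    payload = {
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        # Resource name of the workload identity pool provider (placeholder).
        "audience": audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt",
        "subjectToken": oidc_token,
    }
    return urllib.request.Request(
        STS_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```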

On the dbt-bigquery side, I will have a look at the code and identify what might need to be implemented.

b-per commented 10 months ago

The logic of the GitHub Actions example from the video is available in the google-github-actions/auth repo on this page (in TypeScript, though).

We can see what API calls are made.
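
For reference, the calls that repo makes boil down to the STS token exchange followed by an iamcredentials `generateAccessToken` impersonation call, which swaps the federated token for a short-lived service-account access token. A hedged stdlib sketch of that second call (the service-account email and token values are placeholders):

```python
import json
import urllib.request

def build_impersonation_request(sa_email: str, federated_token: str) -> urllib.request.Request:
    """Build (but do not send) the iamcredentials.generateAccessToken call that
    exchanges a federated STS token for a short-lived service-account token."""
    url = (
        "https://iamcredentials.googleapis.com/v1/projects/-/"
        f"serviceAccounts/{sa_email}:generateAccessToken"
    )
    payload = {
        "scope": ["https://www.googleapis.com/auth/cloud-platform"],
        "lifetime": "3600s",  # short-lived; nothing to rotate afterwards
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {federated_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```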

userbradley commented 7 months ago

I really don't want to do this, but +1.

We'd want to do something similar, so if it helps, register my interest to keep this issue open.

mwstanleyft commented 2 months ago

It's not really clear to me what value this adds. Is this intended for dbt-core users or dbt Cloud?

If core, then you should be using workload identity via whatever worker is running dbt-core. We run dbt-core on Kubernetes, for example, and it uses Workload Identity and has a federated identity it can use to authenticate to GCP as a GCP service account. dbt can then run in the usual oauth mode once the worker has logged in, for example with gcloud auth login. There is a worked example of impersonating a GCP service account from an Azure worker here, for example. If, at the end of that guide, the worker executed a dbt command with oauth as the authentication method, it should work.
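
To illustrate why nothing extra is needed in that setup: with Workload Identity enabled, the pod's GCE-style metadata server serves short-lived access tokens, and Application Default Credentials (which dbt's oauth method relies on) pick them up automatically. A stdlib sketch of the underlying request, built but not sent here:

```python
import urllib.request

METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)

def build_metadata_token_request() -> urllib.request.Request:
    """Build the metadata-server request that ADC-aware clients use on
    GCE/GKE to fetch an access token for the ambient identity."""
    return urllib.request.Request(
        METADATA_TOKEN_URL,
        # This header is mandatory; the metadata server rejects requests
        # without it to prevent accidental/SSRF-style access.
        headers={"Metadata-Flavor": "Google"},
    )
```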

If Cloud then the ask seems more like, adding the ability to trust dbt Cloud as a workload identity provider, and I'm not sure that's the scope defined in the ticket?

ernestoongaro commented 2 months ago

Hi @mwstanleyft yes the intention is for getting it to work with dbt Cloud, there might be some changes required in Core. Thanks for the comment!

mwstanleyft commented 2 months ago

I see - then yeah, a lot of the discussion above doesn't make a ton of sense :D

For one thing, you wouldn't use profiles.yml if you're using dbt Cloud, would you?

I think the discussion above about Azure AD, on-prem AD FS, Okta, etc. is a red herring. I also think this feature has nothing in particular to do with BigQuery or GCP, and the support required from dbt Labs is broadly the same for GCP/BigQuery as it would be for AWS IAM, Azure, or anyone else who supports federating login to a third-party identity provider.

So for this request, you would want dbt Cloud itself to be trusted as an identity provider and provide an OIDC-compliant token issuer endpoint (similar to this one that GitHub provides) and its own identities. This would allow workloads in dbt Cloud to impersonate service accounts in your Cloud environment when they're working on dbt runs.
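
To make "trusted as an identity provider" concrete: GCP would validate dbt Cloud-issued JWTs against the issuer's published JWKS and map their claims (iss, sub, aud) to pool identities via attribute mapping. The sketch below decodes a JWT's claims segment with the stdlib; the issuer URL and claim values are entirely hypothetical, not real dbt Cloud endpoints.

```python
import base64
import json

def decode_jwt_claims(token: str) -> dict:
    """Decode the (unverified) claims segment of a JWT. Real validation
    would also check the signature against the issuer's JWKS."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Hypothetical claims a dbt Cloud-issued token might carry; GCP's
# attribute mapping would match on values like these.
claims = {
    "iss": "https://cloud.example-dbt.com/oidc",  # hypothetical issuer URL
    "sub": "job:12345:environment:prod",          # hypothetical subject
    "aud": "//iam.googleapis.com/projects/123/locations/global/"
           "workloadIdentityPools/p/providers/x",
}
header_b64 = base64.urlsafe_b64encode(
    json.dumps({"alg": "RS256"}).encode()).decode().rstrip("=")
payload_b64 = base64.urlsafe_b64encode(
    json.dumps(claims).encode()).decode().rstrip("=")
sample_token = f"{header_b64}.{payload_b64}.signature"
```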

I'm not sure what this would require for the Cloud IDE to function correctly (because you would want production runs to use a separate service account to developers, who should probably be logged in via OAuth as normal), but simplistically the solution would probably be to have the connection manager in dbt Cloud support OIDC as an option - the user would need to set up their cloud environment properly and then provide the pool provider details and the service account name, and then dbt Cloud will be able to impersonate that service account with GCP on its own authority as the token issuer.

Seems like a substantial amount of effort on behalf of dbt Labs to deliver this!

mwstanleyft commented 2 months ago

And yeah I should reiterate that this isn't necessary at all for dbt-core since the end user is in control of the compute where dbt-core runs and can decide how they want to federate identities to that workload. It only matters for jobs running in dbt Cloud, where dbt Labs owns the worker running the dbt command.

seth-acuitymd commented 2 months ago

I think the issue with self-hosting re: GCP Workload Identity, at least from what I saw, lies specifically in authentication to BigQuery? Based on the docs here: https://docs.dagster.io/integrations/bigquery/using-bigquery-with-dagster#prerequisites

It's been a while since I set up my Dagster instances in GKE, but I believe the shortcoming of using Workload Identity when not running on Dagster Cloud was related to BQ (again, not sure if this is a Dagster or BQ limitation; I can provide some more info if it's helpful).

mwstanleyft commented 2 months ago

I've hit no shortcomings using dbt-core in GKE with BQ and Workload Identity. Works great. We also use Dagster, but we self-host our agent in GKE along with all the workers for the jobs. You don't even need OIDC for that, since it's GCP-native.