dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.61k stars 1.59k forks source link

Add support for BigQuery "Authorized Views" #1718

Closed heisencoder closed 4 years ago

heisencoder commented 5 years ago

Describe the feature

Add support for BigQuery "Authorized Views".

See https://cloud.google.com/bigquery/docs/share-access-views

Many BigQuery users use authorized views as a way to provide access to users for a subset of data in a database without having to grant access to the whole database.

For this to work, we'd need a way to describe these authorizations on a dataset (i.e., schema in non-BigQuery adapters). This could possibly be done via extensions to the yml file formats, either in the schema.yml or dbt_project.yml files. Since this is a dataset property (not a model property), it can't be described in the post_hooks section for the models.

See also #1714, which runs into a similar problem for documenting a dataset. That feature also needs a way to specify metadata on a dataset/schema.

drewbanin commented 5 years ago

Hey @heisencoder - so many of the feature requests you've suggested would be such no-brainers if BQ's operational SQL support was a little bit more complete :D

I'm really dismayed to see that this functionality requires API requests - that makes it so much harder to hook into dbt's sql compilation pipeline. I'm (temporarily) slapping a #wontfix on here, insofar as there's no way for us to implement this until dbt has some sort of first-class notion of datasets/schemas in a yaml file somewhere. I'm still not particularly excited about supporting that, but I am very open to discussing it further.

In the meantime, I think my recommendation would be to manage these dataset permissions manually outside the scope of dbt. Let me know what you think!

heisencoder commented 5 years ago

I'm a little sad that this can't be supported now, but can investigate ways to add non-dbt extensions to add authorized views and dataset/schema documentation. That said, I feel like adding a field to the schema.yml file for dataset/schema documentation would itself be a useful feature for the documentation server, and after this is in place, it would be straightforward to add a hook for specifying authorized views.

drewbanin commented 5 years ago

Yep! Totally agree.

I think for databases like BQ that are api-driven, we should figure out a good approach for exposing hooks into the api via the adapter object in the dbt context.

Whereas on Redshift, a hook like this might look like:

post-hook:
  - 'grant usage on {{ this.schema }} to bi_user;'

we should instead be able to do something like:

post-hook:
  - '{% do adapter.addAccessEntry(this.schema, 'READER', 'groupByEmail', 'data_analysts@example.com') %}'

or similar. We'll need to be really careful with an approach like this. When dbt evaluates this jinja code, it will invoke an api call. We want that to happen at run-time (after the model has been built), not at parse-time!

I don't want to pollute the adapter namespace with a bunch of one-off methods like this, but maybe we can start simple and see where this takes us.

heisencoder commented 5 years ago

I hadn't thought about the adapter.addAccessEntry route, but this makes sense as a possible approach. Maybe the implementation could be via a new bigquery macro that only runs the adapter hook if execute is true. Something like this:

{% macro addAccessEntry(/* authorized view parameters */) -%}
  {%- if execute: -%}
    {{ adapter.addAccessEntry(/*authorized view parameters*/) }}
  {%- endif -%}
{%- endmacro %}
d-swift commented 4 years ago

I'm interested in adding this functionality -- if you can nudge me in the right direction I'll be happy to put a pull request together

heisencoder commented 4 years ago

Hi d-swift,

Drew or Jake can correct me, but I think it would be roughly like this:

include/bigquery/macros: <insert macro like the one described above>

adapters/bigquery/impl.py:

class BigQueryAdapter:
  ...
  @available
  def addAccessEntry(self, ...):
        conn = self.connections.get_thread_connection()
        client = conn.handle
        # See https://cloud.google.com/bigquery/docs/share-access-views

There's also documentation updates and testing...

But, before you start, make sure Drew will accept the pull request.

drewbanin commented 4 years ago

Oof - I missed these updates! Thanks @heisencoder - this is super spot-on.

My thinking has changed somewhat since making this comment above:

I don't want to pollute the adapter namespace with a bunch of one-off methods like this, but maybe we can start simple and see where this takes us.

Things like this are pretty well suited for Python, and I don't want to shoehorn them into SQL/jinja if they don't actually belong there.

For my part, I wish BigQuery would release DDL to create authorized views in SQL.

It looks like the magic happens with:

# Authorize the view to access the source dataset
access_entries = source_dataset.access_entries
access_entries.append(
    bigquery.AccessEntry(None, 'view', view.reference.to_api_repr())
)
source_dataset.access_entries = access_entries
source_dataset = client.update_dataset(
    source_dataset, ['access_entries'])  # API request

So, we're going to need:

So, I'd add a method to the BigQuery impl.py file that looks just like what @heisencoder showed above:

class BigQueryAdapter:
  ...
  @available
  def makeSecureView(self, source_dataset, authorized_view):
        conn = self.connections.get_thread_connection()
        client = conn.handle

        # See https://cloud.google.com/bigquery/docs/share-access-views

This method would implement the snippet shown above to add an access entry to the source dataset which authorizes selection from the newly created view. I'd add this logic at the bottom of the view.sql materialization for BigQuery with:

{% if config.get('secure') is True %}
  {% do adapter.makeSecureView(sourceDatasets, this) %}
{% endif %}

The one big challenge here is that sourceDatasets variable. Should we make dbt derive that automatically from the refs and sources present in the model? Or should that be user-supplied?

If anyone knows a PM over at Google, kindly inform them of the existence of grant statements! That would really simplify things like this :)

azhard commented 4 years ago

Is adding support for authorized views on the roadmap? Need this feature for my company's dbt project so I'm happy to contribute on the solution if needed

drewbanin commented 4 years ago

hey @azhard - this feature is not currently on the roadmap, and we don't intend to pick it up ourselves any time soon! It is a good feature though, and we'd be super happy to work with you on a PR that adds this functionality to dbt. Let us know how we can help!

azhard commented 4 years ago

Awesome I'll give it a shot then! In terms of the question you asked about whether dbt should automatically derive the sourceDatasets from the refs and sources in the model have you given any more thought to that?

I have a few thoughts on the matter:

  1. Having to manually specify the source datasets is kind of a pain especially when considering that the datasets should are already encapsulated in the model definition.
  2. Counter to that though, automatically inferring the source datasets might be a little dangerous if people start building views that reference a number of sources. They might only need to authorize for 1 dataset but would end up adding unintended access entries for many.

Based on that my thinking is auto-infer but throw an error if more than one source is present in the model. This forces users to make their views more 'modular' but I'm wondering if that imposes too much on the user to follow that pattern.

azhard commented 4 years ago

Also separate question, I'm struggling to access the sources from view.sql. I see them as a list of lists in the model variable under model.sources but I'm not sure how to convert these into actual source objects so that I can extract the dataset

azhard commented 4 years ago

Hey @drewbanin any thoughts on the above?

drewbanin commented 4 years ago

hey @azhard - thanks for following up on this one!

I just re-read the BQ docs for authorized views. This is an unfortunately implemented feature IMO, and I'm struggling to think through how dbt could work well in this environment. I think we should break this feature down into two parts:

  1. Some helper code that assists users in granting reader permissions to a group on a dataset
  2. A dbt-level config that authorized a view to select from a permissioned dataset

My general thinking here is that step 1, granting reader permissions to a group on a dataset, is really only something that needs to happen once for a given group+dataset. It would be a little silly to add the same group as a reader on a dataset every single time dbt built a view.

Opposite to that, dbt must try to add an access entry to a source dataset every time it creates a view.

At a high-level, I can imagine this working with a run-operation + a model config. That might look like:

# Add a group as a reader on a dataset
$ dbt run-operation grant_read_on_dataset --args '{"group": "data_analysts@example.com", "dataset": "source_data"}'

If we wanted to sugar this up, we could add a new field to the spec for the sources: block in .yml files, eg:

sources:
  - name: my_source
    config:
      readers: [data_analysts@example.com]
    tables:
      - name: ...

The run-operation could then execute the relevant grants for each of the sources with configured readers. Or something like that, anyway.

For the model config, I'm picturing:

-- configure a view model to be authorized on a dataset

{{
  config(
    materialized='view',
    grant_access_to=['source_dataset_1', 'source_dataset_2']
  )
}}

select ...

I think that being explicit about the grant_access_to datasets is a good idea. If we wanted to, we could add a helper macro that returns a list of datasets referenced as sources in a model. This actually will not work today (for a not-good reason), but we can definitely support it in the future.

Referencing sources in a model's config

Sources are "captured" by dbt's jinja parser whenever a {{ source(...) }} function is called. This means that if you have some code like:

{% do log(model.sources, info=true) %}

select ...
from {{ source('....') }}

Then model.sources will not be populated when the log function is called. dbt has not yet parsed the source() function call, so dbt is unaware of the fact that the view model depends on any sources.


So, this is a pretty involved issue, and it's made a lot gnarlier by BigQuery's implementation. In an ideal world, you could run a simple grant statement as a post-hook which did all of this stuff behind the scenes. I just threw a lot of stuff at you -- let me know what you think!

azhard commented 4 years ago

@drewbanin super helpful info thanks! Just pushed a PR, I decided to keep the implementation generic as there's a few ways that access_entries can be used