dbt-labs / dbt-external-tables

dbt macros to stage external sources
https://hub.getdbt.com/dbt-labs/dbt_external_tables/latest/
Apache License 2.0
298 stars 120 forks source link

Enable setting policy tags on BigQuery external tables #263

Closed thomas-vl closed 5 months ago

thomas-vl commented 6 months ago

Describe the feature

I would like to add policy tags to BigQuery external tables. https://cloud.google.com/bigquery/docs/column-level-security

Additional context

This is BigQuery specific

Who will this benefit?

Any organization that wishes to implement column level security in BigQuery on external tables.

dataders commented 6 months ago

OK, @thomas-vl thanks for making this issue! It's taken a few days for me to osmose the complete story here. Let me attempt to summarize the pickle in which we find ourselves.

  1. you can't persist_docs with sources, therefore you cannot add policytags
  2. This "makes sense", for sources, less so for external tables
  3. the future might look different, stay in touch with https://github.com/dbt-labs/dbt-adapters/discussions/92 for more context

Longer story

You can't touch this Sources

According to persist_docs docs page, persist_docs is not supported for sources. I can rationalize this, as traditionally, source tables are not created or modified by the analytics engineer via dbt.

For example, if the data engineering lands some clickstream data into a warehouse table for me to model for data analysts, I do not expect to have permission to modify the metadata of this table. The same is largely true in dbt Mesh: the upstream team controls the metadata and access permissions of the table on which the downstream table depends.

The real value of persist_docs imo, lies in embedding documentation where the data product end users are.

A valid workaround is to:

  1. create a wrapper view of a source table
  2. document it with column-level comments
  3. persist the docs to the view
  4. grant access to that view to intended audience, but not the original table

But wait, there's more!

Aren't we supposed to be talking about setting Column-level security via Policy Tags? Yes, which was very confusing to me until I found this important addition (thanks again @maria-samper for https://github.com/dbt-labs/docs.getdbt.com/pull/648!) on the BigQuery Configuration Page about Policy Tags

Please note that in order for policy tags to take effect, column-level persist_docs must be enabled for the model, seed, or snapshot.

Ah ok, so not only can I not change column comments/descriptions of source tables, neither can I change RBAC/IAM access metadata of columns in a source table. Bummer!

Bringing it all home

Herein lies the rub: external tables are sometimes more like seeds than sources. See https://github.com/dbt-labs/dbt-adapters/discussions/92 for more context

I can understand a perceived inefficiency of not being able to package an external table within a larger data product shared to end users. This is possible today with Seeds! However, the wrapper view workaround solves the problem from an end user perspective.

thomas-vl commented 6 months ago

Hi @dataders , So the external tables package is something that is fundamentally different then sources. A source is a way to reference a table that already exists somewhere that has its own permissions and modeling etc.

An external table is a materialized object that also can act as a source. Your comment that you are not able to set IAM policies on sources is absolutely valid. If someone creates some data somewhere they are 100% entitled to set permissions on that data.

But on the external table as this is a data object that I manage as an analytics engineer I am also entitled to set new types of permissions irrelevant of the permissions that has been set on the underlying data.

Historically you could not do this, historically when you created an external table you had to adher to the policies of the underlying data but now BigLake tables enter the playing field, in BigLake tables you use a connection to access the underlying data instead of direct access. With this connection the data engineer can give the permissions required to access the underlying table and the BigLake table can have its own permissions like policy tags.

Also your possible workaround would not work because you cant set policy tags on views. So if you want to do what you say you would require a full materialisation.

Please also note that with not having this feature we can't protect data in the chain, if we cant set policy tags on external tables we are not able to guarantee data masking while developing data products.

If I need to implement a flag somewhere that doing the API request is optional that would be fine for me but not abandon this completely.