Open jdoldis opened 1 year ago
This has also been brought up in the slack community a few times, adding links for reference:
@jtcohen6 I'm sorry if we already discussed this... but is that something on your radar for contracts?
Thanks @jdoldis for the write-up, you make a great case for it :)
Thanks @Fleid , I'm writing a custom materialization to support the syntax as described above. Let me know if you'd like those changes contributed here. Otherwise, I look forward to seeing it at some point down the road 🙂
I have the exact same requirement, but for row access policies! I think this can be implemented in one go, as they share the same CTAS syntax.
CTAS:
```sql
CREATE TABLE <table_name> ( <col1> <data_type> [ WITH ] MASKING POLICY <policy_name> [ , ... ] )
...
[ WITH ] ROW ACCESS POLICY <policy_name> ON ( <col1> [ , ... ] )
```
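Put concretely, a CTAS combining both policy types could look like the following sketch (the table, column, and policy names are placeholders, not real objects):

```sql
-- Hypothetical example: a column-level masking policy plus a table-level
-- row access policy, both applied atomically as part of the CTAS itself,
-- so no follow-up ALTER statement is needed.
CREATE OR REPLACE TABLE analytics.customers (
    id      INTEGER,
    email   VARCHAR WITH MASKING POLICY pii_mask,
    country VARCHAR
)
WITH ROW ACCESS POLICY country_rap ON (country)
AS SELECT id, email, country FROM raw.customers;
```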
Config with masking and row access policies could look like this:
```yaml
columns:
  - name: id
    description: "Primary key for this model"
    data_type: integer
    masking_policy: <policy_name>
    row_access_policy: <policy_name>
    constraints: <constraint_list>
    ...
```
Currently the only option is via a post_hook: ALTER table|view ADD ROW ACCESS POLICY .... Since one first has to check the information_schema to see whether the relation (view or table) already has that RAP applied (otherwise the ALTER command fails), this whole process can take up to 10 seconds [Edit: that was mainly due to a cluster_by, but even in the best case there is a 1-second delay] during which the data in the relation is unprotected.
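For reference, that post_hook workaround looks roughly like this (a sketch only; the policy, column, and model names are placeholders, and it still leaves the unprotected window described above):

```yaml
models:
  - name: my_model
    config:
      materialized: table
      post_hook:
        # Placeholder names. The ALTER runs as a separate transaction from
        # the CTAS, so the data is briefly unprotected after each rebuild.
        - "alter table {{ this }} add row access policy my_rap on (id)"
```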
@jdoldis Would you release your custom materialization as a package, until this gets implemented into dbt-snowflake? I'd be very interested as well!
Hey @ingolevin, the materialisation I wrote is essentially a copy-paste of the standard v1.5 table materialisation. The difference is that I have modified the table_columns_and_constraints macro to support masking policies. In this modified macro I build the DDL by looping through the model columns and outputting the name/datatype, then adding WITH MASKING POLICY <policy_name> if a masking policy is defined for the column.
Since I raised this, the DDL logic seems to have moved around a bit; the relevant code that could be modified in the Snowflake adapter to support masking policies would now be this function, I think.
Regarding creating a package, it would be good to hear back from @Fleid first. Ideally we could implement here, but if that's not possible I would be open to it 🙂
I'm overdue responding here!
It's true that, starting in v1.5, for models with enforced contracts, dbt will be able to template out create table DDL that includes a full column spec. That's the prerequisite to defining row-level access policies & column-level masking policies while the table is being created, rather than via an alter statement right after! (It also avoids a thorny problem where copy grants would lead to folks having access to a just-replaced table before masking policies have been applied.)
I hadn't had row-level & column-level access/masking policies in scope for model contracts — but it would be very cool to combine the two: to consider the access/masking policy part of the contract, and to consider the removal of such a policy a breaking change to the contract (catchable in CI).
For the moment, it would be possible to stand this up via some macro overrides. (Maybe these policies could even be a constraint of type: custom, i.e. "free text input"?) Before we support this as out-of-the-box functionality, I'd want us to first see if there's an opportunity to standardize the way this works across the different data platforms that support related functionality. If we can offer a standard syntax that abstracts over the capabilities of multiple platforms without leaking, we should do it :)
Sounds great @jtcohen6 , let me know if I can help!
I'm following up here after a good chat with @graciegoheen @dbeatty10 @dataders, given the prompt to support similar functionality in dbt-redshift (https://github.com/dbt-labs/dbt-redshift/issues/589). There's now support for this on several of our most popular data platforms:
I think the good implementation of this functionality would look like:

1. functions, as a new "materialization"* + a new node type. This would be inclusive of both UDFs (many previous discussions + older issues) and masking/access rules (a special type of function that returns a boolean).

_*These map to DWH objects, so they are members of the DAG, and models / other functions could call (= depend on) them. I don't think these are models because they aren't select statements — they can be parametrized, and they cannot be previewed to return a dataset (...unless it's a UDTF, which blurs the boundary)._
(1) is a bigger lift than (2), and it's not something we have the capacity to prioritize right now — but in the meantime, I've asked @dataders to do a bit more thinking about what a good UX might look like :)
Databricks allows data masking in CTAS as well!
```sql
[CREATE OR] REPLACE TABLE
...
column_properties
  { NOT NULL |
    GENERATED ALWAYS AS ( expr ) |
    GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] |
    DEFAULT default_expression |
    COMMENT column_comment |
    column_constraint |
    MASK clause } [ ... ]
```
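For illustration, a concrete (hypothetical) Databricks CTAS using the MASK clause might look like this, assuming a masking function named mask_email already exists in the catalog:

```sql
-- Hypothetical: attach a column mask function at table-creation time,
-- so the column is never exposed unmasked.
CREATE OR REPLACE TABLE customers (
    id    INT,
    email STRING MASK mask_email
)
AS SELECT id, email FROM raw_customers;
```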
Moving this feature request to the dbt-adapters repo for further refinement since the underlying functionality is supported on many cloud data warehouses now (Redshift, Snowflake (Enterprise only), BigQuery, Databricks, Azure, etc.)
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
I still think this would be a great security improvement for dbt; it would be good to keep the issue open.
Today, with contracts and constraints, we actually support tag-based masking policies with the following syntax:
```yaml
models:
  - name: my_model
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "tag (my_tag = 'my_value')"
```
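With the contract enforced, dbt should then render the custom constraint expression inline in the column spec, along these lines (illustrative only; the database and schema names are placeholders):

```sql
-- Illustrative rendering of the custom constraint above:
create or replace table my_db.my_schema.my_model (
    id int tag (my_tag = 'my_value')
) as (
    select 1 as id
);
```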
This requires defining all the columns and their types though.
Would it work for your use case?
That's great news, hadn't realised that, thanks @b-per ! I don't currently have all column types documented, but I can move to that 👍
@b-per This seems promising, but is there any documentation on custom constraints? In particular, docs on the expression "tag (my_tag = 'my_value')"?
Okay I figured it out. The expression is simply the SQL to represent the constraint on the column.
@b-per That's interesting, I didn't know that was possible! I'll have to give it a shot. If only I had a good way to add row access policies to relations in Snowflake, too.
@jdoldis You might find something like dbt-osmosis useful for quickly generating model docs.
The reason we need a contract with all columns is that it is the only way we can send a create or replace <cols_info_with_masking> mytable as .... We can't do it with just a partial list of columns.
To generate a YAML file with all the columns, you could also use generate_model_yaml from dbt-codegen.
As there seems to be some interest from folks about the custom constraint "trick", I will reach out to the Docs team to see if we could add it to the docs.
Also, I have not tried it, but I think you could directly add a masking policy instead of using tags:
```yaml
models:
  - name: my_model
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "masking policy my_policy"
```
Is this your first time submitting a feature request?
Describe the feature
Currently you cannot specify column masking policies in Snowflake CTAS statements with dbt-snowflake. For example:

CREATE TABLE <table_name> (<col_name> <col_type> WITH MASKING POLICY <policy_name>) AS SELECT <query>

As a workaround, masking policies can be applied to columns in a dbt post hook using an ALTER TABLE statement. The issue with doing this is that the CTAS and ALTER TABLE statements cannot be issued in the same transaction; per the Snowflake documentation, "Each DDL statement executes as a separate transaction". As a result there is a small window of time between the CTAS and the ALTER TABLE <table_name> MODIFY COLUMN <column_name> SET MASKING POLICY <policy_name> statements during which the data is not masked, and if the ALTER TABLE statement fails it would remain that way.

Supporting masking policy specification in the CTAS statement would fix this. Per the Snowflake documentation, "Executing a CREATE TABLE … AS SELECT (CTAS) statement applies any masking policies on columns included in the statement before the data is populated in the new table".
It also may not be difficult to support this given the recent work on model contracts, which provides the CREATE TABLE <table_name> (<col_name> <col_type>) AS SELECT <query> syntax. All that would need to be added is the WITH MASKING POLICY <policy_name> part of the statement. One way to provide the config would be something like:
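(A sketch only; masking_policy here is a proposed config key, not an existing dbt option, and the policy name is a placeholder.)

```yaml
models:
  - name: my_model
    columns:
      - name: id
        data_type: integer
        # Proposed (hypothetical) key, consumed when building the column spec DDL:
        masking_policy: my_masking_policy
```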
The masking policy could then be applied in get_columns_spec_ddl.
Describe alternatives you've considered
Using an ALTER TABLE statement in a post hook to apply the masking policy. As described above, because Snowflake DDL statements always execute in separate transactions, this leaves the possibility of unmasked data.

Who will this benefit?
Anyone that wants to take advantage of dynamic data masking in Snowflake using dbt.
Are you interested in contributing this feature?
Yes
Anything else?
No response