dbt-labs / dbt-adapters

Apache License 2.0

[ADAP-383] [Feature] Support Dynamic Data Masking in CTAS Statements #85

Open jdoldis opened 1 year ago

jdoldis commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Currently you cannot specify column masking policies in Snowflake CTAS statements with dbt-snowflake. For example, CREATE TABLE <table_name>(<col_name> <col_type> WITH MASKING POLICY <policy_name>) AS SELECT <query>.

As a workaround, masking policies can be applied to columns in a dbt post hook using an ALTER TABLE statement. The issue with doing this is that the CTAS and ALTER TABLE statements cannot be issued in the same transaction, as per the Snowflake documentation: "Each DDL statement executes as a separate transaction". As a result there is a small window of time between the CTAS and the ALTER TABLE <table_name> MODIFY COLUMN <column_name> SET MASKING POLICY <policy_name> statements during which the data is not masked, and if the ALTER TABLE statement fails it would remain that way.
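For illustration, the workaround looks roughly like this in model YAML (a hypothetical sketch: the model, column, and policy names are placeholders, and the hook key may also be spelled `post-hook` depending on dbt version):

```yaml
# Hypothetical example of the post-hook workaround; all names are placeholders.
models:
  - name: customers
    config:
      materialized: table
      post_hook:
        - "alter table {{ this }} modify column email set masking policy pii_mask"
```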

Supporting masking policy specification in the CTAS statement would fix this. As per the Snowflake documentation "Executing a CREATE TABLE … AS SELECT (CTAS) statement applies any masking policies on columns included in the statement before the data is populated in the new table".

It may also not be difficult to support this given the recent work on model contracts, which provides the CREATE TABLE <table_name>(<col_name> <col_type>) AS SELECT <query> syntax. All that would need to be added is the WITH MASKING POLICY <policy_name> part of the statement.

One way to provide the config would be something like:

columns:
  - name: id
    description: "Primary key for this model"
    data_type: integer
    masking_policy: <policy_name>
    constraints: <constraint_list>
    ...

The masking policy could then be applied in get_columns_spec_ddl.

Describe alternatives you've considered

Using an ALTER TABLE statement in a post hook to apply the masking policy. As described above, because Snowflake DDL statements always execute in separate transactions, this leaves a window in which the data is unmasked.

Who will this benefit?

Anyone that wants to take advantage of dynamic data masking in Snowflake using dbt.

Are you interested in contributing this feature?

Yes

Anything else?

No response

jdoldis commented 1 year ago

This has also been brought up in the Slack community a few times; adding links for reference:

Fleid commented 1 year ago

@jtcohen6 I'm sorry if we already discussed this... but is that something on your radar for contracts?

Fleid commented 1 year ago

Thanks @jdoldis for the write-up, you make a great case for it :)

jdoldis commented 1 year ago

Thanks @Fleid , I'm writing a custom materialization to support the syntax as described above. Let me know if you'd like those changes contributed here. Otherwise, I look forward to seeing it at some point down the road 🙂

ingolevin commented 1 year ago

I have the exact same requirement, but for row access policies! I think this can be implemented in one go, as it shares the same CTAS syntax.

CTAS:

CREATE TABLE <table_name> ( <col1> <data_type> [ with ] masking policy <policy_name> [ , ... ] )
  ...
  [ WITH ] ROW ACCESS POLICY <policy_name> ON ( <col1> [ , ... ] )

A config with masking and row access policies could look like this:

columns:
  - name: id
    description: "Primary key for this model"
    data_type: integer
    masking_policy: <policy_name>
    row_access_policy: <policy_name>
    constraints: <constraint_list>
    ...

Currently the only option is via a post_hook ALTER table|view ADD ROW ACCESS POLICY ...

Since one first has to check the information_schema to see whether the relation (view or table) already has said RAP applied (otherwise the ALTER command fails), this whole process can take up to 10 seconds [Edit: that was mainly due to a cluster_by, but even in the best case there is a ~1 second delay] during which the data in the relation is unprotected.
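For reference, a sketch of that pre-check using Snowflake's POLICY_REFERENCES table function (the database, schema, table, and policy names are placeholders):

```sql
-- Check whether a row access policy is already attached before issuing the ALTER.
select policy_name
from table(my_db.information_schema.policy_references(
    ref_entity_name   => 'my_db.my_schema.my_table',
    ref_entity_domain => 'table'
))
where policy_kind = 'ROW_ACCESS_POLICY';
```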

ingolevin commented 1 year ago

> Thanks @Fleid , I'm writing a custom materialization to support the syntax as described above. Let me know if you'd like those changes contributed here. Otherwise, I look forward to seeing it at some point down the road 🙂

@jdoldis Would you release your custom materialization as a package, until this gets implemented into dbt-snowflake? I'd be very interested as well!

jdoldis commented 1 year ago

Hey @ingolevin, the materialisation I wrote is essentially a copy paste of the standard v1.5 table materialisation. The difference is I have modified the table_columns_and_constraints macro to support masking policies. In this modified macro I build the ddl by looping through the model columns and outputting the name/datatype, and then adding WITH MASKING POLICY <policy_name> if a masking policy is defined for the column.
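For anyone wanting to try the same approach, here is a rough, untested sketch of such a modified macro (the `masking_policy` config key and the exact macro name/signature are assumptions, following the column config proposed earlier in the thread):

```jinja
{# Hypothetical sketch: build the column spec and append masking policies. #}
{% macro snowflake__table_columns_and_constraints() %}
  (
  {% for column in model['columns'].values() %}
    {{ column['name'] }} {{ column['data_type'] }}
    {%- if column.get('masking_policy') %} with masking policy {{ column['masking_policy'] }}{% endif %}
    {{- "," if not loop.last }}
  {% endfor %}
  )
{% endmacro %}
```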

jdoldis commented 1 year ago

Since I raised this, the DDL logic seems to have moved around a bit; the relevant code to modify in the Snowflake adapter to support masking policies would now be this function, I think.

jdoldis commented 1 year ago

Regarding creating a package, it would be good to hear back from @Fleid first. Ideally we could implement here, but if that's not possible I would be open to it 🙂

jtcohen6 commented 1 year ago

I'm overdue responding here!

It's true that, starting in v1.5, for models with enforced contracts, dbt will be able to template out create table DDL that includes a full column spec.

That's the prerequisite to defining row-level access policies & column-level masking policies while the table is being created, rather than via an alter statement right after! (And it avoids a thorny problem where copy grants would lead to folks having access to a just-replaced table before masking policies have been applied.)

I hadn't had row-level & column-level access/masking policies in scope for model contracts — but it would be very cool to combine the two, to consider the access/masking policy part of the contract, and to consider the removal of one such policy to be a breaking change to the contract (catchable in CI).

For the moment, it would be possible to stand this up via some macro overrides. (Maybe these policies could even be a constraint of type: custom, i.e. "free text input"?) Before we support this as out-of-the-box functionality, I'd want us to first see if there's an opportunity to standardize over the way this works on different data platforms that support related functionality. If we can offer a standard syntax for this, that abstracts over the capabilities on multiple platforms without leaking, we should do it :)

jdoldis commented 1 year ago

Sounds great @jtcohen6 , let me know if I can help!

jtcohen6 commented 1 year ago

I'm following up here after a good chat with @graciegoheen @dbeatty10 @dataders, given the prompt to support similar functionality in dbt-redshift (https://github.com/dbt-labs/dbt-redshift/issues/589). There's now support for this on several of our most popular data platforms:

I think a good implementation of this functionality would look like:

  1. Teaching dbt about functions, as a new "materialization"* + a new node type. This would be inclusive of both UDFs (many previous discussions + older issues) and masking/access rules (a special type of function that returns a boolean).
  2. Allowing dbt to reference masking/access rules within model contract definitions

_*These map to DWH objects, so they are members of the DAG, and models / other functions could call (= depend on) them. I don't think these are models because they aren't select statements — they can be parametrized, they cannot be previewed to return a dataset (...unless it's a UDTF, which blurs the boundary)._

(1) is a bigger lift than (2), and it's not something we have the capacity to prioritize right now — but in the meantime, I've asked @dataders to do a bit more thinking about what a good UX might look like :)

b-per commented 9 months ago

Databricks allows data masking in CTAS as well!

[CREATE OR] REPLACE TABLE
...

column_properties
  { NOT NULL |
    GENERATED ALWAYS AS ( expr ) |
    GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( [ START WITH start ] [ INCREMENT BY step ] ) ] |
    DEFAULT default_expression |
    COMMENT column_comment |
    column_constraint |
    MASK clause } [ ... ]
dbeatty10 commented 9 months ago

Moving this feature request to the dbt-adapters repo for further refinement since the underlying functionality is supported on many cloud data warehouses now (Redshift, Snowflake (Enterprise only), BigQuery, Databricks, Azure, etc.)

github-actions[bot] commented 3 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 2 months ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

jdoldis commented 2 months ago

I still think this would be a great security improvement for dbt; it would be good to keep the issue open.

b-per commented 2 months ago

Today, with contracts and constraints, we actually support tag-based masking policies with the following syntax:

models:
  - name: my_model
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "tag (my_tag = 'my_value')"

This requires defining all the columns and their types though.

Would it work for your use case?

jdoldis commented 2 months ago

That's great news, hadn't realised that, thanks @b-per ! I don't currently have all column types documented, but I can move to that 👍

jellerbee-altr commented 2 months ago

@b-per This seems promising, but is there any documentation on custom constraints? In particular, docs on the expression "tag (my_tag = 'my_value')".

jellerbee-altr commented 2 months ago

Okay I figured it out. The expression is simply the SQL to represent the constraint on the column.

grantith commented 2 months ago

@b-per That's interesting, I didn't know that was possible! I'll have to give it a shot. If only I had a good way to add row access policies to relations in Snowflake, too.

@jdoldis You might find something like dbt-osmosis useful for quickly generating model docs.

b-per commented 2 months ago

The reason we need a contract with all columns is that it is the only way we can send a create or replace table mytable (<cols_info_with_masking>) as .... We can't do it with just a partial list of columns.

To generate a YAML file with all the columns you could also use generate_model_yaml from dbt-codegen.

As there seems to be some interest from folks in the custom constraint "trick", I will reach out to the Docs team to see if we can add it to the docs.
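To make the rendered shape concrete, the contract-based example above would produce DDL along these lines (illustrative only; the database, schema, and query are placeholders, and actual dbt output differs in formatting):

```sql
-- Illustrative rendering of a contract with a custom tag constraint.
create or replace table my_db.my_schema.my_model (
    id int tag (my_tag = 'my_value')
) as (
    select ...
);
```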

b-per commented 2 months ago

Also, I have not tried it, but I think you could add a masking policy directly instead of using tags:

models:
  - name: my_model
    config:
      contract: {enforced: true}
      materialized: table
    columns:
      - name: id
        data_type: int
        constraints:
          - type: custom
            expression: "masking policy my_policy"