entechlog / dbt-snow-mask

This repository contains source code for dbt package dbt_snow_mask.
https://hub.getdbt.com/entechlog/dbt_snow_mask/latest/
GNU General Public License v3.0
60 stars 25 forks source link

Define masking policies only in one schema #5

Closed rumbin closed 2 years ago

rumbin commented 3 years ago

This is up for discussion...

In my understanding it is the cleanest approach to have each masking policy only defined once in a single, common, configurable schema and keep re-using it instead of defining it repetitively in each schema where it is used.

See also https://www.snowflake.com/blog/masking-semi-structured-data-with-snowflake/ Section 'Design your approach...'.

What war the considerations that led to the current design decision?

entechlog commented 3 years ago

Hello @rumbin, Interesting thoughts

rumbin commented 3 years ago

@entechlog I am not sure I do understand your thoughts about a central dbt-snow-mask.yml. So I simply elaborate a bit more on my thoughts and hope we'll converge into the same solution at some point ;-).

My understanding of a masking policy is something which is rather re-usable. Just like the general definition of a test method. E.g., not_nul or unique are defined centrally as a one macro, respectively, to be used in many models' columns. Or, just like a UDF or a UDTF, which are also defined once in a central schema, like, e.g., public or utils or so. Specifying which policy to apply to which column is currently done via the schema YAML in the meta section and I feel that this is the most natural approach. So this should be kept in my eyes.

I think that your and my points of view differ when it comes to considering the reusability of a masking policy. I understand that there may be use cases where a masking policy is pretty schema specific, since the masking conditions are tailored to some specific roles which might in turn be closely related to the schema's content. In such a case it might be useful to place the policy in the same schema like the model it applies to, however, the policy's creation macro will anyway need to be uniquely named and, thus, it would not harm to put the policies in a common, central schema, as well.

Having the masking policies in a central schema ensures we have no duplication and we can configure different access control for the policies, than for the models's schemas.

BTW: masking semi-structured data is on my bucket list as well. However, I am a bit concerned about performance implications of the solution I linked above. Wondering to what extent this solution will hamper the columnar access of single VARIANT paths.

entechlog commented 3 years ago

I am not sure I do understand your thoughts about a central dbt-snow-mask.yml. So I simply elaborate a bit more on my thoughts and hope we'll converge into the same solution at some point ;-).

My understanding of a masking policy is something which is rather re-usable. Just like the general definition of a test method. E.g., not_nul or unique are defined centrally as a one macro, respectively, to be used in many models' columns. Or, just like a UDF or a UDTF, which are also defined once in a central schema, like, e.g., public or utils or so. Specifying which policy to apply to which column is currently done via the schema YAML in the meta section and I feel that this is the most natural approach. So this should be kept in my eyes.

Thank you for the detailed insights on your thoughts, We are totally on same page here

entechlog commented 3 years ago

I think that your and my points of view differ when it comes to considering the reusability of a masking policy. I understand that there may be use cases where a masking policy is pretty schema specific, since the masking conditions are tailored to some specific roles which might in turn be closely related to the schema's content. In such a case it might be useful to place the policy in the same schema like the model it applies to, however, the policy's creation macro will anyway need to be uniquely named and, thus, it would not harm to put the policies in a common, central schema, as well.

Had the same thoughts when I originally started, to have a schema for compliance and to define the masking policies in a central schema and reuse them in the required target database/schema. Had to take the current approach for ease of development but this is an enhancement which can be worked on in future

entechlog commented 3 years ago

BTW: masking semi-structured data is on my bucket list as well. However, I am a bit concerned about performance implications of the solution I linked above. Wondering to what extent this solution will hamper the columnar access of single VARIANT paths.

That is my concern as well, lets keep this thread open to share thoughts on performance and design considerations for this feature

rumbin commented 3 years ago

Had the same thoughts when I originally started, to have a schema for compliance and to define the masking policies in a central schema and reuse them in the required target database/schema. Had to take the current approach for ease of development but this is an enhancement which can be worked on in future

Great. I think the location/name of this compliance schema should be configurable as variables in the dbt-project.yml. In a similar fashion as this is solved by the dbt-artifacts package. I'll be on holidays the next couple of weeks but looking forward to helping on this task.

rumbin commented 3 years ago

One more consideration that comes to my mind:

When applying the masking policy to a table column, the policy is currently referenced with its fully-qualified path, including the database name.

This will lead to unexpected behaviour when cloning the database. In the clone the masked columns will still point to the original location of the masking policy and not to the cloned one. To remedy this, the masking has to be applied to the column by pointing to a relative location, omitting the database name and just specifying schema and name of the policy.

Specifying the database name is only relevant if the policies are stored in a database that differs from the target.database. While there might be use cases for such an architecture, my guess is that the number of cases where the policies are stored in the same database will make the majority.

Conclusion:. In my eyes, we should not include the database name in the policy path when applying the policy, or even better, we make it configurable, if the the fully-qualified path should be used.

rumbin commented 3 years ago

Regarding the fully-qualified path of the masking policy (see my last comment), I just made the experience that this is really troublesome with cloned databases:

The cloned DB ist still referencing the original policy then. As a consequence, updating an existing policy in the original DB will fail as soon as we are trying to create or replace the policy. It cannot be replaced, since the cloned DB refers to it and the policy is not automatically being unapplied by dbt_snow_mask as it is not its job to care for other DBs than the target.database.

entechlog commented 2 years ago

Thank You @rumbin for your inputs on this. Sorry for the very late response, I might try to find sometime to get started on this one, before I start just wanted to check to see if you have started on any of the enhancements ?

rumbin commented 2 years ago

@entechlog, unfortunately, I haven't found any time to start working on this. If you need any input or review from me, I'll happily help!

entechlog commented 2 years ago

No issues. Sure, I will pass on for your review/testing once I have something on this one

entechlog commented 2 years ago

Hello @rumbin @robscriva ,

Just made changes to add this feature and created a release candidate https://github.com/entechlog/dbt-snow-mask/releases/tag/0.1.8rc1

Will you be having sometime to review and test the changes ? Added integration tests as well part of this change and they are passing.

Test result with config enabled

vars:
  use_common_masking_policy_db: "True"
  common_masking_policy_db: "DEMO_DB"
  common_masking_policy_schema: "GDPR"

image

Test result without config in project image

Data in snowflake image

nadesansiva commented 2 years ago

This feature has been implemented now. Closing the issue.