andyweaves / dlt-pii-firewall

Demo of using Delta Live Tables to identify potential PII data
Apache License 2.0

DLT PII Firewall

Take a look at the companion blog for more info!

:warning: THIS PROJECT IS PROVIDED AS-IS WITHOUT ANY GUARANTEES: we make no claims as to the accuracy of the PII detection provided here, and if you decide to use it, it's YOUR RESPONSIBILITY to ensure that the example regexes and detection/redaction/tagging meets your internal, legal or regulatory requirements.

Using Delta Live Tables to detect and redact PII data


Delta Live Tables makes it easy to build and manage reliable data pipelines that deliver high-quality data on Delta Lake.

Setup

To get this pipeline running on your environment, please use the following steps:

  1. Clone this Github Repo using Databricks Repos (see the docs for AWS, Azure, GCP)
  2. First you need some input data to run the pipeline on. If you don't have any data, or just want to try it out, you can use the 00_generate_data.py notebook to generate some, using the following options to customise that data generation:
    • GENERATE_CLEAN_DATA: Whether to generate 4 records of artificially created "clean data" specifically designed not to get evaluated as PII
    • GENERATE_PII_DATA: Whether to generate fake PII data
    • NUM_ROWS: The number of rows of fake PII data to generate
    • OUTPUT_DIR: The path on cloud storage to write the generated data out to
  3. Create a new DLT pipeline, selecting 01_observability.py and 02_detect_and_redact_pii.py as Notebook Libraries (see the docs for AWS, Azure, GCP). You’ll need to add the following pipeline settings:
    • INPUT_PATH: The path on cloud storage where the input data is located
    • INPUT_FORMAT: The format of the input data: one of "delta", "parquet", "json" or "csv"
    • TABLE_PATH: The path to write out all of the tables created by the pipeline to
    • STORAGE_PATH: A location on cloud storage where output data and metadata required for the pipeline execution are stored. This should match the Storage Location entered below.
    • EXPECTATIONS_PATH: The path to the pii_firewall_rules.json config file uploaded to a DBFS-accessible path (for example a mounted storage bucket). This is the main configuration file used to customise the behaviour of the detection/redaction/tagging of data. See Firewall Rules below for more details.
    • NUM_SAMPLE_ROWS: To generate the SQL used to automatically redact the PII discovered, the pipeline samples this many rows, evaluates them against your expectations, and generates SQL that either leaves each column unchanged or redacts it accordingly. The fewer rows sampled, the faster this initial stage of the pipeline will run, but the more likely it is that PII makes it through the firewall
    • NESTED_DEPTH: Because regexes don't perform well on nested data, the pipeline will try to flatten any struct, map or array types in the data, iterating over the data to this depth
    • Target: The name of the database for persisting all of the output data
    • Storage Location: A location on cloud storage where output data and metadata required for the pipeline execution are stored. This should match the STORAGE_PATH entered above.
  4. Note: once you’ve edited the settings that are configurable via the UI, you’ll need to edit the JSON so that you can add the configuration needed to authenticate with your chosen cloud storage:
    • For AWS add the instance_profile_arn to the aws_attributes object
    • For Azure add the Service Principal secrets to the spark_conf object
    • For GCP add the google_service_account to the gcp_attributes object
  5. As well as the DLT pipeline, the project contains the notebook 03_tag_pii.py. This is designed to run after the DLT pipeline has finished, and to tag databases/tables/columns as appropriate to confirm that:
    • They have been scanned for PII
    • PII has either been found or not found
    • Where PII has been found, a customisable comment is added to the column it was found in
  6. In order to get this to run straight after our DLT pipeline, we're going to create a multi-task job workflow (see the docs for AWS, Azure, GCP). You'll need to select the notebook 03_tag_pii.py and pass in the following Parameters:
    • DATABASE_NAME: The database to apply tagging (via properties) to. Should match the Target entered above.
    • TABLE_NAMES: The tables within the database to apply tagging (via properties) and column-level comments to. The DLT pipeline creates 3 main tables: clean, redacted and output; you can apply tagging to 1, 2 or all 3 of these.
    • EXPECTATIONS_PATH: The path to the dynamic_firewall_rules.json config file uploaded to a DBFS-accessible path (for example a mounted storage bucket). This is the main configuration file used to customise the behaviour of the detection/redaction/tagging of data. See Firewall Rules below for more details.
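As a rough illustration of steps 3 and 4, the edited pipeline settings JSON on AWS might look something like the sketch below. All paths, the notebook locations, the target database name and the instance profile ARN are placeholders you'd replace with your own values:

```json
{
  "name": "dlt-pii-firewall",
  "target": "pii_firewall",
  "storage": "s3://my-bucket/dlt-storage",
  "libraries": [
    {"notebook": {"path": "/Repos/<user>/dlt-pii-firewall/01_observability"}},
    {"notebook": {"path": "/Repos/<user>/dlt-pii-firewall/02_detect_and_redact_pii"}}
  ],
  "configuration": {
    "INPUT_PATH": "s3://my-bucket/raw-data/",
    "INPUT_FORMAT": "json",
    "TABLE_PATH": "s3://my-bucket/tables/",
    "STORAGE_PATH": "s3://my-bucket/dlt-storage",
    "EXPECTATIONS_PATH": "/dbfs/expectations/dynamic_firewall_rules.json",
    "NUM_SAMPLE_ROWS": "1000",
    "NESTED_DEPTH": "1"
  },
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {"instance_profile_arn": "arn:aws:iam::111111111111:instance-profile/my-profile"}
    }
  ]
}
```

Note that STORAGE_PATH (a configuration key read by the notebooks) and the top-level storage field (the pipeline's Storage Location) point at the same location, as described above.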

Output Tables

The following data tables and views are created by this pipeline:

NOTE: The pipeline will try to flatten the first level of any nested columns in your input data. This is for performance as well as accuracy reasons (you can't easily apply a regex to a struct, array or map, and while you could cast the entire column to a string, the performance would be poor and it won't help if you have lots of different nested PII). If you have more than one nested level you may want to consider what to do here - you could update the function flatten_dataframe() to recursively find nested columns, or you could update it to automatically drop any additional nested columns it finds after the first pass.
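The recursive option mentioned above can be sketched in plain Python. The project's actual flatten_dataframe() operates on Spark struct/map/array columns, so everything below (the function name, the dict-based input, the separator) is purely illustrative of the recursion's shape, not the pipeline's real code:

```python
def flatten(record, depth, prefix="", sep="_"):
    """Flatten nested dicts/lists up to `depth` levels, joining names with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict) and depth > 0:
            # Recurse into the nested struct, one level shallower.
            flat.update(flatten(value, depth - 1, prefix=name))
        elif isinstance(value, list) and depth > 0:
            # Explode arrays into indexed columns.
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, depth - 1, prefix=f"{name}{sep}{i}"))
                else:
                    flat[f"{name}{sep}{i}"] = item
        else:
            flat[name] = value  # leaf value (or depth exhausted): keep as-is
    return flat

row = {"id": 1, "contact": {"email": "a@b.com", "phones": ["123", "456"]}}
flatten(row, depth=2)
# → {'id': 1, 'contact_email': 'a@b.com', 'contact_phones_0': '123', 'contact_phones_1': '456'}
```

Once every nested field is a flat, string-typed column, the same regex expectations can be applied uniformly across all of them.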


| Name | Type | Description |
| --- | --- | --- |
| staging | View | Initial view that data is loaded into. May contain PII and is therefore declared as a view (so that PII is not persisted after the pipeline has run). |
| quarantine | View | View containing data that has failed expectations. May contain PII and is therefore declared as a view (so that PII is not persisted after the pipeline has run). |
| clean | Table | Data that has passed expectations and is therefore not expected to contain PII. |
| redacted | Table | Data that has failed expectations and is therefore expected to contain PII, but in which that PII has been redacted based on the specified actions. |
| output | Table | A union of clean and redacted: data that has passed expectations (and so is not expected to contain PII) together with data that failed expectations but has been redacted based on the specified actions. |
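The clean/redacted split above is driven by the sampling step described in Setup: a column is only redacted when the share of sampled rows failing its expectation exceeds that rule's redact_threshold. A minimal, hypothetical sketch of that decision logic (the function name and regex are illustrative, not the pipeline's actual helpers):

```python
import re

def should_redact(sample_values, pattern, redact_threshold):
    """Return True if the % of sampled values matching `pattern` exceeds the threshold."""
    hits = sum(1 for v in sample_values if re.search(pattern, str(v)))
    failure_pct = 100.0 * hits / len(sample_values)
    return failure_pct > redact_threshold

# 3 of 4 sampled values look like email addresses -> 75% failure rate.
sample = ["alice@example.com", "n/a", "bob@example.com", "carol@example.com"]
should_redact(sample, r"[\w.+-]+@[\w-]+\.[\w.]+", redact_threshold=50)  # → True
```

Columns where this returns True end up in redacted (with the rule's action applied); columns where every rule stays under its threshold flow through to clean, and output unions the two.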

The following monitoring tables are created by this pipeline:

| Name | Type | Description |
| --- | --- | --- |
| event_logs | Table | Raw DLT event logs relating to the running and management of the pipeline |
| audit_logs | Table | Logs capturing the management events relating to the pipeline |
| data_quality_logs | Table | Logs capturing the DQ metrics relating to the pipeline |
| flow_logs | Table | Logs capturing the runtime events relating to the pipeline |

Run the Job

When everything is set up correctly, run the multi-task job and you should see something like this...

1. Multi-task Job Workflow to automatically detect, redact and tag PII:


2. DLT pipeline to automatically detect and redact PII:

The pipeline following a successful run (10M rows of data):


The expectations evaluated against our sample data:


3. Example of the column level PII tagging applied:


4. Example of the redacted output table:


Firewall Rules

The dynamic_firewall_rules.json file is the main way that you can customise the behaviour of how the detection/redaction/tagging of data works. Within the file you'll notice a number of rules defined as follows:

```json
{
  "name": "",
  "constraint": "",
  "action": "",
  "mode": "",
  "tag": "",
  "redact_threshold": 0,
  "tag_threshold": 0
}
```

Every rule that you specify here will be applied against every column of your input data. If those columns contain structs of nested data, the pipeline will flatten the first layer of those so that the expectations can be applied uniformly across all of the fields contained within them. To add new rules, just add a new JSON object with the following elements:

| Element | Is Mandatory | Can Contain | Description |
| --- | --- | --- | --- |
| name | Yes | any string | The name of the expectation. {} will be replaced by the column name. |
| constraint | Yes | a valid SQL invariant | The expectation whose success or failure determines whether the row contains PII or not. |
| action | No | a valid SQL expression | The action applied if any rows in the column fail their expectation AND the mode selected is REDACT or REDACT_AND_TAG. |
| mode | Yes | one of REDACT, TAG or REDACT_AND_TAG | Whether to REDACT, just TAG or REDACT_AND_TAG all of the rows in a column if any rows in that column fail their expectation. |
| tag | No | any string | The comment to be added to any columns found to contain PII. {} will be replaced by the column name. |
| redact_threshold | Yes | a number between 0 and 100 | During the generation of redaction SQL, the pipeline samples NUM_SAMPLE_ROWS rows to detect PII. If the % of failed expectations in these sample rows is greater than redact_threshold, the specified action is applied to that column. The threshold is useful for fine-tuning regexes that may generate a lot of false positives versus those you are confident will always find PII. |
| tag_threshold | Yes | a number between 0 and 100 | During the PII tagging step, all failed expectations are used to determine whether to tag the column as potentially containing PII. If the % of failed expectations for each expectation is greater than tag_threshold, the column is tagged as potentially containing PII. Like redact_threshold, this is useful for fine-tuning regexes that may generate a lot of false positives. |
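For example, a hypothetical rule to detect and redact email addresses might look like the sketch below. The regex and the exact constraint/action expressions are illustrative only - check the shipped config file for the real patterns and syntax:

```json
{
  "name": "{} may contain email addresses",
  "constraint": "NOT `{}` RLIKE '[\\w.+-]+@[\\w-]+\\.[\\w.]+'",
  "action": "regexp_replace(`{}`, '[\\w.+-]+@[\\w-]+\\.[\\w.]+', '[REDACTED]')",
  "mode": "REDACT_AND_TAG",
  "tag": "Column {} may contain email addresses",
  "redact_threshold": 50,
  "tag_threshold": 0
}
```

With these settings the column is only redacted if more than 50% of the sampled rows fail the constraint, but it is tagged as potentially containing PII if any failures are observed at all.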

Next Steps