Take a look at the companion blog for more info!
:warning: THIS PROJECT IS PROVIDED AS-IS WITHOUT ANY GUARANTEES: we make no claims as to the accuracy of the PII detection provided here, and if you decide to use it, it is YOUR RESPONSIBILITY to ensure that the example regexes and detection/redaction/tagging meet your internal, legal, or regulatory requirements.
Delta Live Tables makes it easy to build and manage reliable data pipelines that deliver high-quality data on Delta Lake.
To get this pipeline running in your environment, please follow these steps:
- `GENERATE_CLEAN_DATA`: Whether to generate 4 records of artificially created "clean" data, specifically designed not to be evaluated as PII
- `GENERATE_PII_DATA`: Whether to generate fake PII data
- `NUM_ROWS`: The number of rows of fake PII data to generate
- `OUTPUT_DIR`: The path on cloud storage to write the generated data to
- `INPUT_PATH`: The path on cloud storage where the input data is located
- `INPUT_FORMAT`: The format of the input data. One of "delta", "parquet", "json", or "csv" is supported
- `TABLE_PATH`: The path to write all of the tables created by the pipeline to
- `STORAGE_PATH`: A location on cloud storage where output data and metadata required for the pipeline execution are stored. This should match the Storage Location entered below.
- `EXPECTATIONS_PATH`: The path to the `pii_firewall_rules.json` config file, uploaded to a DBFS-accessible path (for example, a mounted storage bucket). This is the main configuration file used to customise the behaviour of the detection/redaction/tagging of data. See Firewall Rules below for more details.
- `NUM_SAMPLE_ROWS`: In order to generate the SQL used to automatically redact the PII discovered, the pipeline will sample this many rows of data, evaluate them against your expectations, and generate SQL code to leave the data unchanged or redact it accordingly. The fewer rows sampled, the faster this initial stage of the pipeline will run, but the more likely it is that PII will make it through the firewall.
- `NESTED_DEPTH`: Because regexes won't perform well on nested data, the pipeline will try to flatten any struct, map, or array types in the data. It will iterate over the data to this depth.
- Target: The name of the database for persisting all of the output data
- Storage Location: A location on cloud storage where output data and metadata required for the pipeline execution are stored. This should match the `STORAGE_PATH` entered above.

Depending on your cloud, you may also need to add:

- `instance_profile_arn` to the `aws_attributes` object
- the relevant credentials to the `spark_conf` object
- `google_service_account` to the `gcp_attributes` object

Then configure:

- `DATABASE_NAME`: The database to apply tagging (via properties) to. This should match the Target entered above.
- `TABLE_NAMES`: The tables within the database to apply tagging (via properties) and column-level comments to. The DLT pipeline creates 3 main tables: `clean`, `redacted`, and `output`; you can apply tagging to 1, 2, or all 3 of these.
- `EXPECTATIONS_PATH`: The path to the `dynamic_firewall_rules.json` config file, uploaded to a DBFS-accessible path (for example, a mounted storage bucket). This is the main configuration file used to customise the behaviour of the detection/redaction/tagging of data. See Firewall Rules below for more details.

The following data tables and views are created by this pipeline:
NOTE: The pipeline will try to flatten the first level of any nested columns in your input data. This is for both performance and accuracy reasons: you can't easily apply a `REGEX` to a `struct`, `array`, or `map`, and while you could try to cast the entire column to a string, the performance of that will be poor and it won't help you if you have lots of different nested PII. If you have more than one nested level, you may want to consider what to do here - you could update the function `flatten_dataframe()` to recursively find nested columns, or you could update it to automatically drop any additional nested columns it finds after the first pass.
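If you do go the recursive route, the idea can be sketched in pure Python on plain dicts (this is only an illustration of the recursive approach - the real `flatten_dataframe()` operates on Spark DataFrames, so treat this as a sketch, not a drop-in replacement):

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts (structs) and lists (arrays)
    into a single-level dict with underscore-joined column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested struct: recurse with an extended prefix
            flat.update(flatten(value, f"{name}_"))
        elif isinstance(value, list):
            # Nested array: index each element into its own column
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{name}_{i}_"))
                else:
                    flat[f"{name}_{i}"] = item
        else:
            flat[name] = value
    return flat

flatten({"a": {"b": 1}, "c": [{"d": 2}, 3]})
# -> {"a_b": 1, "c_0_d": 2, "c_1": 3}
```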
Name | Type | Description |
---|---|---|
staging | View | Initial view that data is loaded into. May contain PII and is therefore declared as a view (so that PII is not persisted after the pipeline has run). |
quarantine | View | View containing data that has failed expectations. May contain PII and is therefore declared as a view (so that PII is not persisted after the pipeline has run). |
clean | Table | Table containing data that has passed expectations and therefore is not expected to contain PII |
redacted | Table | Table containing data that has failed expectations and therefore is expected to contain PII but in which that PII has been redacted based on the specified actions |
output | Table | A union of clean and redacted, creating a table that contains either data that has passed expectations and therefore is not expected to contain PII or data that is expected to contain PII but has been redacted based on the specified actions |
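The relationship between `clean`, `redacted`, and `output` can be sketched in pure Python (using a single hypothetical email rule; the actual pipeline does this with Spark SQL expectations, so the names and pattern below are illustrative only):

```python
import re

# Hypothetical single firewall rule: a row fails if any monitored
# column contains an email address; the action replaces the match.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def passes(row):
    # A row is "clean" only if no column value matches the pattern
    return not any(EMAIL.search(str(v)) for v in row.values())

rows = [
    {"id": 1, "contact": "call me"},
    {"id": 2, "contact": "jane@example.com"},
]

# Rows passing all expectations land in `clean` unchanged
clean = [r for r in rows if passes(r)]

# Failing rows land in `redacted` with the action applied
redacted = [
    {k: EMAIL.sub("[REDACTED]", v) if isinstance(v, str) else v
     for k, v in r.items()}
    for r in rows if not passes(r)
]

# `output` is simply the union of the two tables
output = clean + redacted
```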
The following monitoring tables are created by this pipeline:
Name | Type | Description |
---|---|---|
event_logs | Table | Raw DLT event logs relating to the running and management of the pipeline |
audit_logs | Table | Logs capturing the management events relating to the pipeline |
data_quality_logs | Table | Logs capturing the DQ metrics relating to the pipeline |
flow_logs | Table | Logs capturing the runtime events relating to the pipeline |
When everything is set up correctly, run the multi-task (MT) job and you should see something like this...
The pipeline following a successful run (10M rows of data):
The expectations evaluated against our sample data:
The `dynamic_firewall_rules.json` file is the main way that you can customise the behaviour of how the detection/redaction/tagging of data works. Within the file you'll notice a number of rules defined as follows:

```json
{
  "name": "",
  "constraint": "",
  "action": "",
  "mode": "",
  "tag": ""
}
```
Every rule that you specify here will be applied against every column of your input data. If those columns contain structs of nested data, it will flatten the first layer of those so that the expectations can be applied uniformly across all of the fields contained within them. To add new rules, just add a new JSON object as follows:
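For example, a hypothetical rule that redacts and tags any column containing email addresses might look like this (the rule name, regex, and threshold values here are illustrative only, not part of the project's shipped rules):

```json
{
  "name": "{}_may_contain_email",
  "constraint": "{} NOT RLIKE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z][a-zA-Z]+'",
  "action": "regexp_replace({}, '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+[.][a-zA-Z][a-zA-Z]+', '[REDACTED]')",
  "mode": "REDACT_AND_TAG",
  "tag": "Column {} may contain email addresses",
  "redact_threshold": 0,
  "tag_threshold": 50
}
```

Note that because `{}` is used as the column-name placeholder, it is safest to avoid literal braces (such as `{2,}` quantifiers) inside your regex patterns.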
Element | Is Mandatory | Can Contain | Description |
---|---|---|---|
name | Yes | any string | The name of the expectation. {} will be replaced by the column name. |
constraint | Yes | a valid SQL invariant | The expectation on which success or failure will determine whether the row contains PII or not. |
action | No | a valid SQL expression | The action that will be applied if any rows in the column fail when evaluated against their expectation AND the mode selected is REDACT or REDACT_AND_TAG |
mode | Yes | One of `REDACT`, `TAG`, or `REDACT_AND_TAG` | Whether to `REDACT`, just `TAG`, or `REDACT_AND_TAG` all of the rows in any column if any rows in the column fail when evaluated against their expectation |
tag | No | any string | The comment to be added to any columns found to contain PII. {} will be replaced by the column name. |
redact_threshold | Yes | A number between 0 and 100 | During the generation of redaction SQL, the pipeline will sample `NUM_SAMPLE_ROWS` rows to detect PII. If the % of failed expectations found in these sample rows is greater than the `redact_threshold`, that column will have the specified action applied to it. The threshold is useful for fine-tuning regexes that may generate a lot of false positives versus those you are confident will always find PII. |
tag_threshold | Yes | A number between 0 and 100 | During the PII tagging step, all failed expectations will be used to determine whether to tag the column as potentially containing PII. If the % of failed expectations for a given expectation is greater than the `tag_threshold`, the column will be tagged as potentially containing PII. This threshold is useful for fine-tuning regexes that may generate a lot of false positives versus those you are confident will always find PII. |
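The threshold logic above can be sketched as follows (pure Python with a hypothetical email regex; the real pipeline evaluates SQL expectations over a Spark sample):

```python
import re

# Hypothetical expectation: a sampled value fails if it contains an email
EMAIL = re.compile(r"[\w.+-]+@[\w-]+[.][\w.]+")

def should_redact(sample_values, redact_threshold):
    """Return True if the % of sampled values failing the expectation
    exceeds redact_threshold, meaning the column gets the action."""
    failed = sum(1 for v in sample_values if EMAIL.search(v))
    pct_failed = 100 * failed / len(sample_values)
    return pct_failed > redact_threshold

sample = ["hi there", "jane@example.com", "see you", "bob@corp.io"]
# 2 of 4 sampled rows fail (50%), so a 25% threshold triggers redaction
# while a 75% threshold does not.
should_redact(sample, 25)   # True
should_redact(sample, 75)   # False
```

A low threshold is conservative (more redaction, more false positives); a high threshold suits regexes you trust to fire only on genuine PII.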
- `name`: short of having a dictionary of every person's name on the planet - in every language and every alphabet - detecting a person's name via regex is pretty much an impossible job
- `freetext`: it's hard to iterate expectations and the associated regex SQL across a column multiple times. Right now, only the first piece of PII detected in a column will be redacted, which may cause problems with columns that contain long strings of text