Summary: Learn how to use BigQuery Remote Functions to de-identify and re-identify data with Cloud DLP using SQL.
This document describes how to detect and de-identify sensitive data, such as personally identifiable information (PII), in BigQuery tables with simple SQL-based functions, using Cloud Data Loss Prevention (Cloud DLP). De-identification techniques like encryption let you preserve the utility of your data for joins and analytics while reducing the risk of handling the raw sensitive identifiers.
Enterprises may have policy or regulatory requirements to store only de-identified data in their data warehouses, along with a need to quickly re-identify that data for report generation. To minimize the risk of handling large volumes of sensitive data, you can use an automated data transformation pipeline to create de-identified datasets, for example when migrating from on-premises to the cloud. You can use this project either to replace such a pipeline with a SQL query for de-identification and quick re-identification, or for re-identification only.
When a dataset has not been characterized, Cloud DLP can inspect the data for sensitive information using more than 100 built-in classifiers. Traditionally, using DLP to de-identify data requires building complex data pipelines or applications. This solution helps your data analysts, engineers, and scientists achieve the same outcome through simple SQL functions, using BigQuery Remote Functions, which are a powerful way to extend the functionality of BigQuery.
This document demonstrates a reference implementation for de-identifying structured data in BigQuery using DLP and a remote function hosted on Cloud Run.
This document is intended for an audience whose responsibilities include data security, data processing, or data analytics. This guide assumes that you're familiar with data processing and data privacy and can run simple shell and SQL scripts; you don't need to be an expert.
This tutorial uses billable components of Google Cloud, including the following:
Use the pricing calculator to generate a cost estimate based on your projected usage.
For this tutorial, you need a Google Cloud project. To make cleanup easiest at the end of the tutorial, we recommend that you create a new project for this tutorial.
Make sure that billing is enabled for your Google Cloud project.
At the bottom of the Cloud Console, a Cloud Shell session opens and displays a command-line prompt. Cloud Shell is a shell environment with the Cloud SDK already installed, including the gcloud command-line tool, and with values already set for your current project. It can take a few seconds for the session to initialize.
In Cloud Shell, clone the source repository and go to the directory for this tutorial:
git clone https://github.com/GoogleCloudPlatform/bigquery-dlp-remote-function.git
cd bigquery-dlp-remote-function/
Enable all the required Google Cloud APIs:
gcloud services enable \
artifactregistry.googleapis.com \
bigquery.googleapis.com \
bigqueryconnection.googleapis.com \
cloudbuild.googleapis.com \
cloudkms.googleapis.com \
containerregistry.googleapis.com \
dlp.googleapis.com \
run.googleapis.com \
secretmanager.googleapis.com
Authenticate using User Application Default Credentials (ADC) as the primary authentication method:
gcloud auth application-default login
Initialize and run the Terraform script to create all resources:
terraform init && \
terraform apply
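The Terraform configuration may prompt you for input variables. As a sketch, assuming the configuration exposes variables named project_id and region (check variables.tf in the repository for the actual names), you could supply them non-interactively with a terraform.tfvars file:

# terraform.tfvars -- hypothetical variable names; confirm against variables.tf
project_id = "my-project-id"
region     = "us-central1"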
After the script successfully creates all the resources, open the BigQuery console and run the following test SQL script:
SELECT
pii_column,
fns.dlp_freetext_encrypt(pii_column) AS dlp_encrypted,
fns.dlp_freetext_decrypt(fns.dlp_freetext_encrypt(pii_column)) AS dlp_decrypted
FROM
UNNEST(
[
'My name is John Doe. My email is john@doe.com',
'Some non PII data',
'212-233-4532',
'some script with simple number 1234']) AS pii_column
The query produces a result similar to the following table:
Row | pii_column | dlp_encrypted | dlp_decrypted |
---|---|---|---|
1 | My name is John Doe. My email is john@doe.com | My name is John Doe. My email is BQ_TRF_EMAIL(40):AQy6lGvwKR+AiiRqJpEr+nBzZUzOcjXkXamUugU= | My name is John Doe. My email is john@doe.com |
2 | Some non PII data | Some non PII data | Some non PII data |
3 | 212-233-4532 | BQ_TRF_PH(40):AbovCRcD0RwhqTJh1SuQmEfXERQoAmOKxwC2jc8= | 212-233-4532 |
4 | some script with simple number 1234 | some script with simple number 1234 | some script with simple number 1234 |
To customize the deployment, use the following steps:
Enable the required APIs (Artifact Registry, BigQuery, BigQuery Connection, Cloud Build, Cloud KMS, Container Registry, DLP, IAM, Cloud Run, and Secret Manager):
gcloud services enable --project "${PROJECT_ID}" \
artifactregistry.googleapis.com \
bigquery.googleapis.com \
bigqueryconnection.googleapis.com \
cloudbuild.googleapis.com \
cloudkms.googleapis.com \
containerregistry.googleapis.com \
dlp.googleapis.com \
iam.googleapis.com \
run.googleapis.com \
secretmanager.googleapis.com
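You can confirm that the APIs are active before continuing; gcloud provides a standard listing command for this:

gcloud services list --enabled --project "${PROJECT_ID}"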
In Cloud Shell, set your project ID, the region where you want to create your BigQuery and Cloud Run resources, and the resource names:
PROJECT_ID="<PROJECT_ID>"
REGION="<REGION_ID>"
ARTIFACT_REGISTRY_NAME="<ARTIFACT_DOCKER_REGISTRY_NAME>"
CLOUD_RUN_SERVICE_NAME="bq-transform-fns"
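For example (illustrative values only; substitute your own project ID, region, and repository name):

PROJECT_ID="my-dlp-project"
REGION="us-central1"
ARTIFACT_REGISTRY_NAME="bq-fns-repo"
CLOUD_RUN_SERVICE_NAME="bq-transform-fns"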
For fine-grained access control, we recommend using a dedicated service account for each service.
Create a service account:
RUNNER_SA_NAME="${CLOUD_RUN_SERVICE_NAME}-runner"
RUNNER_SA_EMAIL="${RUNNER_SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com"
gcloud iam service-accounts create "${RUNNER_SA_NAME}" \
--project="${PROJECT_ID}" \
--description "Runner for BigQuery remote function execution" \
--display-name "${RUNNER_SA_NAME}"
Grant the service account permissions to use DLP:
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
--member="serviceAccount:${RUNNER_SA_EMAIL}" \
--role='roles/dlp.deidentifyTemplatesReader'
gcloud projects add-iam-policy-binding "${PROJECT_ID}" \
--member="serviceAccount:${RUNNER_SA_EMAIL}" \
--role='roles/dlp.user'
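To verify the bindings, you can list the roles granted to the service account:

gcloud projects get-iam-policy "${PROJECT_ID}" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${RUNNER_SA_EMAIL}" \
    --format="value(bindings.role)"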
The transform function is a containerized Spring Boot application. Create an Artifact Registry repository to store the application's container image:
gcloud artifacts repositories create "${ARTIFACT_REGISTRY_NAME}" \
--repository-format=docker \
--location="${REGION}" \
--description="Docker repository for BigQuery Functions" \
--project="${PROJECT_ID}"
Build the application container image using Cloud Build:
gcloud builds submit . \
--project="${PROJECT_ID}" \
--substitutions=_CONTAINER_IMAGE_NAME="${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REGISTRY_NAME}/${CLOUD_RUN_SERVICE_NAME}:latest" \
--machine-type=e2-highcpu-8
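When the build finishes, you can confirm that the image landed in the repository:

gcloud artifacts docker images list "${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REGISTRY_NAME}"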
Deploy the container image to Cloud Run:
gcloud beta run deploy ${CLOUD_RUN_SERVICE_NAME} \
--image="${REGION}-docker.pkg.dev/${PROJECT_ID}/${ARTIFACT_REGISTRY_NAME}/${CLOUD_RUN_SERVICE_NAME}:latest" \
--execution-environment=gen2 \
--platform=managed \
--region="${REGION}" \
--service-account="${RUNNER_SA_EMAIL}" \
--update-env-vars=PROJECT_ID=${PROJECT_ID} \
--no-allow-unauthenticated \
--project ${PROJECT_ID}
Retrieve and save the Cloud Run URL:
RUN_URL="$(gcloud run services describe ${CLOUD_RUN_SERVICE_NAME} --region ${REGION} --project ${PROJECT_ID} --format="get(status.address.url)")"
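You can echo the variable to confirm that it captured a value; the service URL is typically of the form https://<service>-<hash>-<region-code>.a.run.app:

echo "${RUN_URL}"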
DLP de-identification templates make de-identification and inspection configurations easy to document, manage, and reuse. Create a de-identification template:
DEID_TEMPLATE=$(curl -X POST \
-H "Authorization: Bearer `gcloud auth print-access-token`" \
-H "Accept: application/json" \
-H "Content-Type: application/json" \
-H "X-Goog-User-Project: ${PROJECT_ID}" \
--data-binary "@sample_dlp_deid_config.json" \
"https://dlp.googleapis.com/v2/projects/${PROJECT_ID}/locations/${REGION}/deidentifyTemplates")
DEID_TEMPLATE_NAME=$(echo ${DEID_TEMPLATE} | jq -r '.name')
Note: We recommend using a KMS-wrapped key for DLP de-identification.
This document uses an unwrapped key to keep the demo simple.
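For reference, the request body posted above comes from sample_dlp_deid_config.json. A minimal sketch of such a config, using deterministic encryption with an unwrapped key, might look like the following; the info types, surrogate name, template ID, and key are placeholders, not the repository's actual values:

{
  "templateId": "dlp-demo-deid-template",
  "deidentifyTemplate": {
    "deidentifyConfig": {
      "infoTypeTransformations": {
        "transformations": [
          {
            "infoTypes": [{ "name": "EMAIL_ADDRESS" }, { "name": "PHONE_NUMBER" }],
            "primitiveTransformation": {
              "cryptoDeterministicConfig": {
                "cryptoKey": { "unwrapped": { "key": "<base64-encoded-256-bit-key>" } },
                "surrogateInfoType": { "name": "BQ_TRF_TOKEN" }
              }
            }
          }
        ]
      }
    }
  }
}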
Create BigQuery connection for accessing Cloud Run:
bq mk --connection \
--display_name='External transform function connection' \
--connection_type=CLOUD_RESOURCE \
--project_id="${PROJECT_ID}" \
--location="${REGION}" \
ext-${CLOUD_RUN_SERVICE_NAME}
Find the BigQuery Service Account used for the connection:
CONNECTION_SA="$(bq --project_id ${PROJECT_ID} --format json show --connection ${PROJECT_ID}.${REGION}.ext-${CLOUD_RUN_SERVICE_NAME} | jq -r '.cloudResource.serviceAccountId')"
Grant the BigQuery connection's service account the Cloud Run Invoker role so that it can call the Cloud Run service:
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member="serviceAccount:${CONNECTION_SA}" \
--role='roles/run.invoker'
Define the BigQuery dataset in which to create the remote functions:
BQ_FUNCTION_DATASET="fns"
Create the dataset if it doesn't exist:
bq mk --dataset \
--project_id ${PROJECT_ID} \
--location ${REGION} \
${BQ_FUNCTION_DATASET}
Create the DLP de-identification function:
DLP uses the default inspection template. If you want to use a custom inspection template, add the following pair to the user_defined_context: ('dlp-inspect-template', '<full inspection template name>')
bq query --project_id ${PROJECT_ID} \
--use_legacy_sql=false \
"CREATE OR REPLACE FUNCTION ${BQ_FUNCTION_DATASET}.dlp_freetext_encrypt(v STRING)
RETURNS STRING
REMOTE WITH CONNECTION \`${PROJECT_ID}.${REGION}.ext-${CLOUD_RUN_SERVICE_NAME}\`
OPTIONS (endpoint = '${RUN_URL}', user_defined_context = [('mode', 'deidentify'),('algo','dlp'),('dlp-deid-template','${DEID_TEMPLATE_NAME}')]);"
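For example, if you created a custom inspection template, the OPTIONS clause would carry one extra context pair. This is a fragment only; the template path below is a placeholder:

OPTIONS (endpoint = '${RUN_URL}',
  user_defined_context = [('mode', 'deidentify'),
    ('algo','dlp'),
    ('dlp-deid-template','${DEID_TEMPLATE_NAME}'),
    ('dlp-inspect-template','projects/my-project/locations/us-central1/inspectTemplates/my-inspect-template')]);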
Create the DLP re-identification function:
DLP uses the default inspection template. If you want to use a custom inspection template, add the following pair to the user_defined_context: ('dlp-inspect-template', '<full inspection template name>')
bq query --project_id ${PROJECT_ID} \
--use_legacy_sql=false \
"CREATE OR REPLACE FUNCTION ${BQ_FUNCTION_DATASET}.dlp_freetext_decrypt(v STRING)
RETURNS STRING
REMOTE WITH CONNECTION \`${PROJECT_ID}.${REGION}.ext-${CLOUD_RUN_SERVICE_NAME}\`
OPTIONS (endpoint = '${RUN_URL}', user_defined_context = [('mode', 'reidentify'),('algo','dlp'),('dlp-deid-template','${DEID_TEMPLATE_NAME}')]);"
Run the following query to confirm that the remote functions de-identify and re-identify data using SQL:
Using the BigQuery workspace:
SELECT
pii_column,
fns.dlp_freetext_encrypt(pii_column) AS dlp_encrypted,
fns.dlp_freetext_decrypt(fns.dlp_freetext_encrypt(pii_column)) AS dlp_decrypted
FROM
UNNEST(
[
'My name is John Doe. My email is john@doe.com',
'Some non PII data',
'212-233-4532',
'some script with simple number 1234']) AS pii_column
Alternatively, from Cloud Shell, run the following command to execute the query in BigQuery:
bq query --project_id ${PROJECT_ID} \
--use_legacy_sql=false \
"
SELECT
pii_column,
${BQ_FUNCTION_DATASET}.dlp_freetext_encrypt(pii_column) AS dlp_encrypted,
${BQ_FUNCTION_DATASET}.dlp_freetext_decrypt(${BQ_FUNCTION_DATASET}.dlp_freetext_encrypt(pii_column)) AS dlp_decrypted
FROM
UNNEST(
[
'My name is John Doe. My email is john@doe.com',
'Some non PII data'
]) AS pii_column"
Expected output:
pii_column | dlp_encrypted | dlp_decrypted |
---|---|---|
My name is John Doe. My email is john@doe.com | My name is John Doe. My email is BQ_TOK_FN(40):AQy6lGvwKR+AiiRqJpEr+nBzZUzOcjXkXamUugU= | My name is John Doe. My email is john@doe.com |
Some non PII data | Some non PII data | Some non PII data |
Tip: To improve performance, consider allocating more resources to the Cloud Run instance, for example --cpu=4 --memory=8Gi.
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, you can delete the project.
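Assuming you created a dedicated project for this tutorial, the simplest cleanup is to delete it:

gcloud projects delete "${PROJECT_ID}"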
License: Apache 2.0
This is not an official Google product.