databrickslabs / discoverx

A Swiss-Army-knife for your Data Intelligence platform administration.

DiscoverX

Your Swiss-Army-knife for Lakehouse administration.

DiscoverX automates administration tasks that require inspecting or applying operations to a large number of Lakehouse assets.

Multi-table operations with SQL templates

You can execute a SQL template against multiple tables by chaining from_tables and with_sql, then applying one of the actions listed under "from_tables Actions".

DiscoverX will concurrently execute the SQL template against all Delta tables matching the selection pattern and return a Spark DataFrame with the union of all results.
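The behaviour described above could be used roughly as follows. This is a minimal sketch, not the project's reference example: from_tables and with_sql are the functions named in this README, while the apply action, the {full_table_name} placeholder, the selection pattern, and the helper name sample_one_row_per_table are assumptions made for illustration and should be checked against the installed version.

```python
# Plain string (not an f-string), so the {full_table_name} placeholder is kept
# literally and left for DiscoverX to fill in per table (assumed placeholder name).
SAMPLE_ROW_TEMPLATE = "SELECT to_json(struct(*)) AS row FROM {full_table_name} LIMIT 1"


def sample_one_row_per_table(dx, pattern="dev_*.*.*sample*"):
    """Return one JSON-encoded row from every table matching `pattern`.

    `dx` is an already constructed DX object (see "Getting started").
    The pattern is a hypothetical catalog.schema.table selection with wildcards.
    """
    return (
        dx.from_tables(pattern)         # select the tables to operate on
        .with_sql(SAMPLE_ROW_TEMPLATE)  # SQL template, rendered once per table
        .apply()                        # assumed action: run the queries, return the union
    )


# In a Databricks notebook, usage would look like:
#   from discoverx import DX
#   dx = DX(locale="US")
#   df = sample_one_row_per_table(dx)
```

Passing dx in as a parameter keeps the helper importable outside Databricks, which makes it easier to unit test with a stand-in object.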

Some useful SQL templates are

The available variables to use in the SQL templates are
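To make the substitution concrete, here is a plain-Python sketch of how a template could be rendered once per table. It does not use DiscoverX and is not its implementation; the placeholder names (full_table_name, table_catalog, table_schema, table_name), the example templates, and the render_template helper are assumptions for illustration only.

```python
def render_template(sql_template, catalog, schema, table):
    """Fill the per-table placeholders of a SQL template (illustrative only)."""
    return sql_template.format(
        table_catalog=catalog,
        table_schema=schema,
        table_name=table,
        full_table_name=f"{catalog}.{schema}.{table}",
    )


# Hypothetical templates; each is rendered separately for every selected table.
templates = {
    "details": "DESCRIBE DETAIL {full_table_name}",
    "history": "DESCRIBE HISTORY {full_table_name}",
    "row_count": (
        "SELECT '{table_name}' AS table_name, count(*) AS row_count "
        "FROM {full_table_name}"
    ),
}

for name, template in templates.items():
    # "dev", "sales" and "orders" are made-up catalog, schema and table names.
    print(name, "->", render_template(template, "dev", "sales", "orders"))
```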

A more advanced SQL example

You can restrict the selection to tables that contain a specific column name, and then use that column name in the queries.
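A column-filtered query might be sketched as below. The having_columns filter, the apply action, and the {full_table_name} placeholder are assumed from the author's recollection of the DiscoverX API rather than taken from this text, and distinct_values_per_table is a hypothetical helper name; verify each against the version you have installed.

```python
def distinct_values_per_table(dx, column, pattern="*.*.*"):
    """Collect the distinct values of `column` from every table that has it.

    `column` is interpolated directly into the SQL, so it must come from a
    trusted source (for example a hard-coded column name).
    """
    # Doubled braces keep {full_table_name} as a literal placeholder in the
    # resulting template, while {column} is filled in right away.
    sql_template = f"SELECT DISTINCT {column} FROM {{full_table_name}}"
    return (
        dx.from_tables(pattern)
        .having_columns(column)  # assumed filter: keep only tables with this column
        .with_sql(sql_template)
        .apply()                 # assumed action: run the queries, return the union
    )


# In a Databricks notebook, usage would look like:
#   df = distinct_values_per_table(dx, "device_id", pattern="dev_*.*.*")
```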


Multi-table operations with Python functions

DiscoverX can concurrently apply Python functions to multiple assets.
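As a rough sketch of this style, the snippet below maps a small function over the selected tables. The map method and the table_info attributes used here (catalog, schema, table) are assumptions based on the author's recollection of the DiscoverX API, and both function names are hypothetical.

```python
def full_name(table_info):
    """Build the three-level name of a table from its (assumed) attributes."""
    return f"{table_info.catalog}.{table_info.schema}.{table_info.table}"


def list_matching_tables(dx, pattern="dev_*.*.*sample*"):
    """Apply `full_name` to every table matching `pattern`.

    `dx` is an already constructed DX object (see "Getting started").
    """
    # Assumed API: map calls the function once per selected table and
    # collects the return values.
    return dx.from_tables(pattern).map(full_name)


# In a Databricks notebook, usage would look like:
#   names = list_matching_tables(dx)
```

Any function that accepts a single table_info argument could be passed in place of full_name, for example one that runs a maintenance command for that table.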


The properties available in table_info are

Example Notebooks

Getting started

To install DiscoverX, run the following in a Databricks notebook:

%pip install dbl-discoverx

Get started

from discoverx import DX
dx = DX(locale="US")

You can now run operations across multiple tables.

Available functionality

The available dx functions are

from_tables Actions

After a with_sql or unpivot_string_columns command, you can apply the following actions:

Requirements

Project Support

Please note that all projects in the /databrickslabs GitHub account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.