
LakeScum

A Python package to help Databricks Unity Catalog users read and query Delta Lake tables with Polars, DuckDB, or PyArrow.

Unity Catalog does not play nicely out of the box with many of these tools when you use their built-in features, like polars.read_delta() for example.

LakeScum takes that difficulty away.

Installation

LakeScum can be installed with a simple pip command.

pip install lakescum
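
If you are working inside a Databricks notebook, the notebook-scoped equivalent works too:

%pip install lakescum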

Usage

There are currently methods to read and query a Unity Catalog Delta Lake table with ...

Polars

You can query a Unity Catalog Delta Lake table and get back a Polars DataFrame with the following method: unity_catalog_delta_to_polars()

It takes two required parameters and one optional.

spark: SparkSession - Spark session
table_name: str - Unity Catalog table name
sql_filter - Optional SQL WHERE clause filter

Example ...

polars_df = unity_catalog_delta_to_polars(spark, 
    'production.default.fact_orders',
    sql_filter="year = 2024 and month = 3 and day = 10")

print(polars_df.head(10))

order_id | product_id | order_date | quantity
1 | 4567 | '2024-03-10' | 5
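
The result is a regular Polars DataFrame, so everything downstream is plain Polars. A minimal sketch, assuming the column names shown in the sample output above:

import polars as pl

# Aggregate the filtered orders by product with the normal Polars API.
# Column names (product_id, quantity) are assumed from the sample output above.
# Polars >= 0.19 spells it group_by; older releases use groupby.
top_products = (
    polars_df
    .group_by("product_id")
    .agg(pl.col("quantity").sum().alias("total_quantity"))
    .sort("total_quantity", descending=True)
)

print(top_products.head(5))
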
DuckDB

This method registers a Unity Catalog Delta table as a DuckDB table so you can query it with DuckDB: unity_catalog_delta_register_to_duckdb()

It takes three required parameters and one optional.

spark: SparkSession - Spark session
unity_table_name: str - Unity Catalog table name
duck_table_name: str - Desired DuckDB table name
sql_filter - Optional SQL WHERE clause filter

Example ...

import duckdb

unity_catalog_delta_register_to_duckdb(spark,
    "production.default.fact_orders",
    "test",
    sql_filter="year = 2024 and month = 3 and day = 19")

results = duckdb.sql("SELECT * FROM test")

print(results)

order_id | product_id | order_date | quantity
1 | 4567 | '2024-03-10' | 5
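
Because the data is now just a DuckDB table, any further work is ordinary DuckDB SQL. Continuing the example above (column names assumed from the sample output):

# Aggregate the registered table with plain DuckDB SQL.
top_products = duckdb.sql("""
    SELECT product_id, SUM(quantity) AS total_quantity
    FROM test
    GROUP BY product_id
    ORDER BY total_quantity DESC
""")

print(top_products)
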
PyArrow

This method returns a PyArrow Table from a Unity Catalog Delta table: unity_catalog_delta_to_pyarrow()

It takes two required parameters and one optional.

spark: SparkSession - Spark session
table_name: str - Unity Catalog table name
sql_filter - Optional SQL WHERE clause filter

Example ...

pa_table = unity_catalog_delta_to_pyarrow(spark,
    "production.default.fact_orders",
    sql_filter="year = 2024 and month = 3 and day = 19")

print(pa_table)

order_id | product_id | order_date | quantity
1 | 4567 | '2024-03-10' | 5
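
Since this is a standard pyarrow.Table, it drops straight into the rest of the Arrow ecosystem. A minimal sketch, assuming the column names from the sample output above, that runs an Arrow compute kernel and writes the filtered slice to a local Parquet file:

import pyarrow.compute as pc
import pyarrow.parquet as pq

# Sum the quantity column with an Arrow compute kernel
# (column name assumed from the sample output above).
total_quantity = pc.sum(pa_table["quantity"])
print(total_quantity)

# Persist the filtered slice as a local Parquet file.
pq.write_table(pa_table, "fact_orders_2024_03_19.parquet")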