danielbeach / lakescum

A Python package to help Databricks Unity Catalog users to read and query Delta Lake tables with Polars, DuckDb, or PyArrow.

to_pandas is not a scalable approach #1

Open Johannes-Vink opened 4 months ago

Johannes-Vink commented 4 months ago

Nice approach, and I was happy to start using this. However, using to_pandas means the driver is going to run hot and run out of memory with big data sets.

AFAIK there is also no lazy execution; it will execute immediately and process all the data from the Delta table.

A better way would be to query the physical location of every Delta table and then use read_delta from, for example, Polars.

Not sure yet if that is going to work with access control on the Azure Storage Account.
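
Something like this is the idea; a minimal sketch, assuming an abfss:// table location and object_store-style Azure credential keys in storage_options (both are assumptions and untested here, hence the open question about access control):

import polars as pl

# Sketch of the idea: point Polars at the table's physical location and stay lazy.
# The abfss path and the storage_options keys below are assumptions; they may need
# to change depending on how access to the Storage Account is configured.
storage_options = {
    "azure_storage_account_name": "<account-name>",
    "azure_storage_account_key": "<account-key>",
}

lf = pl.scan_delta(
    "abfss://<container>@<account>.dfs.core.windows.net/path/to/delta-table",
    storage_options=storage_options,
)
df = lf.select(["col_a", "col_b"]).collect()  # data is only materialized at collect()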

Riik commented 3 weeks ago

I agree, to_pandas kind of defeats the purpose. I've found that this works for me with Databricks / Unity Catalog set up on AWS:

import polars as pl

def get_table_path(table_name):
    # Ask Spark for the table's physical storage location
    return spark.sql(f"DESCRIBE DETAIL {table_name}").first().location

storage_options = {"region": "eu-central-1"}

# Lazily scan the Delta table directly from its storage location
df = pl.scan_delta(get_table_path("my_database.some_table"), storage_options=storage_options)

For me this worked out of the box, but depending on your setup you might need to add the credentials to the storage options as well.

Edit: this only works if you have direct access to the S3 bucket that the Delta table resides in, which might not be the case with a standard Unity Catalog setup. This example worked for an externally managed table.

Riik commented 3 weeks ago

Update: I found the proper way to handle this when it comes to Unity Catalog. I used DuckDB's implementation as a reference. The core of it comes down to this, where you use the Unity Catalog REST API to get temporary storage credentials for accessing the underlying data source:

import requests

headers = {"Authorization": f"Bearer {api_token}"}
params = {"table_id": table_id, "operation": "READ"}

# Ask Unity Catalog for short-lived, read-only credentials for this table
response = requests.post(api_url + "/api/2.1/unity-catalog/temporary-table-credentials",
                         headers=headers, params=params)
temp_credentials = response.json()["aws_temp_credentials"]

pl.read_delta(get_table_path(full_table_name), storage_options=temp_credentials)

Here is the Unity Catalog API reference, where you can use the listTables endpoint to get the table id.
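
For what it's worth, the table_id lookup could look roughly like this; a sketch that assumes the GET /api/2.1/unity-catalog/tables list endpoint and the api_url / api_token shown below, and get_table_id is just a name I made up:

import requests

# Sketch: find the table_id via the Unity Catalog listTables endpoint.
# Assumes api_url and api_token are obtained as shown further down;
# pagination is ignored here for brevity.
def get_table_id(catalog_name, schema_name, table_name):
    resp = requests.get(
        api_url + "/api/2.1/unity-catalog/tables",
        headers={"Authorization": f"Bearer {api_token}"},
        params={"catalog_name": catalog_name, "schema_name": schema_name},
    )
    resp.raise_for_status()
    tables = resp.json().get("tables", [])
    return next(t["table_id"] for t in tables if t["name"] == table_name)

table_id = get_table_id("my_catalog", "my_schema", "some_table")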

Final piece of the puzzle: in Databricks you can obtain API credentials like so:

import json

# Pull the workspace URL and an API token from the notebook context
notebook_context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().safeToJson())
api_token = notebook_context['attributes']['api_token']
api_url = notebook_context['attributes']['api_url']
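
Putting the pieces together, a rough outline of the whole read path; get_table_path is from my earlier comment and get_table_id is the hypothetical lookup sketched above, so treat this as a sketch rather than tested code:

import requests
import polars as pl

# Outline: resolve the table_id, fetch temporary credentials, then read with Polars.
# Assumes api_url, api_token, get_table_path and get_table_id exist as shown above.
def read_uc_table(full_table_name):
    catalog, schema, table = full_table_name.split(".")
    temp_credentials = requests.post(
        api_url + "/api/2.1/unity-catalog/temporary-table-credentials",
        headers={"Authorization": f"Bearer {api_token}"},
        params={"table_id": get_table_id(catalog, schema, table), "operation": "READ"},
    ).json()["aws_temp_credentials"]
    return pl.read_delta(get_table_path(full_table_name), storage_options=temp_credentials)

df = read_uc_table("my_catalog.my_schema.some_table")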

Let me know if you're interested in having a small PR with this implementation.