duckdb / duckdb_azure

Azure extension for DuckDB
MIT License
50 stars 17 forks source link

[Feature Request] List folders in ADLSGen2 #79

Open keen85 opened 2 months ago

keen85 commented 2 months ago

Hi, I found the following to ways to list files in Azure Data Lake Storage Gen2 (Azure Storage Account with hierarchical namespaces):

SELECT file FROM glob("abfss://<container>@<storage_Account>.dfs.core.windows.net/*/*/*/_delta_log/_last_checkpoint");

SELECT size, filename, last_modified FROM read_blob('abfss://<container>@<storage_Account>.dfs.core.windows.net/*/*/*/_delta_log/_last_checkpoint');

I was wondering if there is a way to list folders.

What I'm trying to achieve:

I'm trying to find all Delta Lake tables in my ADLSGen2. A Folder named _delta_log would be an indication that the parent folder represents a Delta Lake table:

tmp/my_delta_lake_table
├── _SUCCESS
├── _delta_log
│   ├── 00000000000000000000.checkpoint.parquet
│   ├── 00000000000000000000.json
│   └── _last_checkpoint
├── part-00000-1f1cc136-76ea-4185-84d6-54f7e758bfb7-c000.snappy.parquet
├── part-00003-1f1cc136-76ea-4185-84d6-54f7e758bfb7-c000.snappy.parquet
├── part-00006-1f1cc136-76ea-4185-84d6-54f7e758bfb7-c000.snappy.parquet
└── part-00009-1f1cc136-76ea-4185-84d6-54f7e758bfb7-c000.snappy.parquet

My workaround (see above) is to look for the _last_checkpoint files, But these are not always present. Looking for _delta_log/*.json does work reliably but it is very slow if there are many Delta Lake tables with many versions.