duckdb / duckdb_azure

Azure extension for DuckDB
MIT License
50 stars 17 forks source link

Globbing on dfs (`abfss`) endpoint for Azure Data Lake Storage Gen2 does not work properly #80

Open keen85 opened 2 months ago

keen85 commented 2 months ago

I noticed some unexpected behavior when using globbing on an ADLSGen2 (Azure Data Lake Storage Gen2; Azure Storage Account). ADLSGen2 supports blob and dfs endpoint. According to documentation, globbing should be supported for both. However, I noticed some queries that work for blob but not for the dfs endpoint:

endpoint executable? statement
blob (az) OK SELECT * FROM glob('az://some_folder/DAY=2024-09-20/HOUR=9/foo')
blob (az) OK SELECT * FROM glob('az://some_folder/DAY=2024-09-20/HOUR=9/*.json')
blob (az) OK SELECT * FROM glob('az://some_folder/DAY=2024-09-20/HOUR=9/*')
blob (az) OK SELECT * FROM glob('az://some_folder/DAY=2024-09-20/HOUR=9/**')
blob (az) OK SELECT * FROM glob('az://some_folder/DAY=2024-09-20/**')
dfs (abfss) OK SELECT * FROM glob('abfss://some_folder/DAY=2024-09-20/HOUR=9/foo.json')
dfs (abfss) ERROR SELECT * FROM glob('abfss://some_folder/DAY=2024-09-20/HOUR=9/*.json')
dfs (abfss) ERROR SELECT * FROM glob('abfss://some_folder/DAY=2024-09-20/HOUR=9/*')
dfs (abfss) ERROR SELECT * FROM glob('abfss://some_folder/DAY=2024-09-20/HOUR=9/**')
dfs (abfss) OK SELECT * FROM glob('abfss://some_folder/DAY=2024-09-20/**')

ERROR: SQL Error: java.sql.SQLException: Invalid Error: 404 The specified path does not exist. The specified path does not exist.

olsgaard commented 4 weeks ago

I'm having similar issues:

After starting duckdb from the commandline and setting up secrets, I get the following errors when using abfss, but not when using az protocol:

D SELECT * FROM 'abfss://<container>/orders/**/*.parquet' limit 10;
Not implemented Error: abfss do not manage recursive lookup patterns, daily-vessel-consumption/**/*.parquet is therefor illegal, only pattern ending by ** are allowed.

But if I try that, I get:

SELECT * FROM 'abfss://<container>/orders/**' limit 10;
Catalog Error: Table with name abfss://<container>/orders/** does not exist!
Did you mean "pg_tablespace"?
LINE 1: SELECT * FROM 'abfss://<container>/orders/...

I should be running the latest release version and am running on ubuntu linux.

$ duckdb --version
v1.1.2 f680b7d08f
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04 LTS
Release:        20.04
Codename:       focal