jegranado opened this issue 1 month ago
Did you install and load the duckdb Azure extension?
Hello,
Yes. I can read .csv files in Azure, but delta_scan
fails with the message above.
I can use delta_scan
with local Delta tables.
just had a look at the code - looks like support for s3://
is hardcoded, so I suspect support for abfss://
would have to follow a similar path.
Looking forward to this being available.
If I got it correctly, static ffi::EngineBuilder* CreateBuilder would have to be changed to add the abfs:// possibility by adding that condition to the path parser. Then, the builder would have to set specific attributes if Azure-type secrets exist.
Looking at the duckdb_azure extension codebase, at least the connection_string would have to be set up (depending on which type of Azure secret was found).
I'd love to help myself, but I fall short of C++ skills here.
I'm having the same problem; please add support for Azure.
@jegranado I tried to hack it in, built and loaded it, however it doesn't change anything. Do you have an idea?
I'm having the same problem. It works well if I directly query a parquet file within the Delta table folder, however it gives me the same error when I use delta_scan.
Hi, I tried this with a GCS path ("gs://") ... and indeed it does not work, while read_parquet works with the same credentials.
However, the error is very strange.
duckdb.duckdb.IOException: IO Error: Hit DeltaKernel FFI error (from: While trying to read from delta table: 'gs://my-bucket/my-table'): Hit error: 8 (ObjectStoreError) with message (Error interacting with object store: Generic GCS error: Error performing token request: Error after 10 retries in 6.122011584s, max_retries:10, retry_timeout:180s, source:error sending request for url (http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?audience=https%3A%2F%2Fwww.googleapis.com%2Foauth2%2Fv4%2Ftoken): error trying to connect: tcp connect error: Host is down (os error 64))
I implemented Azure functionality, however I'm not sure if that is the right way. It uses the duckdb Azure plugin and delta-rs blob authentication; it's complicated as they both require different Azure parameters. But it's working on Azurite and with Blob access via CLI and connection string/access keys.
@nfoerster2 Thanks for that implementation. While I agree with you that this is not ideal due to the complexity of using both DuckDB's filesystem and the Kernel's internal filesystems, I think it's a good idea to merge this. The alternative would be to wait for delta-kernel-rs to support letting DuckDB fully handle all filesystem ops, but that could take a little while still, and having some Azure auth methods working seems like low-hanging fruit in the meantime.
I would propose to PR your code into the delta extension; I will add some CI jobs based on Azurite and Minio to ensure this can actually be tested. Feel free to open a PR with your code, and I can review and merge it. Otherwise I will open one with your changes and some testing later this week.
Sure, I can create a PR. I think there are still some bugs. I added two test cases which are working fine; however, I also tested some production data: it's a Delta Lake with around 1 TB of data and two layers of partitioning (serial number SN as string and yyyymm as int, so the pattern for one file of the Delta Lake is partition_sn_yyymm_i5m_v15-3/SN=ZZZZ555/yyyymm=202406/blah.parquet), and it fails during partition discovery. I added the output below but anonymized the data. The serial number column SN is a string, however it tries to interpret it as an int. I think more complex tests are needed.
D SELECT *
  FROM delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_i5m_v15-3/')
  WHERE SN='XYZ1234';
Invalid Input Error: Failed to cast value: Could not convert string 'ZZZZ555' to INT32
From the delta log:
{"add":{"path":"SN=8XYZ1337/yyyymm=202405/part-00001-d88a25a9-a3f8-4360-9b12-2e737820fa16-c000.zstd.parquet","partitionValues":{"SN":"8XYZ1337","yyyymm":"202405"},"size":...
Then if I just go by the second partition level, it takes a huge amount of time; I think it does not push the predicate down to the partitions correctly:
SELECT SN
  FROM delta_scan('az://deltalake/delta/k8s/partition_sn_yyymm_interval_5m/')
  WHERE yyyymm=202212;
Very nice @samansmink . Been looking forward to this. 😊
SELECT * FROM delta_scan('abfss://<account>.dfs.core.windows.net/<path>/<delta_table>/');
results in:
IOException: IO Error: Hit DeltaKernel FFI error (from: get_default_client in DeltaScanScanBind): Hit error: 5 (GenericError) with message (Generic delta kernel error: Error interacting with object store: Generic parse_url error: feature for MicrosoftAzure not enabled)
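The "feature for MicrosoftAzure not enabled" message comes from the Rust object_store crate used underneath delta-kernel-rs: its URL parser rejects Azure URLs when the crate was compiled without its azure feature. A hedged sketch of the kind of Cargo dependency line involved (the exact version and feature wiring inside delta-kernel-rs is an assumption here):

```toml
# Illustrative only: enable the object_store cloud features so abfss://
# (azure), gs:// (gcp), and s3:// (aws) URLs can be parsed.
[dependencies]
object_store = { version = "0.11", features = ["azure", "gcp", "aws"] }
```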