duckdb / duckdb_delta

DuckDB extension for Delta Lake
MIT License

register_filesystem has no effect after v1.1.0 upgrade #87

Open ozen opened 1 month ago

ozen commented 1 month ago

Using fsspec Filesystems used to work for me when using delta_scan. Now it doesn't, and the reason appears to be the version upgrade.

This code works as expected:

import duckdb
from fsspec import filesystem

duckdb.register_filesystem(filesystem('gcs'))
duckdb.sql("SELECT * FROM read_csv('gcs:///bucket/file.csv')")

This code used to work, now raises an exception:

import duckdb
from fsspec import filesystem

duckdb.register_filesystem(filesystem('gcs'))
duckdb.sql("SELECT * FROM delta_scan('gcs:///bucket/table')")

The exception:

NotImplementedException: Not implemented Error: Can not scan a gcs:// gs:// or r2:// url without a secret providing its endpoint currently. Please create an R2 or GCS secret containing the credentials for this endpoint and try again.

I think register_filesystem must have priority over builtin filesystems.

samansmink commented 1 month ago

Hey @ozen, thanks for reporting. This is indeed slightly quirky at the moment. However, the error message hints at the workaround: creating a GCS type secret should make this work. Note that you should not need fsspec here.

import duckdb
duckdb.sql("CREATE SECRET gcs1 (TYPE GCS)")
duckdb.sql("SELECT * FROM read_csv('gcs://bucket/file.csv')")

Also note that using fsspec with authentication will currently not work at all, because part of the IO is handled by the kernel using its internal cloud storage libraries, while the other part is handled through DuckDB. This means that any auth you configure through fsspec will not be propagated to the kernel.

Either way I will look into removing the need for the empty gcs secret here.
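The snippet above uses read_csv, while the failing query in the report was delta_scan. A minimal sketch of the same workaround applied to delta_scan might look like the following. The helper names `quote_literal` and `scan_delta_on_gcs` are hypothetical, `con` is assumed to be anything exposing `.sql()` (the `duckdb` module itself or a connection from `duckdb.connect()`), and the sketch assumes the delta extension is available and that the empty secret is enough for public data, as suggested above.

```python
def quote_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal (doubles single quotes)."""
    return "'" + value.replace("'", "''") + "'"


def scan_delta_on_gcs(con, table_url: str, secret_name: str = "gcs1"):
    """Create an empty GCS secret, then run delta_scan on table_url.

    `con` is any object with a .sql(str) method, for example the duckdb
    module or a duckdb connection (assumption: not tied to one of them).
    """
    if not secret_name.isidentifier():
        raise ValueError(f"unsafe secret name: {secret_name!r}")
    # Empty secret, as in the workaround above; OR REPLACE keeps reruns from failing.
    con.sql(f"CREATE OR REPLACE SECRET {secret_name} (TYPE GCS)")
    return con.sql(f"SELECT * FROM delta_scan({quote_literal(table_url)})")
```

With DuckDB installed, this could be called as `scan_delta_on_gcs(duckdb, 'gcs://bucket/table')` after `import duckdb`. No fsspec filesystem is registered in this sketch.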

ozen commented 1 month ago

@samansmink thanks for the detailed answer.

From an enterprise standpoint, there are considerable differences between using HMAC keys with the interoperability layer and using the standard methods of GCP authentication. I think not every user will be able to simply use HMAC keys. fsspec provides a way to use the GCP authentication schemes.
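For context, the HMAC route referred to here is usually expressed as a GCS secret carrying a key ID and a secret. A minimal sketch that only builds the SQL statement is shown below. The function name `gcs_hmac_secret_sql` is hypothetical and the key values are placeholders. It illustrates the HMAC option only and does not cover the standard GCP authentication schemes discussed in this comment.

```python
def gcs_hmac_secret_sql(name: str, key_id: str, secret: str) -> str:
    """Build a CREATE SECRET statement for a GCS secret with HMAC keys."""
    if not name.isidentifier():
        raise ValueError(f"unsafe secret name: {name!r}")

    def lit(value: str) -> str:
        # Quote as a SQL string literal, doubling embedded single quotes.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"CREATE SECRET {name} "
        f"(TYPE GCS, KEY_ID {lit(key_id)}, SECRET {lit(secret)})"
    )
```

The resulting string could be passed to `duckdb.sql(...)`. In practice the key values would come from the environment or a secret store rather than from source code.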

Is there any way to move the IO from the kernel to duckdb?

samansmink commented 1 month ago

Well, I think it may have worked accidentally before, but only on public data. I don't really see how authentication would have worked there.

Is there any way to move the IO from the kernel to duckdb?

Yes! This is actually what the peeps over at the delta-kernel-rs project are working on right now. Currently, DuckDB relies on the kernel to do IO for things like metadata reads, deletion vector reads, checkpoints, etc. However, the idea is that the kernel will support APIs in the future to ensure DuckDB can do all IO itself. This will also allow us to remove the convoluted code in https://github.com/duckdb/duckdb_delta/blob/24d9b782b1da7676e4c8aae7b9d7650cb035276c/src/functions/delta_scan.cpp#L115 that we now require.

With that, we will be able to support using fsspec for Delta cleanly.

ozen commented 1 month ago

@samansmink Thank you again for the detailed explanation. Great to hear that!