duckdb / duckdb_azure

Azure extension for DuckDB
MIT License
51 stars 17 forks source link

Add support for abfss://container@storage.endpoint/path URL syntax #72

Closed gdubya closed 2 months ago

gdubya commented 3 months ago

While playing with the uc_catalog and delta extensions I encountered an error while trying to read from an ADLS gen2 account using the abfss syntax in this format: abfss://container@storage.endpoint/path.

This PR corrects the URL parser to account for this additional variation.

Linked issues:

  1. https://github.com/duckdb/duckdb_delta/issues/71
  2. https://github.com/duckdb/uc_catalog/issues/7
  3. https://github.com/duckdb/duckdb_azure/issues/71
sugibuchi commented 3 months ago

Could you please consider supporting abfs:// in addition to abfss:// as well?

abfs:// is commonly used with the Hadoop ABFS support (the original implementation of ABFS). There is no difference between abfs:// and abfss:// as the Hadoop ABFS support enforces TLS by default.

gdubya commented 3 months ago

@sugibuchi I'm happy to try adding that, but I would like to do it as a separate PR to limit the scope of what's affected here. The abfss string is used quite a few places so it will be a bit of work to add support for abfs as well, as simple as it sounds.

gdubya commented 3 months ago

@samansmink Could you have a look at this PR and merge if all looks ok? Apologies if it's not you that I should ping. I just looked at the other open PR. Maybe there should be a CODEOWNERS file in the repo?

quentingodeau commented 3 months ago

Hello, I had in the past allow the syntax that you are proposing here but remove it because how scope for secret work. The scope when defined on secret return the score with the path that the score is simply the common distance between path and scope. So if you want to define a secret for a all storage account with the container@sa you cannot anymore.

From my point a view I do not think that a good idea but maybe I a particular user that use too many storage account.

A solution for this can also be to change the way secrets score is compute and allow wildcard.

Regards Quentin

gdubya commented 3 months ago

@quentingodeau I see what you mean. But given how important support for this syntax is, would it be ok to accept this change and document the limitations with regards to the scopes? This change should not affect existing functionality, only add some more functionality (which is slightly limited with regards to the scopes, but better than nothing, IMHO).

quentingodeau commented 3 months ago

Yes all abfss path are in fact mostly manage this way. I was just to point out the limitations of the syntax. I really think that we should adapt the way score are compute in the secret management. I have not enough time at the moment but if it is not done before I will try to take a look. Also at the end I think we should only keep one syntax (but that only my opinion) Regards

sugibuchi commented 3 months ago

@gdubya

I'm happy to try adding that, but I would like to do it as a separate PR

OK. I understand and agree to this.

sugibuchi commented 3 months ago

Also at the end I think we should only keep one syntax (but that only my opinion)

If we should keep only one syntax, please consider keeping the Hadoop compatible syntax.

For historical reasons, each ABFS filesystem implementation supports a different set of ABFS URL syntax.

Apache Hadoop's hadoop-azure package link

Python adlfs for fsspec link

Rust object_store::azure link

Apache Arrow link

DuckDB azure extension link

This inconsistency of supported ABFS URL syntax is a source of headaches, particularly when we combine different implementations of the ABFS file system (for example, extracting data from a massive table using Spark, then interactively analysing it with Polars, etc.).

Fortunately, the ABFS URL syntax defined by Hadoop's hadoop-azure (the original ABFS implementation) can be a safe choice when we combine different ABFS implementations as this syntax is widely supported by almost all ABFS implementations, except the DuckDB Azure extension.

To maximize interoperability with other frameworks/libraries, the support of the Hadoop-compatible syntax should be a high priority.

gdubya commented 3 months ago

@sugibuchi Thank you for the summary of the various options. It is quite a mess! I agree with the conclusion that the ABFS syntax should be the main priority.

Hopefully we can merge this PR now, and then any decisions made about futher changes can be handled in a different one?

samansmink commented 2 months ago

all green 🟢

samansmink commented 2 months ago

@sugibuchi your suggestion is now implemented

thanks @gdubya!