duckdb / duckdb_azure

Azure extension for DuckDB
MIT License
51 stars 17 forks source link

Add support for azure credential chain #16

Closed samansmink closed 1 year ago

samansmink commented 1 year ago

This PR adds some improvements to how to authenticate in the Azure extension.

This is the first step towards implementing the feature request in https://github.com/duckdblabs/duckdb_azure/issues/6.

Interface changes

The complete set of parameters for the extension is now:

To authenticate the requests, the extension does, in this order:

  1. check if the connection string is set, if so, use that.
  2. check if the account name is set. If so, it:
    • checks the azure_credential_chain for which credential providers to check in order
    • the default is 'none' which means no authentication will be used.
    • the user can override the chain by specifying in a ; separated string one or more of the following: cli, env, managed_identity, default or none. Checkout the Azure docs on how to use them (https://github.com/Azure/azure-sdk-for-cpp/blob/main/sdk/identity/azure-identity/README.md#environment-variables). For example to use the full DefaultAzureCredential chain, set it to 'default'
    • the endpoint defaults to blob.core.windows.net but can be overridden through azure_endpoint

Examples

These changes simplify some things:

Public containers:

For fully public containers, reading and listing is now much cleaner not needing to set :

set azure_account_name='<azure storage account name>';
set azure_credential_chain = 'none';
SELECT * FROM 'azure://<container>/<path>/<to>/<file>.parquet';
SELECT * FROM 'azure://<container>/<path>/<to>/*.parquet';

CLI

Authentication using the cli can now work as follows:

Firstly, using the azure cli log in to your azure account using:

az login

This will forward you to a browser and ask you to login.

Now from duckdb, run:

set azure_account_name='<azure storage account name>';
set azure_credential_chain = 'cli';
SELECT * FROM 'azure://<container>/<path>/<to>/<file>.parquet';

TODO

Much more testing. The CLI authentication is tested but only locally, not in CI atm. The few other credential providers that are included in this PR (env, managed_identity), they are fully untested.

However, I feel like this is quite finicky to test properly test anyway, due to the different types of environments that you need to setup to test this. Since its only a little amount of code on DuckDB side with just a few lines, and the CLI log-in working very well, I think pushing this into the nightly repo allowing people to try it out on 0.9.1 is a nice way forward. Also to try out the whole workflow of deploying nightly releases for extensions allowing people to use a stable duckdb + an unstable/nightly extension.

Of course the next step is to add more testing infrastructure:

samansmink commented 1 year ago

Thanks for the review @carlopi! There is no requirement on httpfs! the azure sdk ships its own http stuff.

The account name is a "storage account" name which is part of the url for the objects: https://<account_name>.blob.core.windows.net/<container_name>/file.ext so that is required afaik!

will check out the rest tomorrow!

carlopi commented 1 year ago

Another question: would it not be nice to offer a load_azure_credentials('my_name', 'cli') option?

samansmink commented 1 year ago

@carlopi That's a good point. Indeed we can pass a profile name to the cli credential provider. Currently it just picks the default.

For this PR i would argue this is good enough. We can consider improving this in a follow up

In the grand scheme of things: this comes down to the major question of proper API for table scans with credentials. What we would like to support is to basically be able to create and switch between multiple sets of credentials