delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0

Python: Support Managed System Identity as an Azure authentication method #662

Closed kk921dbg closed 7 months ago

kk921dbg commented 2 years ago

Description

Could you please look into supporting Managed System Identity (MSI) authentication on Azure? For enterprises it is much more secure than a Service Principal or Account Key, because no passwords need to be stored.

Use Case

Connect to ADLS Gen2 from an Azure App Service that has an MSI, where the MSI has an RBAC role or ACLs on the data lake and firewall permissions to connect to the storage account. We would like to use a ManagedIdentityCredential to authenticate as the MSI, like below:

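    from azure.identity import ChainedTokenCredential, ManagedIdentityCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    managed_identity = ManagedIdentityCredential()
    credential_chain = ChainedTokenCredential(managed_identity)
    client = DataLakeServiceClient(STORAGE_ACCOUNT_URL, credential=credential_chain)
    file_system_client = client.get_file_system_client("storageaccount")

Related Issue(s)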

roeap commented 2 years ago

Hi @kk921dbg - providing support for MSI auth is actually quite straightforward (or at least should be). However, we would not be able to let you create an account client on the Python side, as the client for reading and writing the log is created on the Rust side.

Configuration would need to be provided via the storage_options that can be passed to the DeltaTable. There is one thing for us to figure out though. AFAIK you can provide a client_id, object_id, or msi_resource_id to select the identity you want to use for MSI auth. In the case of the client id we need to disambiguate between client (service principal) auth and MSI auth, but that should be doable.
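A minimal sketch of what such a configuration could look like from Python. The helper name `msi_storage_options` is hypothetical, and the option keys (`azure_storage_account_name`, `azure_client_id`, `azure_object_id`, `azure_msi_resource_id`) are assumed from the configuration keys of the upstream `object_store` crate; check them against the deltalake version you are running. How a bare client id is told apart from service principal auth is left to the library.

```python
from typing import Optional


def msi_storage_options(
    account_name: str,
    client_id: Optional[str] = None,
    object_id: Optional[str] = None,
    msi_resource_id: Optional[str] = None,
) -> dict:
    """Build a storage_options dict that selects at most one managed identity.

    The key names are assumptions (see the note above), not confirmed API.
    """
    selectors = {
        "azure_client_id": client_id,
        "azure_object_id": object_id,
        "azure_msi_resource_id": msi_resource_id,
    }
    chosen = {key: value for key, value in selectors.items() if value is not None}
    if len(chosen) > 1:
        raise ValueError("pass only one of client_id, object_id, msi_resource_id")
    # With no selector, the identity assigned to the host would be expected to apply.
    return {"azure_storage_account_name": account_name, **chosen}


# Usage (requires the third-party deltalake package; placeholders are not real values):
# from deltalake import DeltaTable
# options = msi_storage_options("<account>", object_id="<object-id>")
# table = DeltaTable(
#     "abfss://<container>@<account>.dfs.core.windows.net/<path>",
#     storage_options=options,
# )
```

Rejecting more than one selector keeps the choice of identity unambiguous on the caller's side.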

We also allow you to pass a file system object to some of the reading / writing operations on the python side, but that only relates to the parquet files in the table, not the log itself.

FYI - wrapping the credential in a chained credential as in your example is superfluous, if you are only providing a single credential. The chained credential is just a wrapper that iterates through the passed credentials until it finds one that works.
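To illustrate the point about chaining, here is a rough, library-free sketch of what a chained credential does. All names (`CredentialUnavailable`, `first_working_token`, the two example credentials) are hypothetical and only mimic the behaviour described above; this is not the azure-identity implementation.

```python
from typing import Callable, Sequence


class CredentialUnavailable(Exception):
    """Raised by a (hypothetical) credential that cannot produce a token."""


def first_working_token(credentials: Sequence[Callable[[], str]]) -> str:
    """Try each credential in order and return the first token obtained."""
    errors = []
    for get_token in credentials:
        try:
            return get_token()
        except CredentialUnavailable as exc:
            # Remember why this credential failed, then try the next one.
            errors.append(str(exc))
    raise CredentialUnavailable("; ".join(errors) or "no credentials supplied")


def unavailable_managed_identity() -> str:
    # Stand-in for a credential that fails, e.g. when run outside Azure.
    raise CredentialUnavailable("managed identity endpoint not reachable")


def developer_login() -> str:
    # Stand-in for a credential that succeeds.
    return "token-from-developer-login"
```

With a single entry in the sequence, the loop simply calls that one credential, which is why the wrapper adds nothing in that case.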

I can have a look into supporting this, but I'll need some time, as there are currently a few things on my plate :).

kk921dbg commented 2 years ago

@roeap This would be most secure and compliant for our automated process that we use in our large company

johnayoub commented 1 year ago

@roeap Any update on when managed Identity could be supported? It's pretty much the standard nowadays.

roeap commented 1 year ago

@johnayoub - could you describe your deployment scenario a bit more? In my context we found managed identities to be not so useful, since they are tied to VMs and thus not usable for containerized / cloud-native workloads. Depending on your scenario, though, there might be a way to make that work today.

That said, I am hoping to contribute both managed identity auth and workload identities upstream within Q1-23. They will then become available here as well.

johnayoub commented 1 year ago

@roeap We are planning on connecting to an Azure Storage account via a python function app to convert some small delta table datasets to excel/csv. For us connecting via the function managed identity would be best since we don't have to provision and maintain credentials for a service principal.

Glad to hear that this is planned :)

roeap commented 1 year ago

One possibility that would work right now, is to use the managed identity to acquire a SAS key and use that to authorize the deltalake package. Obviously not ideal, but would work today :). Here you would have to consider the lifetime of the function. i.e. do you expect it to be always-on or would it scale to zero and up again after a while.

If the function runs infrequently, you could also just get a token on the python side and pass that along.
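A rough sketch of the token pass-through, under stated assumptions: the helper `bearer_token_options` is hypothetical, the `bearer_token` and `azure_storage_account_name` keys are assumed from the upstream `object_store` crate and may not be accepted by every deltalake release, and the credential is any object with an azure-identity style `get_token(scope)` method. The token is fetched once and not refreshed, which is why this only suits short-lived invocations.

```python
# Scope for Azure Storage when requesting an AAD token via azure-identity.
STORAGE_SCOPE = "https://storage.azure.com/.default"


def bearer_token_options(credential, account_name: str) -> dict:
    """Fetch one access token and wrap it in a storage_options dict.

    `credential` is expected to behave like an azure-identity credential,
    i.e. get_token(scope) returns an object with a `.token` attribute.
    The option keys are assumptions (see the note above).
    """
    access_token = credential.get_token(STORAGE_SCOPE)
    return {
        "azure_storage_account_name": account_name,
        "bearer_token": access_token.token,
    }


# Usage (requires the third-party azure-identity and deltalake packages):
# from azure.identity import ManagedIdentityCredential
# from deltalake import DeltaTable
# options = bearer_token_options(ManagedIdentityCredential(), "<account>")
# table = DeltaTable(
#     "abfss://<container>@<account>.dfs.core.windows.net/<path>",
#     storage_options=options,
# )
```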

Just as a stop-gap until more azure authentications are supported ...

johnayoub commented 1 year ago

@roeap The function would run on demand, most likely via an HTTP trigger or a queue trigger. It would be very short-lived, less than a minute I would imagine for each invocation. Integrating with the Azure.Identity package would be ideal since it manages the token caching and renewal behind the scenes.

When you say I can get a token and pass it, do you mean I can pass an AAD access token? If so, how can I do that?

Happy new year 🎉

roeap commented 1 year ago

@kk921dbg @johnayoub - sorry for not getting back to this issue sooner... we have been supporting managed and workload identities for a while now. Are you able to use that with the latest releases?

sugibuchi commented 1 year ago

@roeap We tried to use MSI-based authentication in an AKS cluster with AAD Pod Identity enabled, but it failed with 403 errors from the IMDS token endpoint.

The error appears to be caused by a wrong query parameter sent to the IMDS endpoint. Could you please look at the following issue?

https://github.com/apache/arrow-rs/issues/4096
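For context, a sketch of how an IMDS token request URL is typically assembled, based on the publicly documented endpoint and query parameters (`api-version`, `resource`, and optionally `client_id` or `object_id`). The helper `imds_token_url` is hypothetical, and the sketch does not reproduce the request delta-rs actually sends or the specific parameter discussed in the linked issue.

```python
from typing import Optional
from urllib.parse import urlencode

# Documented Azure Instance Metadata Service token endpoint.
IMDS_TOKEN_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"
# IMDS requires this header on every request.
IMDS_HEADERS = {"Metadata": "true"}


def imds_token_url(
    resource: str = "https://storage.azure.com/",
    client_id: Optional[str] = None,
    object_id: Optional[str] = None,
    api_version: str = "2018-02-01",
) -> str:
    """Build the token request URL, selecting at most one identity."""
    if client_id is not None and object_id is not None:
        raise ValueError("pass only one of client_id or object_id")
    params = {"api-version": api_version, "resource": resource}
    if client_id is not None:
        params["client_id"] = client_id
    if object_id is not None:
        params["object_id"] = object_id
    # urlencode percent-encodes the resource URI so it is safe in a query string.
    return f"{IMDS_TOKEN_ENDPOINT}?{urlencode(params)}"
```

Comparing the query string a client sends against a URL built this way is one way to spot a misnamed parameter.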

roeap commented 1 year ago

Thanks for the context @sugibuchi. I'll look into it.

rickyschools commented 1 year ago

This is a feature I would desire to use. It's recommended by Azure to use identities for authentication over any form of key based authentication. This azure library internally provides a way to go from leveraging managed identities to developer credentials, without the end user being any the wiser.

https://pypi.org/project/azure-identity/

Edit: I was re-reading the comments and saw that this operation needs to happen on the Rust side, not the Python side. Apologies for skimming over that detail.

There seems to be an available rust crate for azure-identity api wrappers. It's worth noting that this isn't officially supported by the Azure SDK team.

https://crates.io/crates/azure_identity

roeap commented 7 months ago

closing this, as it has been fixed upstream and should work now with delta-rs.

Feel free to re-open if the error persists.