fsspec / adlfs

fsspec-compatible Azure Data Lake and Azure Blob Storage access
BSD 3-Clause "New" or "Revised" License

DefaultAzureCredential() not being used with anon=False keyword passed in storage_options #411

Open Seb-Unit8 opened 1 year ago

Seb-Unit8 commented 1 year ago

Versions

Summary:

Hello,

I am not experiencing the expected behaviour introduced in #262 and documented in the project's README > Details > Setting credentials > 2: "2. Auto credential solving using Azure's DefaultAzureCredential() library: storage_options={'account_name': ACCOUNT_NAME, 'anon': False} will use DefaultAzureCredential to get valid credentials to the container ACCOUNT_NAME. DefaultAzureCredential attempts to authenticate via the mechanisms and order visualized here."
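For reference, my reading of that section is that a call along these lines should authenticate through DefaultAzureCredential (the account, container, and path below are placeholders, and dask is just one consumer of storage_options):

import dask.dataframe as dd

# Placeholder names; per the README, anon=False should make adlfs fall back to
# DefaultAzureCredential instead of anonymous access.
storage_options = {"account_name": "<redacted>", "anon": False}
ddf = dd.read_parquet("az://<container>/<subpath>/", storage_options=storage_options)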

The following code snippet outputs the expected return of the containers list:

from azure.storage.blob import BlobServiceClient
from azure.identity import DefaultAzureCredential

name: str = "<redacted>"
# Lists the account's containers using the VM's managed identity via DefaultAzureCredential.
print([a for a in BlobServiceClient(f"https://{name}.blob.core.windows.net/", DefaultAzureCredential()).list_containers()])

This verifies that the managed identity for this VM has the right permissions (Storage Blob Data Contributor).

However, the following code

import fsspec

container: str = "<redacted>"
subpath: str = "<redacted>"
fallback_options = {"account_name": f"{name}", "anon": False}
fsspec.filesystem("az", storage_options=fallback_options)
fsspec.get_mapper(f"az://{container}/{subpath}", storage_options=fallback_options)

run in the same environment throws the error: ValueError: unable to connect to account for Must provide either a connection_string or account_name with credentials!!

Is anyone able to identify why the DefaultAzureCredential fallback is not being triggered even though I have specified the anon=False keyword?

Thanks for any help.

charmoniumQ commented 1 year ago

Here is a workaround that works for me:

import adlfs
import azure.identity.aio

# Pass an explicit (async) DefaultAzureCredential instead of relying on anon=False.
abfs = adlfs.AzureBlobFileSystem(account_name=account_name, credential=azure.identity.aio.DefaultAzureCredential())
abfs.ls(container_name + "/" + subpath)

mikwieczorek commented 1 year ago

I encountered the same problem when running code that uses adlfs on a Compute Instance (CI) in AzureML with a user-assigned managed identity. The identity has the correct permissions, which I can confirm by running:

az login --identity --username xxx
az storage blob list --account-name SANAME --container-name MYCONTAINER --output table

However, it seems that the automatic credential resolution picks up the system-assigned identity instead of the user-assigned identity attached to the CI. Looking at the DefaultAzureCredential resolution order, the managed identity should be resolved correctly, but it is not.

It seems that a CI always has a system-assigned identity (?), which may take precedence over the user-assigned identity. Digging into the Azure Identity Python SDK, it looks like setting a single environment variable should be enough, and it indeed works:

import os
import dask.dataframe as dd

# Point DefaultAzureCredential's managed-identity step at the user-assigned identity.
os.environ['AZURE_CLIENT_ID'] = 'xxx'
storage_options = {'account_name': SANAME, 'anon': False}
ddf = dd.read_parquet('az://MYCONTAINER/*.csv', storage_options=storage_options)

What would be nice for adlfs is an option to pass two arguments in storage_options, namely storage_options = {'account_name': SANAME, 'client_id': 'xxx'}, so that the passed client_id is used to fetch credentials. Currently, such a combination results in the error: ValueError: secret should be an Azure Active Directory application's client secret
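
In the meantime, a workaround in the spirit of charmoniumQ's snippet above might be to construct the filesystem directly and pass an explicit credential bound to the user-assigned identity. This is just a sketch (untested here; the account name and client ID are placeholders):

import adlfs
from azure.identity.aio import ManagedIdentityCredential

# Untested sketch: pin the credential to the user-assigned identity's client ID
# instead of relying on the AZURE_CLIENT_ID environment variable.
credential = ManagedIdentityCredential(client_id="xxx")
fs = adlfs.AzureBlobFileSystem(account_name="SANAME", credential=credential)
print(fs.ls("MYCONTAINER"))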