Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

AzureMachineLearningFileSystem does not work with user Entra authentication #37089

Open cecheta opened 2 months ago

cecheta commented 2 months ago

Describe the bug

When trying to access a data asset using Microsoft Entra authentication, AzureMachineLearningFileSystem is unable to authenticate when running locally.

To Reproduce

Steps to reproduce the behavior:

  1. Create Machine Learning workspace
    • Ensure that "Identity-based access" is enabled for the storage account
  2. Assign "Storage Blob Data Contributor" role for the storage account to the current user
  3. In Azure ML, create a data asset using the workspaceblobstore datastore
  4. Copy the Datastore URI
  5. Locally, run pip install azureml-fsspec azure-ai-ml
  6. Run the following script locally, replacing <DATASTORE_URI>
from azureml.fsspec import AzureMachineLearningFileSystem

file_system = AzureMachineLearningFileSystem("<DATASTORE_URI>")

print(file_system.ls("/"))
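
Before running the script, it can help to rule out a malformed URI as the cause. The following is a minimal sketch that splits a datastore URI into its parts. The pattern is inferred from the URI shape quoted elsewhere in this thread (azureml://subscriptions/.../resourcegroups/.../workspaces/.../datastores/.../paths/...), and parse_datastore_uri is a hypothetical helper, not part of azureml-fsspec.

```python
import re

# Assumed URI shape, inferred from the example URI in this thread.
# This is NOT the parser used by azureml-fsspec.
_URI_PATTERN = re.compile(
    r"^azureml://subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"(?:/paths/(?P<path>.*))?$",
    re.IGNORECASE,
)


def parse_datastore_uri(uri: str) -> dict:
    """Split a datastore URI into named parts, or raise ValueError."""
    match = _URI_PATTERN.match(uri)
    if match is None:
        raise ValueError(f"not a recognised azureml datastore URI: {uri!r}")
    # "path" is None when the URI has no /paths/ segment.
    return match.groupdict()
```

If the URI parses and the subscription, resource group, and workspace match the ones you expect, the URI itself is unlikely to be the problem and the failure is more likely on the authentication side.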

Expected behavior

The contents of the datastore should be printed.

Actual behaviour

A browser window is launched, prompting the user to log in. This happens twice, then the following error occurs:

Resolving access token for scope "https://storage.azure.com/.default" using identity of type "USER".
Resolving access token for scope "https://storage.azure.com/.default" using identity of type "USER".
Traceback (most recent call last):
  File "C:\Users\user\Downloads\main.py", line 7, in <module>
    file_system.ls()
  File "C:\Users\user\Downloads\.venv\Lib\site-packages\fsspec\asyn.py", line 118, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\Downloads\.venv\Lib\site-packages\fsspec\asyn.py", line 103, in sync
    raise return_result
  File "C:\Users\user\Downloads\.venv\Lib\site-packages\fsspec\asyn.py", line 56, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "C:\Users\user\Downloads\.venv\Lib\site-packages\azureml\fsspec\spec.py", line 428, in _ls
    _reclassify_rslex_error(e)
  File "C:\Users\user\Downloads\.venv\Lib\site-packages\azureml\dataprep\api\mltable\_validation_and_error_handler.py", line 90, in _reclassify_rslex_error
    raise err
  File "C:\Users\user\Downloads\.venv\Lib\site-packages\azureml\fsspec\spec.py", line 407, in _ls
    entrys = uri_accessor.list_directory(path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: [UriAccessor::list_directory] fails with error: PermissionDenied(Some(The authentication information was not provided in the correct format. Verify the value of Authorization header.))

Additional context

If you create a compute instance and launch VS Code within the compute instance, the script works in that environment, using the compute's managed identity.

github-actions[bot] commented 2 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Azure/azure-ml-sdk @azureml-github.

kristapratico commented 2 months ago

@cecheta thanks for your issue, the team will take a look and get back to you as soon as possible.

sharmuz commented 2 months ago

I'm experiencing the same issue when trying to read a datastore or data asset using azureml-fsspec / mltable from my local machine.

I'm using the code snippets Azure ML Studio provides; for reading a datastore:

import pandas as pd
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

uri = "azureml://subscriptions/<my_sub_id>/resourcegroups/<my_rg>/workspaces/<my_aml_ws>/datastores/my-datastore/paths/my-dir/some_data.csv"

df = pd.read_csv(uri)
df

and for reading a data asset:

import mltable
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())
data_asset = ml_client.data.get("my-asset", version="1")

path = {
  'folder': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])
df = tbl.to_pandas_dataframe()
df

The only difference is that I'm not creating my MLClient via from_config but by loading environment variables and passing them in. I don't believe this is the problem, since using the same MLClient object to create a datastore or data asset, or to submit a job, works without issue. Authentication still uses DefaultAzureCredential, which correctly picks up my az login token.

In either case, browser-based authentication is triggered when the read is attempted. If I comply, the read executes as expected. If I decline, it fails as @cecheta notes above. So it does not seem to be respecting my az login token.
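
One way to check that the az login session itself can issue a token for the scope shown in the log ("https://storage.azure.com/.default") is to request one directly and look at its claims. This is a debugging sketch only: it assumes the access token is a JWT (Entra tokens usually are, though clients are meant to treat them as opaque), it does not verify the signature, and it says nothing about which credential azureml-fsspec actually uses. The helper names are hypothetical.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the unverified claims of a JWT (for debugging only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url-encoded without padding; restore it.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def storage_token_claims() -> dict:
    """Request a storage-scoped token from the az login session and decode it.

    Requires the azure-identity package and a prior `az login`.
    """
    # Imported lazily so decode_jwt_claims works without azure-identity.
    from azure.identity import AzureCliCredential

    access_token = AzureCliCredential().get_token(
        "https://storage.azure.com/.default"
    )
    return decode_jwt_claims(access_token.token)
```

Calling storage_token_claims() and inspecting fields such as aud and exp shows whether the CLI login can produce a storage token at all. If it can, that supports the observation that the file system is not picking up the existing login.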

xngli commented 1 week ago

I'm getting the same error. Any updates on this issue @kristapratico ?

kristapratico commented 1 week ago

@azureml-github @FeiDeng friendly ping on this issue

tarockey commented 5 hours ago

I'd just like to pile on: I'm seeing the same behavior, except that now, when the browser prompt comes up, even if I accept, I get "StreamError(PermissionDenied(Some(The authentication information was not provided in the correct format. Verify the value of Authorization header.)))" when trying to do pd.read_csv(AZUREMLDATASTOREURI)