Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

[Azure ML SDK v2 - AzureML fsspec] Accessing data from azure cloud storage using URI format makes azure-pipelines run indefinitely #28920

Closed lucaseckes closed 1 year ago

lucaseckes commented 1 year ago

Describe the bug I need to access data from azure storage during inference time (i.e. download Azure ML v2 data assets). I am following the steps from this documentation: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls#access-data-from-a-datastore-uri-like-a-filesystem-preview

I am using the method read_pickle of pandas to access the data using its URI because the data is stored as a pickle file.

The code is running perfectly fine in a matter of seconds on my compute instance. However, when used inside azure-pipelines, the pipeline runs until timeout (set to 1h).

When I look at the logs of the azure pipelines I get this message:

test.py::test_data_asset Failed to construct auth object. Exception : AuthenticationError, AML Version: 1.49.0, DataPrep Version: 4.8.6
Message: rslex failed, falling back to clex.
Payload: {"pid": 14584, "rslex_version": "2.15.2", "version": "4.8.6"}
Failed to construct auth object. Exception : AuthenticationError, AML Version: 1.49.0, DataPrep Version: 4.8.6

To Reproduce I created a minimal example to reproduce the issue:

requirements.txt

pandas==1.3.5
pydantic==1.10.5
azureml-fsspec==0.1.0b3
pytest>=7.1.3

config.py

from pydantic import BaseSettings

class Settings(BaseSettings):
    subscription_id: str = "###########"
    ressource_group: str = "###########"
    workspace: str = "###########"
    datastore_name: str = "###########"
    storage_path: str = "###########"

settings = Settings()
DATA_PATH = (
    f"azureml://subscriptions/{settings.subscription_id}/resourcegroups/"
    f"{settings.ressource_group}/workspaces/{settings.workspace}/datastores/"
    f"{settings.datastore_name}/paths/{settings.storage_path}"
)
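
As an aside, a masked or empty setting here still produces a syntactically plausible URI that only fails much later inside the pipeline. A small validation helper (hypothetical, not part of any SDK) could fail fast instead; `parse_datastore_uri` and its regex are illustrative assumptions based on the URI shape used above:

```python
import re

# Hypothetical helper (not part of the Azure SDK): checks that a constructed
# azureml:// datastore URI has all five path components filled in, so a
# masked or missing setting fails fast instead of hanging downstream.
_AZUREML_URI = re.compile(
    r"^azureml://subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"/paths/(?P<path>.+)$"
)

def parse_datastore_uri(uri: str) -> dict:
    """Return the URI components, or raise ValueError if the URI is malformed."""
    match = _AZUREML_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid azureml datastore URI: {uri!r}")
    return match.groupdict()
```

For example, `parse_datastore_uri("azureml://subscriptions/123/resourcegroups/rg/workspaces/ws/datastores/ds/paths/a/b.pkl")` returns a dict whose `"path"` entry is `"a/b.pkl"`, while a URI with a missing component raises `ValueError` before any network call is attempted.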

azure-pipelines.yml

trigger:
- main

stages:
- stage: test
  displayName: Test
  pool: 'Build Agents'
  jobs:
  - job: test
    displayName: Test
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.7'
    - script: |
        pip install --upgrade pip
      displayName: 'Update pip'
    - script: |
        pip install -r requirements.txt
      displayName: 'Install development dependencies'
    - script: |
        chmod +x run_tests.sh
      displayName: 'Make test script executable'
    - script: |
        ./run_tests.sh
      displayName: 'Run the tests'

test.py

import pandas as pd

from config import DATA_PATH

def test_data_asset():
    data_asset = pd.read_pickle(DATA_PATH)

    assert "name" in data_asset.columns

run_tests.sh

#!/bin/sh
set -eu

pytest -s -vv test.py

Additional context This related issue is about the best practice to download data assets locally: https://github.com/Azure/azure-sdk-for-python/issues/26213

ghost commented 1 year ago

Thank you for your feedback. This has been routed to the support team for assistance.

nthandeMS commented 1 year ago

@vipinnair22, can you investigate?

vipinnair22 commented 1 year ago

@ChunyuMSFT, @shuyums2, @QianqianNie could you please help with this?

FeiDeng commented 1 year ago

It seems the difference is how you set up authentication. @lucaseckes Can you share more details about how the identity is set up on your compute instance vs. the pipeline?

Thanks.

lucaseckes commented 1 year ago

The pipeline agent is a self-hosted agent. It seems from the documentation that the agent uses a Personal Access Token (PAT) to authenticate. For my compute instance, I use managed identities with Azure AD to download data from Azure storage.

Does that answer your question, @FeiDeng? This is all I know about the authentication setup. If you need more information, let me know where to find it.

Thanks.

FeiDeng commented 1 year ago

@lucaseckes, thank you for the update. We first want to make sure the pipeline agent's identity can access Azure storage. In theory, if it can access the storage, the URI should work as well.
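
One way to check this from the agent itself, without installing any Azure SDK, is to probe the Instance Metadata Service (IMDS) managed-identity token endpoint with the standard library. This is a sketch under the assumption that the agent runs on an Azure VM with a managed identity assigned; a successful token response only proves a token can be issued for the storage resource, not that the role assignments on the storage account are sufficient:

```python
import json
import urllib.parse
import urllib.request

# Azure IMDS endpoint that issues managed-identity tokens (documented,
# fixed address reachable only from inside an Azure VM).
IMDS_TOKEN_URL = "http://169.254.169.254/metadata/identity/oauth2/token"

def build_imds_request(resource: str = "https://storage.azure.com/") -> urllib.request.Request:
    """Build the IMDS token request for the given resource."""
    query = urllib.parse.urlencode({"api-version": "2018-02-01", "resource": resource})
    return urllib.request.Request(
        f"{IMDS_TOKEN_URL}?{query}",
        headers={"Metadata": "true"},  # header required by IMDS
    )

if __name__ == "__main__":
    # Only meaningful on an Azure VM: a 400 or timeout here means no
    # managed identity is available to the process.
    with urllib.request.urlopen(build_imds_request(), timeout=5) as resp:
        token = json.load(resp)
    print("token issued for resource:", token["resource"])
```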

lucaseckes commented 1 year ago

Before with the Azure ML SDK v1, I was using the method get_by_name of the Dataset class: https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#azureml-core-dataset-dataset-get-by-name

It worked perfectly fine inside Azure Pipelines, so I assume the pipeline agent has access to Azure storage.

If it doesn't have access to Azure storage, can you please tell me how to grant it? Thanks.

FeiDeng commented 1 year ago

Hold on, I need to check with the Pipeline team.

FeiDeng commented 1 year ago

@lucaseckes, by the way, do you have a run ID for the failed runs? We want to check some logs.

lucaseckes commented 1 year ago

Unfortunately, I can't give you the run ID for privacy reasons. However, you should be able to reproduce the issue with the example I gave.

FeiDeng commented 1 year ago

I think we mask privacy-related data in the logs, and the run ID is an auto-generated GUID, so it should be OK to share. The reason I ask for your run ID is that this is most likely related to how the credential is set up, and we can't reproduce that part.

lucaseckes commented 1 year ago

Hi @FeiDeng, these are the logs of the failed pipeline with the minimal example. I hope this will help you resolve the issue: logs_5370.zip

FeiDeng commented 1 year ago

Checked. I think we may need more logs to investigate this further.

lucaseckes commented 1 year ago

Hi @FeiDeng, what kind of logs do you need?

I can rerun the pipeline and send you the logs, but I expect they will be the same.

lucaseckes commented 1 year ago

Do you have any updates on this issue?

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github, @Azure/azure-ml-sdk.

Issue Details
Author: lucaseckes
Assignees: luigiw
Labels: `question`, `Machine Learning`, `Service Attention`, `customer-reported`, `needs-team-attention`, `CXP Attention`
Milestone: -
FeiDeng commented 1 year ago

We need the run ID or session ID. That is the only way we can check our backend logs.

fdroessler commented 1 year ago

Picking this up together with Lucas.

Are there any details on how fsspec handles auth? When I run on an AML compute instance and use fsspec I get the following:

>> fs.ls()
Warning: Falling back to use azure cli login credentials.
If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.

My understanding is that MsiAuthentication is v1? Shouldn't it be DefaultAzureCredential or ManagedIdentityCredential for v2? It seems the job in our case runs forever because it falls back to the interactive login option. However, on the same worker pool, v1 auth with the managed identity works flawlessly.

In either case, I am not sure how I can "force" the use of ManagedIdentity in a non-interactive job. Is there any documentation on how azureml-fsspec handles the auth flow in the background? It does not seem to use https://github.com/fsspec/adlfs but rather the data-prep package? Some details here would be helpful.

github-actions[bot] commented 1 year ago

Hi @lucaseckes, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

FeiDeng commented 1 year ago

@fdroessler fsspec checks the datastore credential first. It won't use interactive login if credentials are provided in the datastore setup. If it is a credential-less datastore, interactive login is forced.
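
That fallback order can be sketched as a simple two-step chain (illustrative only; this is not the SDK's actual implementation, and both names below are made up for the sketch):

```python
from typing import Callable, Optional

def resolve_credential(
    datastore_credential: Optional[str],
    interactive_login: Callable[[], str],
) -> str:
    """Model of the described flow: prefer the credential stored on the
    datastore (e.g. an account key or SAS); only when the datastore is
    credential-less fall back to interactive (device-code) login."""
    if datastore_credential is not None:
        return datastore_credential
    return interactive_login()
```

With a credential on the datastore, `resolve_credential("account-key", prompt)` never calls `prompt`; with `None` it does, which matches the hang the reporters see when the fallback lands on a device-code prompt inside a non-interactive pipeline job.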

fdroessler commented 1 year ago

@FeiDeng could you elaborate on this? The managed identity on the Azure DevOps agent has access to the datastore/dataset. This works without any problems with the v1 SDK. "Just" changing the SDK to v2 and using fsspec results in the interactive login. How would the environment on an Azure DevOps worker need to be set up so that the managed identity is picked up? I can't find any details on the auth flow on this page (https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?tabs=adls&view=azureml-api-2#access-data-from-a-datastore-uri-like-a-filesystem-preview)

FeiDeng commented 1 year ago

@fdroessler, just to confirm: your identity has access to the workspace, right? And how is the datastore connection set up in the workspace? Which credentials is it using?

fdroessler commented 1 year ago

@FeiDeng yep, that is how it worked with the v1 SDK, so it must have access.

- Allowed workspace managed identity access: Yes
- Authentication type: Account key
FeiDeng commented 1 year ago

That is very interesting. I am trying to reproduce this case. Which type of datastore is used here? Also, would you mind sharing the run ID or session ID?

FeiDeng commented 1 year ago

Also, which versions of fsspec and azureml-core are installed?

fdroessler commented 1 year ago

OK, I have the following setup that might help you reproduce. In the pipeline below, stage Testv1 runs through without any issues, while Testv2 ends up with the interactive login prompt: To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXXXXX to authenticate.

Datastore type: Azure Blob Storage

Please share an e-mail address to which I can send the run and session IDs.

content of test_v1.py:

from azureml.core import Dataset, Workspace
from azureml.core.authentication import MsiAuthentication
import pandas as pd
from config import settings

msi_auth = MsiAuthentication()

workspace = Workspace(
    subscription_id=settings.subscription_id,
    resource_group=settings.resource_group,
    workspace_name=settings.workspace,
    auth=msi_auth,
)
dataset = Dataset.get_by_name(workspace, name="test_v1", version="1")
dataset.download(target_path="./test", overwrite=True)
data = pd.read_csv("./test/test.csv")
assert "test" in data.columns

content of requirements_v1.txt:

azureml-core==1.48.0
azureml-pipeline==1.48.0
pandas==1.3.5
pydantic==1.10.5

content of test_v2.py:

import pandas as pd
from config import DATA_PATH

# DATA_PATH points to the v2 azureml:// uri of the same file as above
# DATA_PATH=(
#    f"azureml://subscriptions/{settings.subscription_id}/resourcegroups/"
#     f"{settings.resource_group}/workspaces/{settings.workspace}/datastores/"
#     f"{settings.datastore_name}/paths/flavor-optimisation/ingredients-data/"
#     f"{settings.dataset_modified_date}/test.pkl.gz"
# )
data_asset = pd.read_pickle(DATA_PATH)
assert "test" in data_asset.columns

content of requirements_v2.txt:

pandas==1.3.5
pydantic==1.10.5
azureml-fsspec==0.1.0b3
pytest>=7.1.3

content of azure-pipelines.yml:

trigger:
- main

stages:
- stage: testv1
  displayName: Testv1
  pool: 'Build Agents'
  jobs:
  - job: testv1
    displayName: TestV1
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.9'
    - script: |
        pip install --upgrade pip
      displayName: 'Update pip'
    - script: |
        pip install -r requirements_v1.txt
      displayName: 'Install development dependencies'
    - script: |
        python test_v1.py
      displayName: 'Run the tests'

- stage: testv2
  displayName: Testv2
  pool: 'Build Agents'
  jobs:
  - job: testv2
    displayName: Testv2
    steps:
    - task: UsePythonVersion@0
      inputs:
        versionSpec: '3.9'
    - script: |
        pip install --upgrade pip
      displayName: 'Update pip'
    - script: |
        pip install -r requirements_v2.txt
      displayName: 'Install development dependencies'
    - script: |
        python test_v2.py
      displayName: 'Run the tests'
fdroessler commented 1 year ago

@FeiDeng any luck reproducing the above case on your end?

FeiDeng commented 1 year ago

Thanks for the details. It looks like this support is missing in fsspec. We will add it soon. Thank you.

fdroessler commented 1 year ago

@FeiDeng any news on this? Can we keep this open?

FeiDeng commented 1 year ago

Still actively working on this change. I will let you know when it is released.