Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

Azure ML, Data output path auto-includes /INPUT_input_data/ #31734

Open BenHaanstra opened 11 months ago

BenHaanstra commented 11 months ago

Working on Azure ML Studio, using Python 3.10 SDK Kernel

Describe the bug My source folder "/input/data" contains 4 CSVs and my destination folder is "/output/data". Using command="cp -r ${{inputs.input_data}} ${{outputs.output_data}}", I end up with an additional folder in my output path, "/output/data/INPUT_input_data", rather than just the 4 CSVs. I tried various settings, but INPUT_input_data is always inserted somehow.

To Reproduce

  1. Using the tutorial for accessing and writing data, I wanted to test some folder-to-folder copy operations. For that I created an Azure storage container with hierarchical namespace enabled and created the folders input/data and output/data

  2. I used the heart classifier data, downloaded it from https://github.com/Azure/azureml-examples/tree/main/sdk/python/endpoints/batch/deploy-models/heart-classifier-mlflow/data and uploaded it to input/data

  3. Then I followed the access-data tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-read-write-data-v2?view=azureml-api-2&tabs=python#write-data-from-your-azure-machine-learning-job-to-azure-storage and changed it to a folder-to-folder scenario, as in the code below

from azure.ai.ml import command, Input, Output, MLClient
from azure.ai.ml.constants import AssetTypes, InputOutputModes
from azure.identity import DefaultAzureCredential

# Set your subscription, resource group, workspace and compute target:
subscription_id = "<SUBSCRIPTION_ID>"
resource_group = "<RESOURCE_GROUP>"
workspace = "<AML_WORKSPACE_NAME>"
compute_target = "<COMPUTE_NAME>"  # compute cluster/instance the job runs on

# connect to the AzureML workspace
ml_client = MLClient(
    DefaultAzureCredential(), subscription_id, resource_group, workspace
)
input_path = "abfss://***@***.dfs.core.windows.net/input/data"
output_path = "abfss://***@***.dfs.core.windows.net/output/data"

data_type_in = AssetTypes.URI_FOLDER
data_type_out = AssetTypes.URI_FOLDER
input_mode = InputOutputModes.RO_MOUNT
output_mode = InputOutputModes.RW_MOUNT

# Set the input and output for the job:
inputs = {
    "input_data": Input(type=data_type_in, path=input_path, mode=input_mode)
}

outputs = {
    "output_data": Output(type=data_type_out, path=output_path, mode=output_mode)
}

# This command job copies the data to your default Datastore
job = command(
    command="cp -r ${{inputs.input_data}} ${{outputs.output_data}}", # folder > folder
    inputs=inputs,
    outputs=outputs,
    environment="azureml://registries/azureml/environments/sklearn-1.1/versions/4",
    compute=compute_target,
)

# Submit the command
ml_client.jobs.create_or_update(job)

  4. The Azure storage container then contains a folder "/output/data/INPUT_input_data" with the heart classifier content in it, even though I do not specify anywhere that it should use INPUT_input_data.
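A note on what is likely happening (my assumption, not confirmed by the AzureML team): the input appears to be mounted into the job container under a directory named after the input, e.g. INPUT_input_data, so `${{inputs.input_data}}` resolves to a directory path and `cp -r <dir> <dst>` copies the directory itself into the destination. The same behavior reproduces locally without Azure:

```shell
#!/bin/sh
# Sketch reproducing the behavior locally. Assumption: the AzureML input
# mount resolves to a directory named INPUT_input_data.
mkdir -p /tmp/demo/INPUT_input_data /tmp/demo/out
touch /tmp/demo/INPUT_input_data/a.csv

# cp -r <dir> <existing dst> copies the directory itself, not its contents:
cp -r /tmp/demo/INPUT_input_data /tmp/demo/out

ls /tmp/demo/out   # -> INPUT_input_data (the extra folder from the bug report)
```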

Expected behavior I simply wanted the 4 CSVs in /output/data, with no extra folder added, just like in https://github.com/Azure/azureml-examples/tree/main/sdk/python/endpoints/batch/deploy-models/heart-classifier-mlflow/data

Additional context I tried about 20 different variants and also tried to write the history out to a dummy file, but that turned out to be empty.
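A possible workaround (an assumption on my part, not verified on AzureML): appending a trailing `/.` to the input reference makes `cp` copy the folder's contents rather than the folder itself, i.e. command="cp -r ${{inputs.input_data}}/. ${{outputs.output_data}}". A local sketch of that variant:

```shell
#!/bin/sh
# Hedged workaround sketch: a trailing "/." makes cp copy the directory's
# CONTENTS into the destination instead of the directory itself.
# In the job the command would read (untested assumption):
#   cp -r ${{inputs.input_data}}/. ${{outputs.output_data}}
mkdir -p /tmp/fixdemo/INPUT_input_data /tmp/fixdemo/out
touch /tmp/fixdemo/INPUT_input_data/a.csv

cp -r /tmp/fixdemo/INPUT_input_data/. /tmp/fixdemo/out

ls /tmp/fixdemo/out   # -> a.csv (no extra INPUT_input_data folder)
```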

github-actions[bot] commented 11 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.

kristapratico commented 11 months ago

Hi @BenHaanstra thanks for your feedback, @azureml-github please help take a look.