Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

[Azure ML SDK v2] File is not written to output azureml datastore #27454

Open mbecker opened 1 year ago

mbecker commented 1 year ago

Describe the bug

The Azure ML datastore tfconfigs has multiple files in its base path.

For a pipeline job, the Azure ML datastore tfconfigs is defined as an output to write data to:

from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes

update_config_component = command(
    name="tf_config_update",
    display_name="Tensorflow configuration file update",
    description="Reads the pipeline configuration file from a specific model (directory), updates it with the params, and saves the new pipeline config file to the output directory",
    inputs=dict(
        config_dir=Input(type="uri_folder"),
        config_directory_name=Input(type="string"),
        images_dir=Input(type="uri_folder"),
        labelmap_path=Input(type="string"),
        fine_tune_checkpoint_type=Input(type="string"),
        fine_tune_checkpoint=Input(type="string"),
        train_record_path=Input(type="string"),
        test_record_path=Input(type="string"),
        num_classes=Input(type="integer"),
        batch_size=Input(type="integer"),
        num_steps=Input(type="integer"),
    ),
    outputs={
        "config_directory_output": Output(
            type=AssetTypes.URI_FOLDER,
            path=f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/tfconfigs/paths/",
        )
    },
    # The source folder of the component
    code=update_config_src_dir,
    command="""pwd && ls -la ${{outputs.config_directory_output}} && python update.py \
               --config_dir ${{inputs.config_dir}} \
               --config_directory_name ${{inputs.config_directory_name}} \
               --config_directory_output ${{outputs.config_directory_output}} \
               --images_dir ${{inputs.images_dir}} \
               --labelmap_path ${{inputs.labelmap_path}} \
               --fine_tune_checkpoint_type ${{inputs.fine_tune_checkpoint_type}} \
               --fine_tune_checkpoint ${{inputs.fine_tune_checkpoint}} \
               --train_record_path ${{inputs.train_record_path}} \
               --test_record_path ${{inputs.test_record_path}} \
               --num_classes ${{inputs.num_classes}} \
               --batch_size ${{inputs.batch_size}} \
               --num_steps ${{inputs.num_steps}} \
            """,
    environment="azureml://registries/azureml/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/43",
)

The output config_directory_output is mounted into the compute execution environment as follows:

/mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output

At the beginning of the Python script, the output directory is listed as follows:

print("Listing path / dir: ", args.config_directory_output)
arr = os.listdir(args.config_directory_output)
print(arr)

The directory does not include any files:

Listing path / dir:  /mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
[]

BUG: The Azure ML datastore tfconfigs mounted as an output already contains multiple manually uploaded files, but the mounted directory shows none of them.

At the end of the Python script, a config file is written to the mounted output and the directory is listed again as follows:

import os
import re

with open(pipeline_config_path, "r") as f:
    config = f.read()

with open(new_pipeline_config_path, "w") as f:

    # Set labelmap path
    config = re.sub('label_map_path: ".*?"',
                    'label_map_path: "{}"'.format(images_dir_labelmap_path), config)

    # Set fine_tune_checkpoint_type
    config = re.sub('fine_tune_checkpoint_type: ".*?"',
                    'fine_tune_checkpoint_type: "{}"'.format(args.fine_tune_checkpoint_type), config)

    # Set fine_tune_checkpoint path
    config = re.sub('fine_tune_checkpoint: ".*?"',
                    'fine_tune_checkpoint: "{}"'.format(args.fine_tune_checkpoint), config)

    # Set train tf-record file path
    config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/train)(.*?")',
                    'input_path: "{}"'.format(images_dir_train_record_path), config)

    # Set test tf-record file path
    config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/val)(.*?")',
                    'input_path: "{}"'.format(images_dir_test_record_path), config)

    # Set number of classes
    config = re.sub('num_classes: [0-9]+',
                    'num_classes: {}'.format(args.num_classes), config)

    # Set batch size
    config = re.sub('batch_size: [0-9]+',
                    'batch_size: {}'.format(args.batch_size), config)

    # Set training steps
    config = re.sub('num_steps: [0-9]+',
                    'num_steps: {}'.format(int(args.num_steps)), config)

    f.write(config)

# List directory again
print("Listing path / dir: ", args.config_directory_output)
arr = os.listdir(args.config_directory_output)
print(arr)

Listing the mounted output directory now shows:

Listing path / dir:  /mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
['ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8_steps125000_batch16.config']

BUG: The mounted output directory now contains the new file, but the Azure ML datastore does not show the newly written file in Azure Storage Explorer / the Azure portal GUI.

To Reproduce

Steps to reproduce the behavior:

  1. Create a new Azure ML datastore backed by a new container in the storage account
  2. Create a pipeline with a job whose output is the newly created Azure ML datastore
  3. Write a file to the output in a pipeline job
  4. Run the pipeline
  5. Confirm that the file is not created in the Azure ML Datastore / Azure Storage Blob Container

Expected behavior

Any file written to an output Azure ML datastore in a Python job should be written to the underlying Azure Storage blob container so it can be used later.

Additional context

Using the following tutorials as reference:

azure-sdk commented 1 year ago

Label prediction was below confidence level 0.6 for Model:ServiceLabels: 'App Configuration:0.13678811,Storage:0.08488751,Compute:0.07487632'

xiangyan99 commented 1 year ago

@azureml-github

mouhannadali commented 1 year ago

I have the same issue. No matter what path= I provide in the outputs, it always mounts the output to azureml://datastores/${{default_datastore}}/paths/azureml/${{name}}/${{output_name}}/

luigiw commented 1 year ago

This is not a supported scenario yet; we don't allow customized output paths. azure-ai-ml should probably raise the right error message in this scenario.

luigiw commented 1 year ago

@wangchao1230 What do you think of adding a validation in the Output class's constructor?
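A hypothetical sketch of what such a validation could look like (purely illustrative; this wrapper class and its error message are assumptions, not the actual azure-ai-ml implementation):

from azure.ai.ml import Output

class ValidatedOutput(Output):
    # Illustrative only: reject custom paths at component-definition time,
    # where they are currently ignored, instead of silently falling back to
    # the azureml://datastores/${{default_datastore}}/paths/... default.
    def __init__(self, *args, **kwargs):
        if kwargs.get("path"):
            raise ValueError(
                "Custom output paths are not supported in component "
                "definitions; set the path when consuming the component "
                "in a pipeline instead."
            )
        super().__init__(*args, **kwargs)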

zhengfeiwang commented 1 year ago

Specifying the output path when defining a component will not work; the default path azureml://datastores/${{default_datastore}}/paths/azureml/${{name}}/${{output_name}}/ is still used.

However, specifying the output path when consuming the component in a pipeline is supported, with code like below:

# in a pipeline
node = component(<component-args>)
node.outputs.output = Output(
    type="uri_folder", mode="rw_mount", path=custom_path
)

Please refer to our sample on this.
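For context, a fuller sketch of that supported pattern (the component and datastore names here are illustrative placeholders, not from the sample):

from azure.ai.ml import Output, dsl

# Assumed placeholders: my_component is any command component with a `data`
# input and a `result` output; tfconfigs is a registered datastore.
custom_path = "azureml://datastores/tfconfigs/paths/configs/"

@dsl.pipeline(description="Set a custom output path at consumption time")
def my_pipeline(pipeline_input):
    node = my_component(data=pipeline_input)
    # Overriding the output on the node -- not in the component definition --
    # binds it to the custom datastore path. If the output is promoted to
    # pipeline level by returning it, the path must be set on the pipeline
    # job instead (see the discussion further down).
    node.outputs.result = Output(
        type="uri_folder", mode="rw_mount", path=custom_path
    )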

lovettchris commented 1 year ago

Hmmm this didn't work for me, I followed the example:

    # example how to change path of output on step level,
    # please note if the output is promoted to pipeline level you need to change path in pipeline job level
    score_with_sample_data.outputs.score_output = Output(
        type="uri_folder", mode="rw_mount", path=custom_path
    )

But when my job is submitted, it shows that the datastore for the output is still set to the workspaceblobstore:

[screenshot]

In my case the output is a file and so I'm trying to do this:

datastore_path = f"azureml://subscriptions/{subscription}/resourcegroups/{rg}/workspaces/{ws_name}/datastores/nasfacemodels/paths/Deci1"

model_path = f"{datastore_path}/deci_optimized_1.onnx"
dlc_path = f"{datastore_path}/model.dlc"
quant_dlc_path = f"{datastore_path}/model.quant.dlc"

from azure.ai.ml import dsl, Input, Output

@dsl.pipeline(
    compute=snpe_cluster,
    description="Quantization pipeline",
)
def quantization_pipeline(
    pipeline_job_data_input,
    model_input
):
    # using data_prep_function like a python call with its own inputs
    data_prep_job = data_prep_component(
        data=pipeline_job_data_input
    )

    # convert onnx to dlc
    convert_job = convert_component(
        model=model_input
    )

    # for the custom path to work we have to specify it again here, 
    # see https://github.com/Azure/azure-sdk-for-python/issues/27454
    convert_job.outputs.dlc = Output(type="uri_file", path=dlc_path, mode="rw_mount")

    # using train_func like a python call with its own inputs
    quant_job = quant_component(
        data=data_prep_job.outputs.quant_data,
        list_file='input_list.txt',
        model=convert_job.outputs.dlc
    )

    quant_job.outputs.quant_model = Output(type="uri_file", path=quant_dlc_path, mode="rw_mount")

    # a pipeline returns a dictionary of outputs
    # keys will code for the pipeline output identifier
    return {
        "pipeline_job_model": convert_job.outputs.dlc,
        "pipeline_job_quant_model": quant_job.outputs.quant_model
    }

pipeline = quantization_pipeline(
    pipeline_job_data_input=Input(type="uri_file", path=face_data.path),
    model_input=Input(type="uri_file", path=model_path)
)

The only thing that does work and comes from my custom blobstore is the model input, Input(type="uri_file", path=model_path).

I wish this would work. If not, it looks like I'll have to create my own "save to blobstore" components and inject them into my pipeline, which I'd rather not have to do...
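For illustration, such a workaround component might look like the following minimal sketch (the component name and the plain cp command are hypothetical, and with the fix discussed below it should not be needed):

from azure.ai.ml import command, Input, Output

# Hypothetical pass-through component: copies its input file to an output
# that gets bound to a custom datastore path at consumption time.
save_to_blobstore = command(
    name="save_to_blobstore",
    display_name="Copy a file into a custom datastore path",
    inputs={"src": Input(type="uri_file")},
    outputs={"dest": Output(type="uri_file")},
    command="cp ${{inputs.src}} ${{outputs.dest}}",
    environment="azureml://registries/azureml/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/43",
)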

D-W- commented 1 year ago

Hi @lovettchris, the root cause of this issue is that quant_model gets promoted to a pipeline-level output when it is returned from @dsl.pipeline. When a node-level output is promoted to a pipeline-level output, its node-level settings (path, mode) are overwritten by the pipeline-level output settings. And when the pipeline outputs are not configured, our system fills in default settings for them, so the node output settings get overwritten by the defaults.
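In other words, a possible workaround until then is to re-apply the settings on the promoted pipeline-level outputs themselves (a sketch based on the pipeline above, assuming pipeline-level outputs accept the same Output settings):

from azure.ai.ml import Input, Output

pipeline_job = quantization_pipeline(
    pipeline_job_data_input=Input(type="uri_file", path=face_data.path),
    model_input=Input(type="uri_file", path=model_path),
)
# Re-apply the custom paths on the pipeline-level outputs that the node
# outputs were promoted to, so the defaults don't overwrite them.
pipeline_job.outputs.pipeline_job_model = Output(
    type="uri_file", mode="rw_mount", path=dlc_path
)
pipeline_job.outputs.pipeline_job_quant_model = Output(
    type="uri_file", mode="rw_mount", path=quant_dlc_path
)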

We just implemented a fix on the SDK side that copies the node output settings to the pipeline level, which should fix the issue. You can try installing the following private build and check whether it works:

pip install azure-ai-ml==1.5.0a20230215003 --extra-index-url https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/

Also, I noticed you used the ARM ID format for the datastore path:

datastore_path = f"azureml://subscriptions/{subscription}/resourcegroups/{rg}/workspaces/{ws_name}/datastores/nasfacemodels/paths/Deci1"

You may change this to

datastore_path = f"azureml://datastores/nasfacemodels/paths/Deci1"

as @zhengfeiwang suggested.

lovettchris commented 1 year ago

Very cool, thanks Han, I'm testing out your fix.

lovettchris commented 1 year ago

It works, thanks! I no longer need to edit the outputs in the pipeline definition; instead, the path in the original component output definition is enough:

convert_component = command(
    name="convert",
    display_name="Convert .onnx to .dlc",
    description="Converts the onnx model to dlc format",
    inputs={
        "model": Input(type="uri_file")
    },
    outputs={
        "dlc": Output(type="uri_file", path=dlc_path, mode="rw_mount")
    },

    # The source folder of the component
    code=scripts_dir,
    command="""python3 convert.py \
            --model ${{inputs.model}} \
            --output ${{outputs.dlc}} \
            """,
    environment=f"{pipeline_job_env.name}:{pipeline_job_env.version}",
)

And this created the output: [screenshot]

which links back to my custom blob store (notice the last modified date on this file is today, which came from this pipeline execution).

[screenshot]

Very cool. Now I can run jobs all day that "accumulate" the results I need in a bigger combined blobstore. I did notice one weird thing, however: it also created this file in my blobstore, which is incorrect:

[screenshot]

It is a zero-byte file, so I'm not sure why it is there. Could this be some weird side effect or bug?

lovettchris commented 1 year ago

Thanks, by the way, regarding the path simplification you showed me:

datastore_path = f"azureml://datastores/nasfacemodels/paths/Deci1"

Originally I got stuck with the v2 API while trying to write something like this:

blobstore = ml_client.datastores.get(name='nasfacemodels')
pipeline = quantization_pipeline(
    pipeline_job_data_input=Input(type="uri_file", path=face_data.path),
    model_input=Input(type="uri_file", path=blobstore.path / "models" / "Deci2")
)

It would be cool if this "just worked", I think I can almost do it with this:

from pathlib import PurePosixPath

model_path = "azureml://" + str(PurePosixPath(blobstore.id) / "paths" / "models" / "Deci1")

but that's too complicated. It would be nice if the "Input" class (and Output class) had a more directly discoverable connection to the datastore object.
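Something like the following small helper approximates that today (just a sketch; it assumes only that the datastore object returned by ml_client.datastores.get() exposes a name attribute):

from pathlib import PurePosixPath

def datastore_uri(datastore, *parts: str) -> str:
    # Build a short-form azureml:// URI for a path inside a datastore.
    # Assumption: datastore.name is the registered datastore name, e.g.
    # ml_client.datastores.get(name="nasfacemodels").name == "nasfacemodels".
    rel = PurePosixPath(*parts)
    return f"azureml://datastores/{datastore.name}/paths/{rel}"

# Usage sketch:
# blobstore = ml_client.datastores.get(name="nasfacemodels")
# model_path = datastore_uri(blobstore, "models", "Deci2")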

cloga commented 1 year ago

Hi @lovettchris ,

Currently, this type of concatenation for paths is not supported; we only support plain text. Support for such advanced expressions is still in our backlog.

lovettchris commented 1 year ago

Thanks, hopefully the API can be improved soon; I found this particularly hard to discover.

You already have this API: blobstore = ml_client.datastores.get(name='nasfacemodels')

Just making it usable in the Input and Output path would be great. Or better yet, you could add a "store" parameter so I could do this:

      Input(type="blob_store", store=ml_client.datastores.get(name='nasfacemodels'), path='models/Deci2')

Then it would be even more clear that you CAN create a connection between pipeline inputs and outputs and Azure datastores...

TomSB1423 commented 1 year ago

I am also getting this 0 KB ghost file created when following the same custom uri_folder output path approach in the pipeline output construction. Could this be fixed, @D-W- / @cloga?

Below is a screenshot of what I am referring to: [screenshot]

apthagowda97 commented 1 year ago

Hi @D-W- ,

I am still facing this issue in azure-ai-ml 1.7.2. Any update on a permanent fix?

amrsharaf commented 1 year ago

Creating a pipeline with CLI v2 and specifying a custom output path still doesn't work; the default output path is always used: azureml://datastores/${{default_datastore}}/paths/azureml/${{name}}/${{output_name}}/

Any plans for fixing this?

amrsharaf commented 1 year ago

Never mind, there was a typo in the component output parameter. It would be great if an error could be thrown in this case.

github-actions[bot] commented 11 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.

TheMellyBee commented 9 months ago

My team is also still experiencing this. Was the temporary fix on a private build ever merged in as a public fix?

D-W- commented 9 months ago

Hi @TheMellyBee, the issue has been fixed and is available in azure-ai-ml>1.5.0. Could you post your issue here so we can check whether it's the same issue?

Here's a doc on how to set datastore for outputs for your reference: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-manage-inputs-outputs-pipeline?view=azureml-api-2&tabs=cli

D-W- commented 9 months ago

Hi @apthagowda97, @TomSB1423, and @lovettchris, sorry for the late response. Could you create a new issue to track the 0 KB ghost file so we can discuss it there? Since it's runtime behavior and doesn't seem related to the control-plane SDK, we'll involve the runtime devs to help investigate. The original output-setting issue should be fixed in azure-ai-ml>1.5.0.

bluebobbo commented 9 months ago

I'm having a very similar issue, except with batch endpoints. A week ago, custom prediction outputs worked for me through the Azure Machine Learning studio GUI. Now it gives me an irrelevant error message and won't even let me run the job.

[screenshot]

I have verified that the datastore is properly connected, as I am able to browse it within the Data assets. It seems like custom outputs do not work unless the output goes to my default workspaceblobstore.

So I tried using my default datastore 'workspaceblobstore', and sure enough it ran, but it did not accept the custom output path... Here is the run overview; notice there is only an Inputs table and no "Outputs" table. It simply defaulted to that "azureml//score" path: [screenshot]

Here is what is odd: if I look at the run's raw JSON, this is what it looks like... Notice there is no "outputDatasets": [screenshot]

However, going back a week, when custom outputs magically worked, this is what the run's raw JSON looked like: [screenshot]

Additionally, last week, in the same run, you can see that the run overview has an "Outputs" table: [screenshot]

I even went as far as trying Python with the latest azure.ai.ml SDK to invoke the batch endpoint, but to no avail. The MLClient.batch_endpoints.invoke() method will certainly run given the input and output, but it always writes predictions.csv to the workspaceblobstore "azureml//score" default path.
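For reference, the invocation shape I'm describing is roughly the following sketch (endpoint, datastore, and asset names are placeholders, and it assumes an azure-ai-ml version whose invoke() accepts an outputs= argument):

from azure.ai.ml import Input, Output
from azure.ai.ml.constants import AssetTypes

# Placeholder names throughout; the custom output path below is the part
# that gets ignored in favor of the default azureml//score location.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",
    inputs={"input_data": Input(type=AssetTypes.URI_FOLDER, path="azureml:my-data:1")},
    outputs={
        "score": Output(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/mydatastore/paths/scores/predictions.csv",
        )
    },
)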

bluebobbo commented 5 months ago

I believe this fixed itself or someone found the issue.