Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

AML Environments are not fully-reproducible #1088

Closed: dataders closed this issue 3 years ago

dataders commented 4 years ago

SDK version: 1.7.0; region: West US 2

We have had the snippet below unchanged in our codebase since at least 9/16/2019. It is part of a train.py script that is used in a HyperDriveStep. Fortunately, we never actually referenced this 'c:\\\\temp\\output' folder, but, unfortunately, this undocumented change in service behavior has brought all of our pipelines for this workstream (including scheduled production Pipelines) to a grinding halt.

As of August 3, 7:56 AM PST, we could call the below snippet in a ScriptRun on AMLCompute without issue. Less than 24 hours later, on Aug 4, 2020 at 7:30 AM PST, the same snippet throws the following error:

import os
os.makedirs('c:\\\\temp\\output', exist_ok=True)

Traceback (most recent call last):
  File "train.py", line 254, in <module>
    os.makedirs(args.output_dir, exist_ok=True)
  File "/azureml-envs/azureml_b645cefe04fae03d5517fcc901ff4606/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'c:\\\\temp\\output'

Here is a notebook that demonstrates a reproducible example. I've simplified it to just a ScriptRunConfig, rather than a PipelineRun with a HyperDriveStep.

Working run ID: HD_6b2da7a7-5dfa-4e62-b529-00c802d4307f_4 (August 3, 7:56 AM PST). Failing run ID: HD_7c714c83-0670-480f-bc64-e7f8c76237c6_3 (Aug 4, 2020, 7:30 AM PST).

Both runs were in the same experiment in our production workspace, and even used the same pre-built Docker image from ACR: avaprditsmlsvc8080365804.azurecr.io/azureml/azureml_60fa7a8e306e9ff54f8235b956576310
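
For reference, a minimal sketch of what such a ScriptRunConfig repro might look like with SDK 1.7; the workspace, compute target, environment, and experiment names here are illustrative, not taken from the actual notebook:

# Hedged sketch of a ScriptRunConfig repro; names below are placeholders.
from azureml.core import Environment, Experiment, ScriptRunConfig, Workspace
from azureml.core.runconfig import RunConfiguration

ws = Workspace.from_config()  # assumes a local config.json for the workspace

run_config = RunConfiguration()
run_config.target = "cpu-cluster"                      # hypothetical AmlCompute target
run_config.environment = Environment.get(ws, "myenv")  # hypothetical registered environment

src = ScriptRunConfig(
    source_directory="./compute/train",
    script="train.py",
    run_config=run_config,
)

run = Experiment(ws, "makedirs-repro").submit(src)
run.wait_for_completion(show_output=True)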

sonnypark commented 4 years ago

We are investigating this issue. Will keep you updated.

dataders commented 4 years ago

What Linux distro does AMLCompute use? My guess is there was a distro update.

sonnypark commented 4 years ago

Yes, that's what we are checking with the compute team. BTW, thanks for the reproducible example. But I'd guess the line below would never have worked on a Linux compute node? os.makedirs('c:\\temp\output', exist_ok=True) And I confirmed we are still able to create a directory like '/tmp\dir1'. So I assume what you have in your script is more like a relative path under the current directory, which is possibly a mounted blob path rather than a local path like /tmp/...?

dataders commented 4 years ago

os.makedirs('c:\\temp\output', exist_ok=True) worked every day from Sep 2019 (when that line was introduced) until this Tuesday. This behavior wasn't intentional (obviously misguided).

Our issue is that when we publish a pipeline, we expect it to run in exactly the same way on the same environment (including the distro). If the code worked for nine months and we didn't change anything, it shouldn't break (at least not without a heads up).

Basically, we had a script argument, --output_dir, with a default value of 'c:\\\\temp\\output'. Normally we pass PipelineData to script arguments when running pipelines; the default value was only meant for running train.py directly on our local machines for debugging.

However, 9 months ago, we removed --output_dir from the HyperDriveStep's estimator_entry_script_arguments, and the pipeline still worked for NINE MONTHS. If you look at the 70_driver_log.txt from the two run IDs above, the code worked on August 3 but not on August 4.

So, here is train.py (heavily abbreviated):

import argparse
import os
from pprint import pprint

import joblib

if __name__ == "__main__":
    # parameters (--model_name and other arguments are omitted in this abbreviation)
    parser = argparse.ArgumentParser()
    parser.add_argument('--output_dir', dest="output_dir",
                        default='c:\\\\temp\\output')
    args = parser.parse_args()
    print("all args: ")
    pprint(vars(args))

    # data, params_dict, and main() are defined earlier in the full script
    clf, metrics, lift = main(data, params_dict)

    os.makedirs('outputs', exist_ok=True)
    os.makedirs(args.output_dir, exist_ok=True)  # this is the call that started failing on Aug 4
    joblib.dump(value=clf, filename=os.path.join('outputs', args.model_name + '.pkl'))
    joblib.dump(value=clf, filename=os.path.join(args.output_dir, args.model_name + '.pkl'))
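
(Not from the thread: if a local-debugging default is still wanted, a hedged sketch of a platform-neutral alternative to the hard-coded Windows path could look like this.)

import argparse
import os
import tempfile

parser = argparse.ArgumentParser()
# Default is only for local debug runs; pipelines pass PipelineData instead.
parser.add_argument('--output_dir', dest="output_dir",
                    default=os.path.join(tempfile.gettempdir(), 'output'))
args = parser.parse_args()
os.makedirs(args.output_dir, exist_ok=True)  # works on both Windows and Linux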

pipeline.py excerpt

split_data = PipelineData('split_data', datastore=ds_pipeline)
model_data = PipelineData('model_data', datastore=ds_pipeline)

est_config_aml = Estimator(
    source_directory="./compute/train",
    entry_script="train.py",
    compute_target=compute_target,
    environment_definition=run_config.environment
)

# Tuning to AutoML
random_sampling = RandomParameterSampling({
    # 'boosting_type' : choice('gbdt', 'goss'),
    'learning_rate': quniform(0.02, 0.1, 0.01),
    'num_leaves': quniform(25, 250, 1),
    "max_bin": quniform(50, 400, 10),
    "min_child_samples": quniform(1, 1000, 25),
    "colsample_bytree": quniform(0.2, 1, 0.1),
    "subsample": quniform(0.1, 1, 0.1),
    "n_estimators": quniform(50, 800, 50),
    "max_depth": quniform(2, 8, 1)
})

hyperdrive_run_config = HyperDriveConfig(
    estimator=est_config_aml,  # AML
    hyperparameter_sampling=random_sampling,
    primary_metric_name="geometric mean",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=hyperdrive_runs,
    max_concurrent_runs=8)

hyperdrive_step = [
    HyperDriveStep(
        name='kickoff hyperdrive jobs',
        hyperdrive_config=hyperdrive_run_config,
        estimator_entry_script_arguments=[
                                            "--input_dir", split_data,
                                            # '--output_dir', model_dir, DELETED THIS LINE 9 MONTHS AGO
                                            '--model_name', 'deal_model',
                                            '--data_name', 'deal_data'],
        inputs=[split_data],
        metrics_output=hyperdrive_json,
        allow_reuse=pipeline_reuse
    )
]

sonnypark commented 4 years ago

Thx for the details. Now I get it. We just confirmed we have a regression from a blobfuse update, and we were able to confirm that the code below worked with the old version of it:

os.makedirs('c:\\\\temp\\output', exist_ok=True)

What it actually does is create a directory with the name c:/temp/output in your current working directory, so the full path looks like /mounted_path/c:/temp/output. We are working on reverting the blobfuse version.
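
To make that concrete, here is a hedged sketch of how the path is interpreted by plain Linux filesystem semantics (a temp dir stands in for the blobfuse mount; blobfuse's own handling of backslashes is the part that regressed):

import os
import tempfile

workdir = tempfile.mkdtemp()   # stand-in for the mounted working directory
os.chdir(workdir)

# On Linux only '/' separates path components; ':' and '\' are ordinary
# filename characters, so this creates a single, oddly named directory
# relative to the current working directory (i.e. under the mount).
os.makedirs('c:\\\\temp\\output', exist_ok=True)
print(os.listdir(workdir))     # one entry whose name contains the backslashes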

dataders commented 4 years ago

Our issue isn't that we can't do os.makedirs('c:\\\\temp\\output', exist_ok=True) anymore. We already pushed a hotfix to remove the code causing the error, so reverting isn't necessary on our account. Our issue has always been that blobfuse versions (and perhaps other packages) can change on a published pipeline without any advance warning. A solution would be to guarantee that changes like this can't be introduced in the future. In the case of blobfuse, it seems like it will need to be pinned to a tag or commit hash?

sonnypark commented 4 years ago

There was a regression from the storage team, specifically when paths contain a "\", that we had to work with them on rolling back. They are making a fix to handle this scenario, but we are also hardening our system to test this specific use case before releasing.

dataders commented 4 years ago

We don't care about backslashes in paths. That was just a symptom of the larger problem: the Environments of a Published Pipeline are not fully guaranteed to stay the same, even when the Environment definition is the same! @rastala

vizhur commented 4 years ago

@swanderz with the same environment definition AzureML guarantees reproducibility, unless:

- the derived image was deleted from ACR, or
- the base image was updated.

For the second case, consider using base_image@sha256:<digest>, or at least a version tag, if an AzureML base image is used as the base image.

If the client was updated, the new version can have a new default base image, so if the environment got instantiated for the run and not retrieved from the workspace (env = Environment.get(workspace, "myenv")), it will reference the latest version of the base image.
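
A hedged sketch of what pinning the base image on an Environment could look like (the image reference and digest below are placeholders, not real values):

from azureml.core import Environment, Workspace

ws = Workspace.from_config()

env = Environment.get(ws, "myenv")   # retrieve the registered environment
env.docker.base_image = "mcr.microsoft.com/azureml/base@sha256:<digest>"  # placeholder digest
env.register(ws)                     # re-register so future runs resolve to this exact image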

dataders commented 4 years ago

Looks like I drafted a response yesterday and never actually posted it...

The derived image wasn't deleted from ACR; you can see that in both run instances the image GUID was the same.

My understanding is that blobfuse (a.k.a. azure-storage-fuse) is not included in the azureml_base image (I can't find it in the Dockerfile).

I'd love to be wrong, though. My thinking right now is that pinning docker_base_image to a specific Docker tag or commit still wouldn't have prevented the problem, right?

vizhur commented 4 years ago

Sorry, I don't quite understand the problem.

You are correct that azure-storage-fuse is not included in the Dockerfile/base image. We don't have a Python layer on the base images; that gets created from the Environment object's conda dependencies. You will keep using the cached image as long as it is in ACR and the Environment object is the same. If you want to update to the latest version of any unpinned dependency, you can either do Environment("new_environment").register(ws).build(ws), delete the image, or simply pin the necessary dependency.
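
A hedged sketch of the cache-reuse and explicit-rebuild patterns mentioned above (workspace and environment names are illustrative):

from azureml.core import Environment, Workspace

ws = Workspace.from_config()

# Pattern 1: keep hitting the cached image by reusing the registered environment as-is.
env = Environment.get(ws, "myenv")

# Pattern 2: register a new environment and build it explicitly, which is when
# unpinned dependencies get resolved again.
new_env = Environment("new_environment").register(ws)
build = new_env.build(ws)
build.wait_for_completion(show_output=True)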

Please provide more details if I misinterpreted your problem.

dataders commented 4 years ago

@vizhur , I appreciate you hanging in here with me.

Our problem is that we expected the environment we had been using in a Published Pipeline to stay the same. Unfortunately, blobfuse was upgraded and it broke our pipeline. If blobfuse isn't in the Docker image, and it isn't a pin-able Python package, how can we harden our environment to prevent future breaking changes introduced by blobfuse (and perhaps other similarly integrated packages)?

vizhur commented 4 years ago

As far as I can see from the sample you shared, you did hit the cache, since there is no build step, so the environment itself is reproducible and the image itself is fine. That was a service-side regression, as far as I understand.

dataders commented 3 years ago

Closing this issue, as I just heard from the support engineer that:

the blobfuse issue was a broken scenario, and the blobfuse team already had a related GitHub issue (https://github.com/Azure/azure-storage-fuse/issues/438). Also, our PM team has an item with the blobfuse team to make sure this particular scenario is well tested before release.