Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Permission denied (read-only file system) #1677

Open michielva opened 2 years ago

michielva commented 2 years ago

Situation

We are using AzureML pipelines to train neural networks. The raw data used to train the model is stored in our Azure blob storage, and the trained models (and the checkpoints written during training) are saved there as well. This has worked perfectly for the last few months; more than 50 pipelines were run that way.

Issue

For the past few days, we have consistently gotten the same error when trying to train a model. It occurs after the first epoch, when the model tries to save the current checkpoint and permission is denied. Permission seems to be denied because the file system is read-only. However, nothing in the access rights suggests that this file is any different. None of the other files (images, annotations) seem to raise any issue.

Even rerunning the exact same pipelines on the exact same data that worked perfectly last week results in this error.

tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at save_restore_v2_ops.cc:157 : Permission denied: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system
...
Epoch 00001: val_loss improved from inf to 0.10273, saving model to /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.19626808166503906 seconds
...
tensorflow.python.framework.errors_impl.PermissionDeniedError: /mnt/azureml/cr/j/0ddbaa5dfd4243c4bd18feabd6037209/cap/data-capability/wd/output_84f84eab_univision_ai/pc_ds_2021_v3/refinement-reg/v2/results/128_c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1/export/saved_model/variables/variables_temp/part-00000-of-00001.data-00000-of-00001.tempstate5255274353572806690; Read-only file system [Op:SaveV2]

What more can I try or check? There were no package version changes; everything stayed the same.
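One thing that could be checked from inside the training script is whether the output directory accepts writes at all, before the first checkpoint is attempted. The following is only a minimal sketch using the Python standard library; `probe_writable` is a hypothetical helper name, not part of azureml-sdk or TensorFlow, and the directory passed to it is whatever path the run writes checkpoints to.

```python
import errno
import os
import tempfile
from typing import Tuple


def probe_writable(directory: str) -> Tuple[bool, str]:
    """Try to create and delete a small temporary file in `directory`.

    Returns (True, "ok") on success, or (False, reason) if the OS refuses.
    Hypothetical helper for diagnosis only.
    """
    try:
        os.makedirs(directory, exist_ok=True)
        # The temporary file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, prefix="write_probe_"):
            pass
        return True, "ok"
    except OSError as exc:
        # EROFS is the errno that corresponds to "Read-only file system".
        name = errno.errorcode.get(exc.errno, "unknown")
        return False, f"{name}: {exc.strerror}"
```

Calling this on the checkpoint directory at the start of the run and logging the result would show whether the mount is already read-only before training starts, or only becomes so later.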

jarandaf commented 2 years ago

Our pipelines also started to fail a few days ago when trying to persist data (in our case, FileNotFoundError for pipeline data references used to write into our data stores). Same behaviour: published pipelines that had been running smoothly for a long time (no changes at all) suddenly started throwing these errors.

Maybe @lostmygithubaccount can shed some light on the matter? I am afraid something may have changed at file system/execution engine level.

shuyu42 commented 2 years ago

@michielva, @jarandaf could you share a run ID for the failed pipeline step? It can be found in the AML portal.

jarandaf commented 2 years ago

@shuyums2 sure, there you go:

7f32ffee-3f42-44d9-bba3-a5b9c6e913d5

Thank you.

michielva commented 2 years ago

@shuyums2

c1720fb5-81ed-45aa-a823-a1fa5ef1a8d1

Thanks.

michielva commented 2 years ago

Hi guys,

I just had a support call with someone from Microsoft. Apparently, a new runtime version is being used in the background of AzureML. We tried downgrading the runtime version, and this solved my issue.

If you define your pipelines in code (azureml-sdk), you can add this line after defining your environment. It forces AzureML to use the old runtime. You can also confirm this in the log files, which go back to the older versions (70_driver_log.txt).

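from azureml.core import Environment
env = Environment.from_dockerfile(...)
# Opt out of the new common runtime so the run uses the old one
env.environment_variables = {"AZUREML_COMPUTE_USE_COMMON_RUNTIME": "false"}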
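Note that assigning a new dict replaces any environment variables already set on the environment. If you have other variables defined, a merge may be safer. This is just a sketch in plain Python; `with_legacy_runtime` is a hypothetical helper name, and only the variable name and its value come from the line above.

```python
from typing import Dict, Optional

# Variable name taken from the workaround above.
RUNTIME_FLAG = "AZUREML_COMPUTE_USE_COMMON_RUNTIME"


def with_legacy_runtime(existing: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `existing` with the runtime flag set to "false".

    Hypothetical helper: other variables are kept, the input is not mutated.
    """
    merged = dict(existing or {})
    merged[RUNTIME_FLAG] = "false"
    return merged
```

It would be used as `env.environment_variables = with_legacy_runtime(env.environment_variables)`.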

@jarandaf Hope it helps with your issues too!

jarandaf commented 2 years ago

Thank you @michielva, that solved the issue for us as well!

When is the new runtime expected to be stable, then? This seems to be a temporary work-around @shuyums2.

yikei commented 2 years ago

Hi @jarandaf and @michielva, thanks for reporting these issues, and sorry for the inconvenience caused by these failures!

For the FileNotFoundError that @jarandaf encountered, we found the underlying bug that caused it and released an urgent hotfix across all regions yesterday. Would you mind re-submitting a job (without the environment variable) to check whether the issue is fixed? We highly recommend not disabling the new runtime, because doing so would prevent new updates and performance improvements from being used by your runs in the future. The new runtime has been rolling out for several months and has been relatively stable, and we appreciate feedback and reports of issues that help us continuously fix and improve it.
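To re-submit without the environment variable while keeping any other variables on the environment, the override can be dropped from the dict before submitting. This is a rough sketch only; `without_runtime_override` is a hypothetical helper name, not an SDK function.

```python
from typing import Dict, Optional

# Variable name as used in the workaround earlier in this thread.
RUNTIME_FLAG = "AZUREML_COMPUTE_USE_COMMON_RUNTIME"


def without_runtime_override(existing: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of `existing` with the runtime flag removed.

    Hypothetical helper: all other variables are left unchanged.
    """
    return {k: v for k, v in (existing or {}).items() if k != RUNTIME_FLAG}
```

For example, `env.environment_variables = without_runtime_override(env.environment_variables)` before submitting the run.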

For the PermissionDeniedError, we are still investigating. I will reply when we have an update!

yikei commented 2 years ago

Apologies for the delay with the PermissionDeniedError issue. This should be resolved now. @michielva - If possible, could you please retry and let us know if you are unblocked? Thanks!