Open johan12345 opened 3 years ago
Hey, any updates on this question?
Outputs are specific to a run, and I doubt they are automatically re-downloaded. What I do is store the state in a mounted storage account; when the job restarts, it reads the checkpoints from there and continues.
Thanks - yes, using a mounted storage account and a directory name unique for the pipeline run (e.g. generated in the script that starts the pipeline) seems to work.
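For anyone landing here, a minimal sketch of the approach described above, using only the Python standard library. It assumes the mounted storage account is visible to the script as an ordinary directory; the names `mount_root`, `run_id`, and the `epoch_XXXXX.json` file layout are hypothetical and not part of any Azure ML API. A real training script would save model weights rather than a small JSON dict.

```python
import json
import os
import re
from pathlib import Path
from typing import Optional, Tuple


def checkpoint_dir(mount_root: str, run_id: str) -> Path:
    """Return (and create) a checkpoint directory unique to one pipeline run.

    `mount_root` is assumed to be the path where the storage account is mounted;
    `run_id` is assumed to be generated by the script that starts the pipeline.
    """
    path = Path(mount_root) / "checkpoints" / run_id
    path.mkdir(parents=True, exist_ok=True)
    return path


def save_checkpoint(ckpt_dir: Path, epoch: int, state: dict) -> Path:
    """Write the state for `epoch` via a temp file followed by a rename.

    On a local filesystem the rename is atomic, so a preemption mid-write
    leaves only a .tmp file. A mounted storage account may behave differently.
    """
    final = ckpt_dir / f"epoch_{epoch:05d}.json"
    tmp = ckpt_dir / (final.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, final)
    return final


def load_latest_checkpoint(ckpt_dir: Path) -> Tuple[int, Optional[dict]]:
    """Return (next_epoch, state), or (0, None) if no checkpoint exists yet."""
    pattern = re.compile(r"epoch_(\d+)\.json")
    found = []
    for p in ckpt_dir.iterdir():
        m = pattern.fullmatch(p.name)  # leftover .tmp files do not match
        if m:
            found.append((int(m.group(1)), p))
    if not found:
        return 0, None
    epoch, path = max(found, key=lambda item: item[0])
    return epoch + 1, json.loads(path.read_text())
```

At the start of the training script, calling `load_latest_checkpoint(checkpoint_dir(mount_root, run_id))` gives the epoch to resume from, so the same code path handles both a fresh start and a restart after preemption.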
It would be great if the documentation made it clearer what data is kept after a preemption and what is removed.
I've been storing my model checkpoints in a storage account container, which I pass as a command job output. Whenever the job is preempted, it gives me a file-not-found error, as if it no longer had access to the output I specified. The output mode is read/write mount.
Recently, I've been getting this warning as I send out jobs:
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.MLFlowModelJobOutput'> and will be ignored
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored
Could it be something related to this? I tried looking at the code but didn't find anything related to this. How can I maintain access to my output paths so my run doesn't fail whenever it is preempted?
I am running an Azure ML pipeline for machine learning training on a low-priority compute cluster, so occasionally the VM will be preempted and restarted at a later time. In this case, I want to resume training from where the VM was stopped by loading the last model I saved in the outputs directory. This use case is also mentioned in the docs:
However, it seems that while the saved models stored in the outputs directory are still shown in the Azure ML web interface after the VM restarts, my training script cannot find them in that directory. Are these files not downloaded before the script is restarted? Which directory can I use instead to store these files?