Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

How to resume training on low priority VMs? #1575

Open johan12345 opened 3 years ago

johan12345 commented 3 years ago

I am running an Azure ML pipeline for model training on a low-priority compute cluster, so occasionally the VM is preempted and restarted later. In that case, I want to resume training from where the VM was stopped by loading the last model I saved in the outputs directory.

This use case is also mentioned in the docs:

In general, we recommend using Low-Priority VMs for Batch workloads. You should also use them where interruptions are recoverable either through resubmits (for Batch Inferencing) or through restarts (for deep learning training with checkpointing).

However, it seems that while the saved models stored in the outputs directory are still shown in the Azure ML web interface after the VM restarts, my training script cannot find them in that directory. Are these files not downloaded before the script is restarted? Which directory can I use instead to store these files?
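The checkpoint-resume pattern described above can be sketched in plain Python. The file name and the JSON checkpoint format are illustrative assumptions; a real training script would save model weights (e.g. with `torch.save`) to whatever directory actually survives preemption:

```python
import json
import os


def train(checkpoint_dir, total_epochs=10):
    """Resumable training loop: reload the last checkpoint if one exists."""
    ckpt_path = os.path.join(checkpoint_dir, "checkpoint.json")
    start_epoch = 0
    state = {"loss": None}

    # If the job was preempted and restarted, pick up where we left off.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            saved = json.load(f)
        start_epoch = saved["epoch"] + 1
        state = saved["state"]

    for epoch in range(start_epoch, total_epochs):
        state["loss"] = 1.0 / (epoch + 1)  # placeholder for a real training step
        # Write the checkpoint after every epoch so at most one epoch is lost.
        with open(ckpt_path, "w") as f:
            json.dump({"epoch": epoch, "state": state}, f)

    return start_epoch, state
```

The key point of the thread is that `checkpoint_dir` must not be the run-local `./outputs` folder, since that is not restored on the new VM after preemption.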

johan12345 commented 2 years ago

Hey, any updates on this question?

rndazurescript commented 2 years ago

The outputs directory is specific to a run, and I doubt its contents are automatically re-downloaded. What I do is store the state in a mounted storage account; when the job restarts, it reads the checkpoints from there and continues.

johan12345 commented 2 years ago

Thanks - yes, using a mounted storage account and a directory name unique to the pipeline run (e.g. generated in the script that starts the pipeline) seems to work.

It would be great if the documentation made clearer, though, which data is kept in the case of preemption and which is removed.
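Generating the run-unique checkpoint directory mentioned above can be sketched as follows. The base folder name is an illustrative assumption; in the submitting script this relative path would be attached to a mounted datastore (e.g. via the SDK's output/dataset configuration, not shown here) and passed to the training step, so checkpoints live outside the run's transient outputs folder:

```python
import posixpath
import uuid


def run_checkpoint_dir(base="checkpoints"):
    """Return a datastore-relative directory unique to one pipeline run.

    Every submission gets its own subfolder, so restarted runs find their
    own checkpoints while concurrent runs never collide.
    """
    return posixpath.join(base, uuid.uuid4().hex)
```

Because the directory name is generated once, in the script that submits the pipeline, the preempted and restarted job receives the same path and can resume from its own checkpoints.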

daoterog commented 3 months ago

I've been storing my model checkpoints in a storage account container, which I pass as a command job output. Whenever the job is preempted, I get a file-not-found error, as if the job no longer had access to the output I specified. The output mode is read/write mount.

Recently, I've been getting this warning as I send out jobs:

```
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.MLFlowModelJobOutput'> and will be ignored
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored
pathOnCompute is not a known attribute of class <class 'azure.ai.ml._restclient.v2023_04_01_preview.models._models_py3.UriFolderJobOutput'> and will be ignored
```

Could it be related to this? I looked through the code but didn't find anything relevant. How can I keep access to my output paths so that my run doesn't fail whenever it is preempted?
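For reference, the kind of output described above might be declared like this in a v2 job YAML (the datastore name and path are illustrative assumptions; whether such a mounted output stays accessible across preemption is exactly the open question in this thread):

```yaml
outputs:
  checkpoints:
    type: uri_folder
    mode: rw_mount
    path: azureml://datastores/workspaceblobstore/paths/checkpoints/my-run
```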