**System Information (please complete the following information):**
- Model Builder Version: Nightly from last week
- Visual Studio Version: VS 2019
**Describe the bug**
If disk space is too low, CodeGen fails. We should catch this error and present an actionable message.
Ideally, there would be a retry button when CodeGen fails, which would restart the download of the Azure image model and rerun CodeGen. Otherwise the user has to restart the entire ML training process.
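As a sketch of the kind of pre-check that could make this failure actionable (the class and threshold here are hypothetical; `DriveInfo` and `IOException` are standard .NET APIs), something like:

```csharp
using System;
using System.IO;

static class DiskSpaceGuard
{
    // Rough placeholder estimate; the real required size would come from the
    // Azure run's model metadata rather than a hard-coded constant.
    private const long RequiredBytes = 500L * 1024 * 1024;

    // Returns true when the drive hosting targetPath has room for the model download.
    public static bool HasEnoughSpace(string targetPath)
    {
        var root = Path.GetPathRoot(Path.GetFullPath(targetPath));
        var drive = new DriveInfo(root);
        return drive.AvailableFreeSpace >= RequiredBytes;
    }
}

// Hypothetical call site before the download/CodeGen step:
// if (!DiskSpaceGuard.HasEnoughSpace(modelPath))
//     throw new IOException(
//         $"Not enough free disk space to download the model to {modelPath}. " +
//         "Free up space and click Retry.");
```

This would turn the current `NullReferenceException` into a message the user can act on before the download even starts.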
Stack:

```
at AzureML.AutoMLRunnerImages.<RunAutoMLAsync>d__27.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/RemoteAutoML/AutoMLRunnerImages.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AzureImageClassificationExperiment.<ExecuteAsync>d__13.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/AzureImageClassificationExperiment.cs:line 69
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__30.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 147
```
**To Reproduce**
After running Azure AutoML multiclass image classification, Model Builder stopped during CodeGen (or during the model download). I checked free disk space and had only 40 MB free. I suspect the failure was due to low disk space, though it could have another cause.
**Expected behavior**
I would expect a disk-space error message, or at least a file-write failure. The current error message is generic and isn't actionable.
As mentioned above, ideally there would be a retry button that restarts the Azure model download and CodeGen; otherwise the user has to restart the entire ML training process.
**Screenshots**

**Additional context**
The disk space used by the VS project folders for Azure AutoML image models adds up more quickly than I expected. Since I'm benchmarking across datasets, often with multiple runs per dataset, the size of each Model Builder-created folder became important.
After compiling, 5 copies of `bestModel.onnx` and 5 copies of `MLModel.zip` are present. In all, a single VS project folder came out to 1.7 GB. This is then multiplied by the number of runs/datasets I'm benchmarking, causing disk space issues for me.
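To see where the space goes, here is a small sketch (the class name is mine; the file patterns match the artifacts mentioned above) that totals the duplicated model files under a project folder:

```csharp
using System;
using System.IO;
using System.Linq;

class ModelFolderAudit
{
    static void Main(string[] args)
    {
        // Path to a Model Builder-generated VS project folder; defaults to the
        // current directory if no argument is given.
        var projectDir = args.Length > 0 ? args[0] : ".";

        foreach (var pattern in new[] { "bestModel.onnx", "MLModel.zip" })
        {
            // Count every copy of the artifact, including those duplicated
            // into bin/obj output folders by the build.
            var files = Directory
                .EnumerateFiles(projectDir, pattern, SearchOption.AllDirectories)
                .Select(f => new FileInfo(f))
                .ToList();
            long totalBytes = files.Sum(f => f.Length);
            Console.WriteLine(
                $"{pattern}: {files.Count} copies, {totalBytes / (1024.0 * 1024):F1} MB total");
        }
    }
}
```

Running something like this against one of my project folders is how I noticed the 5x duplication of each model file.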
End of log file:

```
2020-06-12 15:17:40.9359 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:43.5310 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:43.5310 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:43.5310 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:46.6321 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:46.6321 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:46.6321 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:49.0821 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:49.1091 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:49.1091 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:51.1243 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:51.1243 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:51.1313 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:53.2291 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:53.2461 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:53.2581 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:55.2968 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:55.3038 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:55.3038 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:57.3953 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:57.3953 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:57.3953 INFO (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4038 INFO Completed a training run (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4198 INFO Downloading best model to C:\Users\jormont\AppData\Local\Temp\MLVSTools\NIHMalariaCellClassificationML\NIHMalariaCellClassificationML.Model\bestModel.onnx.. (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4198 INFO Object reference not set to an instance of an object. (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4478 DEBUG Object reference not set to an instance of an object.
at AzureML.AutoMLRunnerImages.<RunAutoMLAsync>d__27.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/RemoteAutoML/AutoMLRunnerImages.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AzureImageClassificationExperiment.<ExecuteAsync>d__13.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/AzureImageClassificationExperiment.cs:line 69
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__30.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 147 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2020-06-14 13:57:11.5244 DEBUG Open Log FileC:\Users\jormont\AppData\Local\Temp\MLVSTools\logs\174e3458-b101-43b2-a0fb-622efc138c12.txt (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
```