dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Azure Image CodeGen Failed (Low Disk Space) #832

Open justinormont opened 4 years ago

justinormont commented 4 years ago

System Information (please complete the following information):

Describe the bug

If disk space is too low, CodeGen fails. We should catch this error and present an actionable error message.

Ideally, we would have a retry button if CodeGen fails, which would restart the download of the Azure image model and rerun CodeGen. Otherwise the user has to restart the entire ML training process.
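To make the retry idea concrete, here is a minimal sketch of retrying only the post-training step. All names here (`PostTrainingRetry`, `downloadAndCodeGen`, `askUserToRetry`) are hypothetical and not taken from the Model Builder code base; the sketch assumes the download and CodeGen can be invoked separately from training.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical sketch: none of these names come from Model Builder's code base.
public static class PostTrainingRetry
{
    // Runs only the post-training step (model download + CodeGen) and repeats it
    // on failure for as long as the user asks to retry, so the already completed
    // training run is never restarted.
    public static async Task<bool> RunWithRetryAsync(
        Func<Task> downloadAndCodeGen,               // assumed: download + CodeGen as one step
        Func<Exception, Task<bool>> askUserToRetry)  // assumed: backed by a "Retry" button
    {
        while (true)
        {
            try
            {
                await downloadAndCodeGen();
                return true;   // step succeeded
            }
            catch (Exception ex)
            {
                // Show the error; stop if the user declines to retry.
                if (!await askUserToRetry(ex))
                {
                    return false;
                }
            }
        }
    }
}
```

The user decides when to retry (for example, after freeing disk space), which is why the loop waits on `askUserToRetry` rather than retrying automatically.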

Stack:

   at AzureML.AutoMLRunnerImages.<RunAutoMLAsync>d__27.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/RemoteAutoML/AutoMLRunnerImages.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AzureImageClassificationExperiment.<ExecuteAsync>d__13.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/AzureImageClassificationExperiment.cs:line 69
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__30.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 147

To Reproduce

After running Azure AutoML multiclass image classification, Model Builder stopped during CodeGen (or during the model download). I checked free disk space and had 40 MB free. I suspect the failure was caused by the low disk space, though there could be another cause.

Expected behavior

I would expect a disk space error message, or a failure-to-write-file error. The current error message ("Object reference not set to an instance of an object.") is generic and isn't actionable.
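A rough sketch of the kind of check I have in mind, run before the model download. `DiskSpaceGuard` and the message text are hypothetical, and the required size would have to come from the model's actual download size. It uses only `System.IO` and assumes a local (non-UNC) path; `IsDiskFull` relies on the Windows error codes ERROR_HANDLE_DISK_FULL (39) and ERROR_DISK_FULL (112).

```csharp
using System;
using System.IO;

// Hypothetical sketch: helper names and message text are illustrative only.
public static class DiskSpaceGuard
{
    // Win32 error codes as they appear in the low 16 bits of IOException.HResult.
    private const int ErrorHandleDiskFull = 0x27; // 39
    private const int ErrorDiskFull = 0x70;       // 112

    // True if a caught IOException reports a full disk (Windows error codes).
    public static bool IsDiskFull(IOException ex)
    {
        int code = ex.HResult & 0xFFFF;
        return code == ErrorHandleDiskFull || code == ErrorDiskFull;
    }

    // Throws an IOException with an actionable message if the drive holding
    // targetPath has fewer than requiredBytes available to the current user.
    public static void EnsureFreeSpace(string targetPath, long requiredBytes)
    {
        string root = Path.GetPathRoot(Path.GetFullPath(targetPath));
        long available = new DriveInfo(root).AvailableFreeSpace;
        if (available < requiredBytes)
        {
            const long MB = 1024 * 1024;
            throw new IOException(
                $"Not enough disk space on {root}: {available / MB} MB free, " +
                $"about {requiredBytes / MB} MB needed to download the model. " +
                "Free up space and retry.");
        }
    }
}
```

Calling `EnsureFreeSpace` before the download gives an early, readable failure; `IsDiskFull` covers the case where the disk fills up mid-write.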

As mentioned above, ideally, we would have a retry button, which would restart the Azure model download and CodeGen. Otherwise the user has to restart the entire ML training process.

Screenshots

image

Additional context

The disk space used by the VS project folders for Azure AutoML image models adds up more quickly than I was expecting. Since I'm benchmarking across datasets, often with multiple runs per dataset, the size of each folder created by Model Builder became important.

After compiling, 5 copies of bestModel.onnx and 5 copies of MLModel.zip are present. In all, a single VS project folder came out to 1.7 GB: image

This is then multiplied by the number of runs/datasets I'm benchmarking, causing disk space issues for me.
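For anyone wanting to check their own project folders, a small sketch that counts the copies of a given file under a folder and sums their sizes. `ModelFolderSize` is a hypothetical helper, not part of Model Builder, and the folder path in the usage snippet is a placeholder.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper for measuring duplicated model files in a project folder.
public static class ModelFolderSize
{
    // Returns how many files named fileName exist anywhere under root,
    // and their combined size in bytes.
    public static (int Count, long Bytes) Measure(string root, string fileName)
    {
        var files = Directory
            .EnumerateFiles(root, fileName, SearchOption.AllDirectories)
            .Select(path => new FileInfo(path))
            .ToList();
        return (files.Count, files.Sum(f => f.Length));
    }
}
```

Usage, assuming `projectFolder` points at the VS project folder:

```csharp
foreach (var name in new[] { "bestModel.onnx", "MLModel.zip" })
{
    var (count, bytes) = ModelFolderSize.Measure(projectFolder, name);
    Console.WriteLine($"{name}: {count} copies, {bytes / (1024 * 1024)} MB");
}
```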

End of log file:

2020-06-12 15:17:40.9359 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:43.5310 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:43.5310 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:43.5310 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:46.6321 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:46.6321 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:46.6321 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:49.0821 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:49.1091 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:49.1091 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:51.1243 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:51.1243 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:51.1313 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:53.2291 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:53.2461 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:53.2581 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:55.2968 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:55.3038 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:55.3038 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:57.3953 INFO Running AutoML pipeline sweep... (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:57.3953 INFO Child run status:Completed: 1, Canceled: 1 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:17:57.3953 INFO  (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4038 INFO Completed a training run (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4198 INFO Downloading best model to C:\Users\jormont\AppData\Local\Temp\MLVSTools\NIHMalariaCellClassificationML\NIHMalariaCellClassificationML.Model\bestModel.onnx.. (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4198 INFO Object reference not set to an instance of an object. (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2020-06-12 15:18:01.4478 DEBUG Object reference not set to an instance of an object.
   at AzureML.AutoMLRunnerImages.<RunAutoMLAsync>d__27.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/RemoteAutoML/AutoMLRunnerImages.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AzureImageClassificationExperiment.<ExecuteAsync>d__13.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/AzureImageClassificationExperiment.cs:line 69
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__30.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 147 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2020-06-14 13:57:11.5244 DEBUG Open Log FileC:\Users\jormont\AppData\Local\Temp\MLVSTools\logs\174e3458-b101-43b2-a0fb-622efc138c12.txt (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)

JakeRadMSFT commented 4 years ago

Duplicate of #320

JakeRadMSFT commented 4 years ago

Not exact duplicate ... reopening.

beccamc commented 2 years ago

@JakeRadMSFT This is from two years ago. Can we close?

justinormont commented 2 years ago

This is old, but AFAIK still valid. So it depends on how you'd like to manage old bug reports / issues.