dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International
264 stars 56 forks source link

Image classification training fails when first label contains more than 10000 images #2102

Closed gsuberland closed 2 years ago

gsuberland commented 2 years ago

System Information

Describe the bug After moving an image classification training dataset to a different directory, updating it in the model builder UI, and attempting to retrain it, I ran into the following error being displayed by Visual Studio:

 filePath cannot be null or empty
Parameter name: filePath
   at Microsoft.ML.Runtime.Contracts.CheckNonEmpty(IExceptionContext ctx, String s, String paramName)
   at Microsoft.ML.ModelOperationsCatalog.Load(String filePath, DataViewSchema& inputSchema)
   at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.LocalAutoMLExperiment.<ExecuteAsync>d__15.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalAutoMLExperiment.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__21.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 160 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)

Uninstalling/reinstalling the extensions, clearing temp files, killing processes, rebooting, etc. did not resolve the issue.

I initially thought that this was related to #2046, since the exact same error and stacktrace was displayed, however on further investigation the stacktrace in the log is different to the one provided in that issue.

I investigated by throwing the DLLs into dotPeek and debugging ServiceHub.Host.CLR.x64.exe while it attempted to perform the training. I believe I have discovered the underlying bug.

When the model builder creates a new model, it proposes a set of initial parameters. In Microsoft.ML.ModelBuilder.AutoMLService.Proposer.Controller.Propose() there is a call to HyperParameterProposer.Propose(...), which sets an initial sample size of 10000 based on the initializeSampleSize field. In a dataset where the first label has more then 10000 items (rows), this results in all of the initial sample containing just one label. After this occurs, a value to key mapping is performed via Microsoft.ML.Transforms.ValueToKeyMappingTransformer, which produces only one label. This calls into ImageClassificationTrainer.TrainModelCore, which passes trainContext.TrainingSet.Data to InitializeTrainingGraph. Finally, in InitializeTrainingGraph, the code calls GetKeyCount() to find out how many keys (labels) were identified, then calls Contracts.CheckParam to ensure that it was greater than 1. This fails, throwing an exception. Since this exception occurs in a Task (I think it's part of an async path?) it doesn't end up appearing in the resultant stack trace displayed by VS, but it does appear in the log:

2022-04-19 17:23:01.2188 INFO Trial 0 encounter error with message: Only one class found in the Label column. To build a multiclass classification model, the number of classes needs to be 2 or greater
Parameter name: labelCount (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-19 17:23:01.2288 DEBUG    at Microsoft.ML.Vision.ImageClassificationTrainer.InitializeTrainingGraph(IDataView input)
   at Microsoft.ML.Vision.ImageClassificationTrainer.TrainModelCore(TrainContext trainContext)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.ModelBuilder.AutoMLService.MulticlassClassificationRunner.RunTrainTestSplit(TrialParameter parameter, TrainingConfiguration config, IDataView trainDatasetCache, IDataView testDatasetCache) in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Runner/MulticlassClassificationRunner.cs:line 70
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.Run() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 160
   at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.<RunAsync>d__19.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 169
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.<AddTrialAsync>d__16.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 49 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-19 17:23:01.2288 DEBUG update best config to trial id 0, current best metric is 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-19 17:23:01.2288 DEBUG create new thread 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)

The "filePath cannot be null or empty" exception displayed appears to be a generic result of any part of the model building process failing without producing results, so it is a red herring in terms of identifying the true cause of the issue.

Steps to reproduce A quick and easy way to reproduce this is just to make a new image classification model with a dataset where more than 10000 images are in the first label.

I believe the following pattern leads to this "error after moving training data" behaviour:

  1. An image classification model is created with fewer than 10000 images in the first category.
  2. Training is performed on that dataset at least once.
  3. More images are added to the directory, leading to more than 10000 images in the first category, and the model is retrained. This is successful because the model already contains information about the different labels, so it doesn't care that the initial sample of 10000 only contains one label.
  4. The model's dataset is moved to a different directory, and the path is updated in the model builder UI.
  5. The cached schema in the zip file is invalidated and the mbconfig file is reset to no longer contain any run history. As such, all information about the labels in the dataset is now lost.
  6. Re-training is attempted.
  7. The model has more than 10000 items in its first label, so the first 10000 rows of the TSV only include one label, which results in an error.

I was able to confirm this behaviour and temporarily work around it by creating a copy of my dataset that contained only 5000 items in the first category and 5000 in the second (I only have two categories), running an initial training pass, then moving the full dataset into that location and running the training a second time.

Expected behaviour The data should be sampled in a way that doesn't cause this problem. An ideal approach would be for the sampling to account for label ordering in the TSV; instead of just taking the first 10000 rows, iterate the full row set until all labels are found, then group rows by those labels and ensure that proportionate representation is present in the initial sample set. A quick and dirty alternative would be to just randomly pick rows and rely upon probability to ensure at least two labels are likely to be present in the initial training set.

It would also be helpful if the model builder would handle moving the dataset more gracefully. For example, when selecting a new location for the image data, it could iterate subdirectories in that target directory to identify all labels, then compare that to the cached schema and keep it if they match (or if the cached schema's labels at least fully intersect the new set of labels).

In addition, the error message displayed to the user should really include the stack trace for the training fault rather than the symptom of the training operation not producing any data.

Additional note I don't know enough about ML to be sure, but it seems to me that the "first 10k rows" behaviour may negatively impact the performance of the model. For example, consider a dataset with 9990 items in the first category, and 30000 in the second category. The initial training pass will only see 10 items from the second category despite it being far more populous. I may well be missing some key detail here, but perhaps not, so I thought I'd bring it up anyway. If it is potentially problematic, randomising the sampling seems beneficial.

AlizerUncaged commented 2 years ago

I have the same issue but I have lesser than 4000 images, these are my log files:

2022-04-21 20:03:09.8248 INFO start nni training (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:09.8298 INFO Experiment output folder: D:\New Temp\AutoML-NNI\Experiment-5I38HQ (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:09.8298 DEBUG row count is unknown, count it explicitly (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:09.8298 DEBUG count elapse 1ms (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:09.8298 INFO |     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     | (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:10.0297 DEBUG update train dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\train.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:10.0637 DEBUG update test dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\test.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:13.2553 INFO Trial 0 encounter error with message: Splitter/consolidator worker encountered exception while consuming source data (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:13.2663 DEBUG    at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_0.<ConsolidateCore>b__1()
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.Dispose(Boolean disposing)
   at Microsoft.ML.DataViewRow.Dispose()
   at Microsoft.ML.Vision.ImageClassificationTrainer.CacheFeaturizedImagesToDisk(IDataView input, String labelColumnName, String imageColumnName, ImageProcessor imageProcessor, String inputTensorName, String outputTensorName, String cacheFilePath, Dataset dataset, Action`1 metricsCallback, Nullable`1 validationFraction)
   at Microsoft.ML.Vision.ImageClassificationTrainer.TrainModelCore(TrainContext trainContext)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.ModelBuilder.AutoMLService.MulticlassClassificationRunner.RunTrainTestSplit(TrialParameter parameter, TrainingConfiguration config, IDataView trainDatasetCache, IDataView testDatasetCache) in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Runner/MulticlassClassificationRunner.cs:line 70
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.Run() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 160
   at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.<RunAsync>d__19.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 169
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.<AddTrialAsync>d__16.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 49 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:13.2663 DEBUG update best config to trial id 0, current best metric is 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:13.2663 DEBUG create new thread 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:13.2832 INFO 
===============================================Experiment Results=================================================
------------------------------------------------------------------------------------------------------------------
|                                                     Summary                                                    |
------------------------------------------------------------------------------------------------------------------
|ML Task: ImageClassification                                                                                    |
|Dataset:                                                                                                        |
|Label : Label                                                                                                   |
|Total experiment time : 0.00 Secs                                                                               |
|Total number of models explored: 0                                                                              |
------------------------------------------------------------------------------------------------------------------
 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:13.2832 INFO |                                              Top 0 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
------------------------------------------------------------------------------------------------------------------
 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:13.3062 DEBUG filePath cannot be null or empty
Parameter name: filePath
   at Microsoft.ML.Runtime.Contracts.CheckNonEmpty(IExceptionContext ctx, String s, String paramName)
   at Microsoft.ML.ModelOperationsCatalog.Load(String filePath, DataViewSchema& inputSchema)
   at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.LocalAutoMLExperiment.<ExecuteAsync>d__15.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalAutoMLExperiment.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__21.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 160 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:58.6086 INFO start nni training (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:58.6086 INFO Experiment output folder: D:\New Temp\AutoML-NNI\Experiment-5I38HQ (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:58.6086 DEBUG row count is unknown, count it explicitly (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:58.6086 DEBUG count elapse 1ms (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:58.6086 INFO |     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     | (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:58.6086 DEBUG update train dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\train.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:58.6086 DEBUG update test dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\test.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:59.8690 INFO Trial 0 encounter error with message: Splitter/consolidator worker encountered exception while consuming source data (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:59.8690 DEBUG    at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_0.<ConsolidateCore>b__1()
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.Dispose(Boolean disposing)
   at Microsoft.ML.DataViewRow.Dispose()
   at Microsoft.ML.Vision.ImageClassificationTrainer.CacheFeaturizedImagesToDisk(IDataView input, String labelColumnName, String imageColumnName, ImageProcessor imageProcessor, String inputTensorName, String outputTensorName, String cacheFilePath, Dataset dataset, Action`1 metricsCallback, Nullable`1 validationFraction)
   at Microsoft.ML.Vision.ImageClassificationTrainer.TrainModelCore(TrainContext trainContext)
   at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.ModelBuilder.AutoMLService.MulticlassClassificationRunner.RunTrainTestSplit(TrialParameter parameter, TrainingConfiguration config, IDataView trainDatasetCache, IDataView testDatasetCache) in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Runner/MulticlassClassificationRunner.cs:line 70
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.Run() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 160
   at System.Threading.Tasks.Task.Execute()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.<RunAsync>d__19.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 169
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.<AddTrialAsync>d__16.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 49 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:59.8690 DEBUG update best config to trial id 0, current best metric is 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:59.8690 DEBUG create new thread 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)
2022-04-21 20:03:59.8690 INFO 
===============================================Experiment Results=================================================
------------------------------------------------------------------------------------------------------------------
|                                                     Summary                                                    |
------------------------------------------------------------------------------------------------------------------
|ML Task: ImageClassification                                                                                    |
|Dataset:                                                                                                        |
|Label : Label                                                                                                   |
|Total experiment time : 0.00 Secs                                                                               |
|Total number of models explored: 0                                                                              |
------------------------------------------------------------------------------------------------------------------
 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:59.8690 INFO |                                              Top 0 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
------------------------------------------------------------------------------------------------------------------
 (Microsoft.ML.ModelBuilder.Utils.Logger.Info)
2022-04-21 20:03:59.8690 DEBUG filePath cannot be null or empty
Parameter name: filePath
   at Microsoft.ML.Runtime.Contracts.CheckNonEmpty(IExceptionContext ctx, String s, String paramName)
   at Microsoft.ML.ModelOperationsCatalog.Load(String filePath, DataViewSchema& inputSchema)
   at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.LocalAutoMLExperiment.<ExecuteAsync>d__15.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalAutoMLExperiment.cs:line 231
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.<StartTrainingAsync>d__21.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 160 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug)

The fix: Close vs and clear temp folder.

gsuberland commented 2 years ago

I don't think that matches this bug. There are technically three issues here:

  1. The same exception is shown in the UI for every case where the training fails.
  2. 2046, i.e. something going wrong

    in Microsoft.ML.Vision.ImageClassificationTrainer.CacheFeaturizedImagesToDisk

  3. This issue, which is specifically about the model builder failing to identify labels on the first training run with a large data set because it sequentially takes 10000 items as the first training set.

The first point is what's causing the confusion between which bug is which, and is arguably a UI bug.

The second point is what you're running into.

On Thu, 21 Apr 2022, 13:05 Alizer, @.***> wrote:

I have the same issue but I have lesser than 4000 images, these are my log files: 2022-04-21 20:02:56.4546 INFO Set log file path to D:\New Temp\MLVSTools\logs\Loli-CR8MPK.txt ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:09.8248 INFO start nni training (Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:09.8298 INFO Experiment output folder: D:\New Temp\AutoML-NNI\Experiment-5I38HQ ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:09.8298 DEBUG row count is unknown, count it explicitly (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:09.8298 DEBUG count elapse 1ms (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:09.8298 INFO | Trainer MicroAccuracy MacroAccuracy Duration #Iteration | (Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:10.0297 DEBUG update train dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\train.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:10.0637 DEBUG update test dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\test.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:13.2553 INFO Trial 0 encounter error with message: Splitter/consolidator worker encountered exception while consuming source data ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:13.2663 DEBUG at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes) at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_0.b__1() at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.Dispose(Boolean disposing) at Microsoft.ML.DataViewRow.Dispose() at Microsoft.ML.Vision.ImageClassificationTrainer.CacheFeaturizedImagesToDisk(IDataView input, String labelColumnName, String imageColumnName, ImageProcessor imageProcessor, String inputTensorName, String outputTensorName, String cacheFilePath, Dataset dataset, Action1 metricsCallback, Nullable1 validationFraction) at Microsoft.ML.Vision.ImageClassificationTrainer.TrainModelCore(TrainContext trainContext) at Microsoft.ML.Trainers.TrainerEstimatorBase2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor) at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input) at Microsoft.ML.ModelBuilder.AutoMLService.MulticlassClassificationRunner.RunTrainTestSplit(TrialParameter parameter, TrainingConfiguration config, IDataView trainDatasetCache, IDataView testDatasetCache) in / /src/Microsoft.ML.ModelBuilder.AutoMLService/Runner/MulticlassClassificationRunner.cs:line 70 at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.Run() in //src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 160 at System.Threading.Tasks.Task.Execute() --- End of stack trace from previous location where exception was thrown

at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.d__19.MoveNext() in /

/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 169 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.d__16.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 49 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:13.2663 DEBUG update best config to trial id 0, current best metric is 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:13.2663 DEBUG create new thread 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:13.2832 INFO ===============================================Experiment Results================================================= | Summary | |ML Task: ImageClassification | |Dataset: | |Label : Label | |Total experiment time : 0.00 Secs | |Total number of models explored: 0 | ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:13.2832 INFO | Top 0 models explored | | Trainer MicroAccuracy MacroAccuracy Duration #Iteration | ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:13.3062 DEBUG filePath cannot be null or empty Parameter name: filePath at Microsoft.ML.Runtime.Contracts.CheckNonEmpty(IExceptionContext ctx, String s, String paramName) at Microsoft.ML.ModelOperationsCatalog.Load(String filePath, DataViewSchema& inputSchema) at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.LocalAutoMLExperiment.d__15.MoveNext() in /

/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalAutoMLExperiment.cs:line 231 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ML.ModelBuilder.AutoMLEngine.d__21.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 160 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:58.6086 INFO start nni training ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:58.6086 INFO Experiment output folder: D:\New Temp\AutoML-NNI\Experiment-5I38HQ ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:58.6086 DEBUG row count is unknown, count it explicitly (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:58.6086 DEBUG count elapse 1ms (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:58.6086 INFO | Trainer MicroAccuracy MacroAccuracy Duration #Iteration | (Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:58.6086 DEBUG update train dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\train.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:58.6086 DEBUG update test dataset D:\New Temp\AutoML-NNI\Experiment-5I38HQ\test.bin (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:59.8690 INFO Trial 0 encounter error with message: Splitter/consolidator worker encountered exception while consuming source data (Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:59.8690 DEBUG at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes) at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_0.b__1() at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.Dispose(Boolean disposing) at Microsoft.ML.DataViewRow.Dispose() at Microsoft.ML.Vision.ImageClassificationTrainer.CacheFeaturizedImagesToDisk(IDataView input, String labelColumnName, String imageColumnName, ImageProcessor imageProcessor, String inputTensorName, String outputTensorName, String cacheFilePath, Dataset dataset, Action1 metricsCallback, Nullable1 validationFraction) at Microsoft.ML.Vision.ImageClassificationTrainer.TrainModelCore(TrainContext trainContext) at Microsoft.ML.Trainers.TrainerEstimatorBase2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor) at Microsoft.ML.Data.EstimatorChain1.Fit(IDataView input) at Microsoft.ML.ModelBuilder.AutoMLService.MulticlassClassificationRunner.RunTrainTestSplit(TrialParameter parameter, TrainingConfiguration config, IDataView trainDatasetCache, IDataView testDatasetCache) in / /src/Microsoft.ML.ModelBuilder.AutoMLService/Runner/MulticlassClassificationRunner.cs:line 70 at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.Run() in //src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 160 at System.Threading.Tasks.Task.Execute() --- End of stack trace from previous location where exception was thrown

at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.d__19.MoveNext() in /

/src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 169 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ML.ModelBuilder.AutoMLService.LocalDispatcher.d__16.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/Dispatcher/LocalDispatcher.cs:line 49 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:59.8690 DEBUG update best config to trial id 0, current best metric is 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:59.8690 DEBUG create new thread 0 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) 2022-04-21 20:03:59.8690 INFO ===============================================Experiment Results================================================= | Summary | |ML Task: ImageClassification | |Dataset: | |Label : Label | |Total experiment time : 0.00 Secs | |Total number of models explored: 0 | ( Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:59.8690 INFO | Top 0 models explored | | Trainer MicroAccuracy MacroAccuracy Duration #Iteration |

(Microsoft.ML.ModelBuilder.Utils.Logger.Info) 2022-04-21 20:03:59.8690 DEBUG filePath cannot be null or empty Parameter name: filePath at Microsoft.ML.Runtime.Contracts.CheckNonEmpty(IExceptionContext ctx, String s, String paramName) at Microsoft.ML.ModelOperationsCatalog.Load(String filePath, DataViewSchema& inputSchema) at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.LocalAutoMLExperiment.d__15.MoveNext() in /

/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalAutoMLExperiment.cs:line 231 --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.ML.ModelBuilder.AutoMLEngine.d__21.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 160 (Microsoft.ML.ModelBuilder.Utils.Logger.Debug) ` I fixed this my making a new solution and start training from there.

— Reply to this email directly, view it on GitHub https://github.com/dotnet/machinelearning-modelbuilder/issues/2102#issuecomment-1105116883, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRB7DE3EHEJ2DB6D7PWSSDVGE77VANCNFSM5T2CPSQQ . You are receiving this because you authored the thread.Message ID: @.***>

zewditu commented 2 years ago

@LittleLittleCloud could you look at this?

LittleLittleCloud commented 2 years ago

@gsuberland

Thanks for such this detailed investigation, we do have the logic of selecting a subset of dataset to start if the original dataset is too large. However we also shuffle the dataset before selecting subset so it's likely that in your case, even after shuffling the data, there's still only one label in the first 10000 rows of the dataset

So may I ask how's the distribution of labels your dataset looks like, is it an evenly distributed or biased one, and how many pictures does each label has.

One fix that comes into my mind is to sample dataset in a way to avoid producing a sequence that only contain one label in the first 10k. Although the shuffling algorithm we use here is Fisher–Yates so it produces an unbiased permutation, but there's still likely that the first 10k only contains one label if the original dataset is biased.

gsuberland commented 2 years ago

In the dataset that was being used there were about 40000 images total with a roughly 70/30 split across labels, so it seems extremely unlikely that the first set of 10k is missing both labels in a shuffled dataset, let alone doing so a good 30+ times in a row while I was debugging. Intuitively I'd guess that I've got more chance of winning the lottery at the exact same time as being struck by lightning :)

From what I could see it looked like maybe the first 10000 were taken from the dataset before the data was shuffled, but I can't be 100% sure on that one because the call stacks were quite difficult to follow.

If the shuffle is happening in the correct order, perhaps some intermediate step prior to the shuffle is failing to find labels?

The issue should reproducible by just creating 10k copies of the same small image in a directory called "A" and a few more images in a directory called "B". I noticed the ordering is important - if you have 10k items in the lexicographically-first label it fails, otherwise it succeeds, so it seems like it's definitely something in the code that puts labels together.

Another thought: perhaps the 10k sample count is a red herring and it's some other limit in the code that gets set to 10000? Might be worth a quick search the codebase to see if the hyperparameter code is the only place that number appears.

On Mon, 25 Apr 2022, 19:15 Xiaoyun Zhang, @.***> wrote:

@gsuberland https://github.com/gsuberland

Thanks for such this detailed investigation, we do have the logic of selecting a subset of dataset to start if the original dataset is too large. However we also shuffle the dataset before selecting subset so it's likely that in your case, even after shuffling the data, there's still only one label in the first 10000 rows of the dataset

So may I ask how's the distribution of labels your dataset looks like, is it an evenly distributed or biased one, and how many pictures does each label has.

One fix that comes into my mind is to sample dataset in a way to avoid producing a sequence that only contain one label in the first 10k. Although the shuffling algorithm we use here is Fisher–Yates so it produces an unbiased permutation, but there's still likely that the first 10k only contains one label if the original dataset is biased.

— Reply to this email directly, view it on GitHub https://github.com/dotnet/machinelearning-modelbuilder/issues/2102#issuecomment-1108888579, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRB7DACBSU47Z44FPJMV4TVG3ONXANCNFSM5T2CPSQQ . You are receiving this because you were mentioned.Message ID: @.***>

LittleLittleCloud commented 2 years ago

Hmmm that's interesting,,, I tried your way to see if I can reproduce this bug but it trains successfully,, can you share with us the .tsv file as well. It should be in temp folder and should look like this (the first 16 digit might be different)

17ae09b4-408f-4955-adf4-1cac9e157e65-image.ImageTrainData.tsv

LittleLittleCloud commented 2 years ago

In the dataset that was being used there were about 40000 images total with a roughly 70/30 split across labels, so it seems extremely unlikely that the first set of 10k is missing both labels in a shuffled dataset, let alone doing so a good 30+ times in a row while I was debugging. Intuitively I'd guess that I've got more chance of winning the lottery at the exact same time as being struck by lightning :) From what I could see it looked like maybe the first 10000 were taken from the dataset before the data was shuffled, but I can't be 100% sure on that one because the call stacks were quite difficult to follow. If the shuffle is happening in the correct order, perhaps some intermediate step prior to the shuffle is failing to find labels? The issue should reproducible by just creating 10k copies of the same small image in a directory called "A" and a few more images in a directory called "B". I noticed the ordering is important - if you have 10k items in the lexicographically-first label it fails, otherwise it succeeds, so it seems like it's definitely something in the code that puts labels together. Another thought: perhaps the 10k sample count is a red herring and it's some other limit in the code that gets set to 10000? Might be worth a quick search the codebase to see if the hyperparameter code is the only place that number appears. On Mon, 25 Apr 2022, 19:15 Xiaoyun Zhang, @.> wrote: @gsuberland https://github.com/gsuberland Thanks for such this detailed investigation, we do have the logic of selecting a subset of dataset to start if the original dataset is too large. However we also shuffle the dataset before selecting subset so it's likely that in your case, even after shuffling the data, there's still only one label in the first 10000 rows of the dataset So may I ask how's the distribution of labels your dataset looks like, is it an evenly distributed or biased one, and how many pictures does each label has. One fix that comes into my mind is to sample dataset in a way to avoid producing a sequence that only contain one label in the first 10k. Although the shuffling algorithm we use here is Fisher–Yates so it produces an unbiased permutation, but there's still likely that the first 10k only contains one label if the original dataset is biased. — Reply to this email directly, view it on GitHub <#2102 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABRB7DACBSU47Z44FPJMV4TVG3ONXANCNFSM5T2CPSQQ . You are receiving this because you were mentioned.Message ID: @.>

I just take a close look on how shuffle is implemented in ml.net, it seems that unlike what I used to think how shuffle is implemented, it uses a pool to avoid loading entire dataset into memory, which might be the case of this bug.

The size of that pool is defaulted to 1000, and when shuffling, it keeps pulling data randomly from the pool and insert new data in that pool again. So in your dataset, while the first label has ~27000 images, even after shuffling, it's still very likely that the first 25000 images would still have the same label.

We'll fix this bug in the later release, and your work-around should also work, thanks again for helping us investigating this bug!

beccamc commented 2 years ago

@LittleLittleCloud Sounds like the fix should happen on ML.NET side. Is there a bug tracking in that repo?

LittleLittleCloud commented 2 years ago

@LittleLittleCloud Sounds like the fix should happen on ML.NET side. Is there a bug tracking in that repo?

I'd rather say it's a feature in ML.Net (to have a pool for shuffling data). There might be some reason for such design.

@michaelgsharp do you have any comments or information on shuffling data in ml.net

LittleLittleCloud commented 2 years ago

After a few digging, it seems that CacheDataView.CanShuffle is true, which means when shuffling a CacheDataView, ml.net will shuffle both pool and source, which provides more randomness.

beccamc commented 2 years ago

A fix for this has been released! Please test it out and let us know if you see any problems!