Closed luisquintanilla closed 5 years ago
Using this pipeline worked.
var trainingPipeline =
    mapLabelTransform
        .Append(mlContext.Model.ImageClassification(
            "ImagePath",
            "LabelAsKey",
            arch: ImageClassificationEstimator.Architecture.ResnetV2101,
            epoch: 100,
            batchSize: 30,
            metricsCallback: (metrics) => Console.WriteLine(metrics)));
Switching back to the original code with a batch size of 150, or another value for that parameter, worked as well.
@luisquintanilla - So what exactly was causing the issue then?
Any idea why that message is being thrown? I tried the same pipeline as in your comment, but I am still getting that error message.
@CESARDELATORRE I'm not sure what happened, as I was not able to replicate it in this instance. I have experienced the issue in other runs, but I can't attribute it to anything specific. Re-running the application seems to "fix" it, but it's not clear what causes it in the first place, so it's difficult to replicate.
It might make sense to hold off a bit on this issue and try the new preview API we're releasing in a few days for Image Classification since it's been evolving significantly.
@luisquintanilla are you trying to run this in parallel with another instance of this code? Can you please provide a link to your repo with the complete sample so that we can repro it the same way you do? Also, which version of the NuGet packages are you using?
@ashbhandare is working on this.
@codemzs I was only running one instance of this code.
Here is the link to the repo
These are the NuGet packages being used.
Package | Version |
---|---|
Microsoft.ML | 1.4.0-preview |
Microsoft.ML.ImageAnalytics | 1.4.0-preview |
Microsoft.ML.Dnn | 0.16.0-preview |
I think I found a way to reproduce. It may be related to what @codemzs mentioned about running multiple instances (although not deliberately). If I run my application and stop it once it initializes, I run into this issue on subsequent runs. Deleting the `bin` and `obj` directories and re-running (without stopping) clears the issue, and the application trains a model. I suspect that training continues in the background even though the application has been stopped, which triggers the issue because multiple instances of the application are running.
I have isolated the source of this error. When you run the ImageClassification pipeline for the first time, the meta graph of the model (ResnetV2101 or InceptionV3) is downloaded, and it is reused in subsequent runs. If the run is interrupted (by stopping the application) while the download is in progress, the protobuf file is only partially downloaded. Subsequent runs then throw an error when they try to read this incomplete graph. A temporary workaround is to delete the protobuf file and rerun. I'm working on a fix.
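
For illustration only, here is one common way to keep an interrupted download from leaving a truncated file at the cached path: write to a temporary file next to the target and move it into place only once the write has completed. This is a minimal sketch and not the fix that went into ML.NET. `MetaGraphCache` and `EnsureDownloadedAsync` are hypothetical names, and the caller supplies the source URI and cache path.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of ML.NET.
public static class MetaGraphCache
{
    // Writes content to a temporary file and moves it to finalPath only after
    // the write has finished, so finalPath never holds a partial file.
    public static async Task<string> EnsureDownloadedAsync(
        Func<Stream, Task> writeContent, string finalPath)
    {
        if (File.Exists(finalPath))
            return finalPath; // a complete copy is already cached

        string tempPath = finalPath + "." + Guid.NewGuid().ToString("N") + ".part";
        try
        {
            using (var target = new FileStream(
                tempPath, FileMode.Create, FileAccess.Write, FileShare.None))
            {
                await writeContent(target);
            }

            try
            {
                File.Move(tempPath, finalPath);
            }
            catch (IOException) when (File.Exists(finalPath))
            {
                // Another run finished the download first; keep its copy.
            }
        }
        finally
        {
            // Remove the temporary file if the write failed or the move lost the race.
            if (File.Exists(tempPath))
                File.Delete(tempPath);
        }

        return finalPath;
    }

    // Convenience overload that streams the content over HTTP.
    public static Task<string> EnsureDownloadedAsync(
        HttpClient http, Uri source, string finalPath) =>
        EnsureDownloadedAsync(async target =>
        {
            using (var response = await http.GetAsync(
                source, HttpCompletionOption.ResponseHeadersRead))
            {
                response.EnsureSuccessStatusCode();
                using (var body = await response.Content.ReadAsStreamAsync())
                {
                    await body.CopyToAsync(target);
                }
            }
        }, finalPath);
}
```

If the process is killed outright, the `finally` block does not run and a stray `.part` file can remain, but the file at `finalPath` is either absent or complete, so a later run would not try to parse a truncated graph.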
System information
Issue
Tried to train an image classification DNN model using the Image Classification API on the Intel Image Classification dataset.
The following exception was raised
The model to train.
Source code / logs
Source Code
Logs
Additional output to the console: