[Image Classification API] TensorFlow exception triggered: input ended unexpectedly in the middle of a field

luisquintanilla commented 5 years ago

System information

OS version/distro: Windows 10
.NET Version (eg., dotnet --info): .NET Core 2.2

Issue

What did you do?

Tried to train an image classification DNN model using the Image Classification API on the Intel Image Classification dataset.

What happened?

The following exception was raised:

While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.

What did you expect?

The model to train.

Source code / logs

Source Code

public static IEnumerable<ImageInput> LoadImagesFromDirectory(string folder, bool useFolderNameasLabel = true)
{
    var files = Directory.GetFiles(folder, "*",
        searchOption: SearchOption.AllDirectories);

    foreach (var file in files)
    {
        if ((Path.GetExtension(file) != ".jpg") && (Path.GetExtension(file) != ".png"))
            continue;

        var label = Path.GetFileName(file);
        if (useFolderNameasLabel)
            label = Directory.GetParent(file).Name;
        else
        {
            for (int index = 0; index < label.Length; index++)
            {
                if (!char.IsLetter(label[index]))
                {
                    label = label.Substring(0, index);
                    break;
                }
            }
        }

        yield return new ImageInput()
        {
            ImagePath = file,
            Label = label
        };

    }
}

MLContext mlContext = new MLContext();

IEnumerable<ImageInput> train = LoadImagesFromDirectory(trainRelativePath, true).Take(10).ToArray();
IEnumerable<ImageInput> test = LoadImagesFromDirectory(testRelativePath, true).Take(10).ToArray();

IDataView trainSet = mlContext.Data.LoadFromEnumerable(train);
IDataView testSet = mlContext.Data.LoadFromEnumerable(test);

var mapLabelTransform = mlContext.Transforms.Conversion.MapValueToKey
  (outputColumnName: "LabelAsKey",
   inputColumnName: "Label",
   keyOrdinality: ValueToKeyMappingEstimator.KeyOrdinality.ByValue);

var trainingPipeline = 
    mapLabelTransform
   .Append(mlContext.Model.ImageClassification(
       "ImagePath",
       "LabelAsKey",
       arch: ImageClassificationEstimator.Architecture.ResnetV2101,
       epoch: 100,
       batchSize: 150,
       metricsCallback: (metrics) => Console.WriteLine(metrics)));

ITransformer trainedModel = trainingPipeline.Fit(trainSet);

Logs

System.FormatException
  HResult=0x80131537
  Message=Tensorflow exception triggered while loading model.
  Source=Microsoft.ML.Dnn
  StackTrace:
   at Microsoft.ML.Transforms.Dnn.DnnUtils.LoadTFSessionByModelFilePath(IExceptionContext ectx, String modelFile, Boolean metaGraph)
   at Microsoft.ML.DnnCatalog.ImageClassification(ModelOperationsCatalog catalog, String featuresColumnName, String labelColumnName, String scoreColumnName, String predictedLabelColumnName, Architecture arch, Int32 epoch, Int32 batchSize, Single learningRate, ImageClassificationMetricsCallback metricsCallback, Int32 statisticFrequency, DnnFramework framework, String modelSavePath, String finalModelPrefix, IDataView validationSet, Boolean testOnTrainSet, Boolean reuseTrainSetBottleneckCachedValues, Boolean reuseValidationSetBottleneckCachedValues, String trainSetBottleneckCachedValuesFilePath, String validationSetBottleneckCachedValuesFilePath)
   at ImageClassificationAPIMLNETSample.Program.Main(String[] args) in C:\Users\luquinta.REDMOND\source\repos\ImageClassificationAPIMLNETSample\ImageClassificationAPIMLNETSample\Program.cs:line 59

Inner Exception 1:
InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.

Additional output to the console:

Google.Protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field.  This could mean either that the input has been truncated or that an embedded message misreported its own length.
   at Google.Protobuf.CodedInputStream.RefillBuffer(Boolean mustSucceed)
   at Google.Protobuf.CodedInputStream.ReadRawBytes(Int32 size)
   at Google.Protobuf.CodedInputStream.ReadBytes()
   at Tensorflow.TensorProto.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Tensorflow.AttrValue.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Google.Protobuf.FieldCodec.<>c__DisplayClass16_0`1.<ForMessage>b__0(CodedInputStream input)
   at Google.Protobuf.Collections.MapField`2.Codec.MessageAdapter.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Google.Protobuf.Collections.MapField`2.AddEntriesFrom(CodedInputStream input, Codec codec)
   at Tensorflow.NodeDef.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Google.Protobuf.FieldCodec.<>c__DisplayClass16_0`1.<ForMessage>b__0(CodedInputStream input)
   at Google.Protobuf.Collections.RepeatedField`1.AddEntriesFrom(CodedInputStream input, FieldCodec`1 codec)
   at Tensorflow.GraphDef.MergeFrom(CodedInputStream input)
   at Google.Protobuf.CodedInputStream.ReadMessage(IMessage builder)
   at Tensorflow.MetaGraphDef.MergeFrom(CodedInputStream input)
   at Google.Protobuf.MessageExtensions.MergeFrom(IMessage message, Byte[] data)
   at Google.Protobuf.MessageParser`1.ParseFrom(Byte[] data)
   at Tensorflow.saver._import_meta_graph_with_return_elements(String meta_graph_or_file, Boolean clear_devices, String import_scope, String[] return_elements)
   at Microsoft.ML.Transforms.Dnn.DnnUtils.<>c__DisplayClass5_0.<LoadMetaGraph>b__0(Graph graph)
   at Tensorflow.Python.tf_with[TIn,TOut](TIn py, Func`2 action)

luisquintanilla commented 5 years ago

Using this pipeline worked.

var trainingPipeline = 
    mapLabelTransform
   .Append(mlContext.Model.ImageClassification(
       "ImagePath",
       "LabelAsKey",
       arch: ImageClassificationEstimator.Architecture.ResnetV2101,
       epoch: 100,
       batchSize: 30,
       metricsCallback: (metrics) => Console.WriteLine(metrics)));

Switching back to the original code with a batch size of 150, or any other value for that parameter, worked as well.

CESARDELATORRE commented 5 years ago

@luisquintanilla - So what exactly was causing the issue then?

zorthgo commented 5 years ago

Any idea why that message is being thrown? I tried the same pipeline as in your comment, but I am still getting that error message.

luisquintanilla commented 5 years ago

@CESARDELATORRE not sure what happened, as I was not able to replicate it in this instance. I have experienced the issue in other runs, but there's nothing I can attribute it to. Re-running the application seems to "fix" it, but it's not clear what causes it in the first place, so it's difficult to replicate.

CESARDELATORRE commented 5 years ago

It might make sense to hold off a bit on this issue and try the new preview API we're releasing in a few days for Image Classification since it's been evolving significantly.

codemzs commented 5 years ago

@luisquintanilla are you trying to run this in parallel with another instance of this code? Can you please provide a link to your repo with the complete sample so that we can repro it the same way you do? Also, which version of the NuGet package are you using?

@ashbhandare is working on this.

luisquintanilla commented 5 years ago

@codemzs I was only running one instance of this code.

Here is the link to the repo

These are the NuGet packages being used.

Package	Version
Microsoft.ML	1.4.0-preview
Microsoft.ML.ImageAnalytics	1.4.0-preview
Microsoft.ML.Dnn	0.16.0-preview

luisquintanilla commented 5 years ago

I think I found a way to reproduce this. It may be related to what @codemzs mentioned about running multiple instances (although not deliberately). If I run my application and stop it once it initializes, I run into this issue on subsequent runs. Deleting the bin and obj directories and re-running (without stopping) clears the issue, and the application trains a model. I suspect that training continues in the background even though the application has been stopped, and that the issue is triggered because multiple instances of the application are running.

ashbhandare commented 5 years ago

I have isolated the source of this error. When you run the ImageClassification pipeline for the first time, the meta graph of the model (ResnetV2101 or InceptionV3) is downloaded, and in subsequent runs it is reused. If the run is interrupted (by stopping the application) while the download is in progress, the protobuf file is only partially downloaded. Reading this incomplete graph in subsequent runs then throws the error. A temporary workaround is to delete the protobuf file and rerun. I'm working on a fix.
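
For illustration only, here is a minimal sketch of one common way to keep an interrupted download from leaving a truncated file at the path that later runs read from: write to a temporary file first and move it into place only once the copy has finished. This is not the actual ML.NET fix. The class name SafeDownload, the method SaveViaTempFile, and the ".partial" suffix are all hypothetical names chosen for this example.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of ML.NET.
public static class SafeDownload
{
    // Copies `source` to `destinationPath` through a temporary file, so that an
    // interrupted copy leaves only "<destinationPath>.partial" behind and never
    // a truncated file at the final path.
    public static void SaveViaTempFile(Stream source, string destinationPath)
    {
        string tempPath = destinationPath + ".partial";

        // FileMode.Create overwrites any leftover partial file from an earlier,
        // interrupted run.
        using (var temp = new FileStream(tempPath, FileMode.Create, FileAccess.Write))
        {
            source.CopyTo(temp);
        }

        // Reached only after the full copy succeeded. File.Move does not
        // overwrite here, so any existing destination is deleted first.
        if (File.Exists(destinationPath))
        {
            File.Delete(destinationPath);
        }
        File.Move(tempPath, destinationPath);
    }
}
```

With this pattern, code that checks whether the meta graph already exists at the final path only ever sees a complete file. A leftover ".partial" file is simply overwritten on the next attempt.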
