dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Image Classification does not evaluate data #6876

Closed. gktval closed this issue 8 months ago

gktval commented 8 months ago


Describe the bug

When running image classification, training finishes very quickly (11 sec), and the Accuracy and Cross-Entropy for every epoch are NaN. Is there a reason the data did not train, or a log where I can see what went wrong?

Training on another, very similar dataset runs fine, but with this particular dataset none of the data is evaluated. Perhaps there is a minimum width/height for training images? The images vary in size, but all of them should have a width and height greater than 40 px.

Below is a screenshot of the data: [image]

Also, here is a snippet of the training:

restore "C:\PROGRAM FILES\MICROSOFT VISUAL STUDIO\2022\PROFESSIONAL\COMMON7\IDE\COMMONEXTENSIONS\MICROSOFT\MODELBUILDER\AUTOMLSERVICE\RuntimeManager\tensorflow.gpu.csproj" --configfile "C:\PROGRAM FILES\MICROSOFT VISUAL STUDIO\2022\PROFESSIONAL\COMMON7\IDE\COMMONEXTENSIONS\MICROSOFT\MODELBUILDER\AUTOMLSERVICE\RuntimeManager\NuGet.config" -r win-x64 /p:UsingToolXliff=false /p:TorchSharpVersion=0.98.3 /p:TorchSharpCudaRuntimeVersion=1.11.0.1 /p:TensorflowRuntimeVersion=2.3.1 /p:BaseIntermediateOutputPath="C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1\obj"
publish "C:\PROGRAM FILES\MICROSOFT VISUAL STUDIO\2022\PROFESSIONAL\COMMON7\IDE\COMMONEXTENSIONS\MICROSOFT\MODELBUILDER\AUTOMLSERVICE\RuntimeManager\tensorflow.gpu.csproj" -r win-x64 -c Release --no-self-contained -o C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1 --no-restore /p:UsingToolXliff=false /p:TorchSharpVersion=0.98.3 /p:TorchSharpCudaRuntimeVersion=1.11.0.1 /p:TensorflowRuntimeVersion=2.3.1 /p:BaseOutputPath="C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1\bin\" /p:BaseIntermediateOutputPath="C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1\obj\"
start installing runtime in C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1
Determining projects to restore...
All projects are up-to-date for restore.
MSBuild version 17.8.0+6cdef4241 for .NET
C:\Program Files\dotnet\sdk\8.0.100-rc.2.23502.2\Sdks\Microsoft.NET.Sdk\targets\Microsoft.NET.RuntimeIdentifierInference.targets(311,5): message NETSDK1057: You are using a preview version of .NET. See: https://aka.ms/dotnet-support-policy [C:\PROGRAM FILES\MICROSOFT VISUAL STUDIO\2022\PROFESSIONAL\COMMON7\IDE\COMMONEXTENSIONS\MICROSOFT\MODELBUILDER\AUTOMLSERVICE\RuntimeManager\tensorflow.gpu.csproj]
tensorflow.gpu -> C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1\bin\Release\netstandard2.0\win-x64\tensorflow.gpu.dll
tensorflow.gpu -> C:\Users\user\AppData\Local\Temp\ModelBuilder\tensorflow-cuda.2.3.1\
install runtime successfully
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Channel started
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; Ensuring meta files are present., Kind=Trace] Channel started
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; Ensuring meta files are present., Kind=Trace] Channel finished. Elapsed 00:00:00.0037993.
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; Ensuring meta files are present., Kind=Trace] Channel disposed
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 0, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 0, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 1, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 1, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 2, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 2, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 3, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 3, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 4, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 4, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 5, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 5, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 6, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 6, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 7, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 7, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 8, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Validation, Batch Processed Count: 0, Epoch: 8, Accuracy: NaN, Cross-Entropy: NaN
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ImageClassificationTrainer; ImageClassificationTrainer, Kind=Trace] Phase: Training, Dataset used: Train, Batch Processed Count: 0, Epoch: 9, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01

To Reproduce

Steps to reproduce the behavior:

  1. Image Training
  2. Load the data into the model
  3. Train with CUDA
  4. Notice that the data is not evaluated

Expected behavior

Accuracy and cross-entropy should not be NaN.
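One way to read the trace output in this report: every epoch line shows "Batch Processed Count: 0", which suggests that no images reached the trainer at all, rather than that training diverged. A minimal Python sketch for spotting this in a saved log follows. It assumes the trace line format shown in this issue; the function name `find_empty_epochs` and the sample text are illustrative and not part of ML.NET or Model Builder.

```python
import re

# Matches the per-epoch trace fields as they appear in the log in this issue.
# \s* between fields tolerates line wraps in a copied-and-pasted log.
EPOCH_LINE = re.compile(
    r"Dataset used:\s*(?P<dataset>\w+),\s*"
    r"Batch Processed Count:\s*(?P<batches>\d+),\s*"
    r"Epoch:\s*(?P<epoch>\d+),\s*"
    r"Accuracy:\s*(?P<accuracy>[^,\s]+)"
)

def find_empty_epochs(log_text):
    """Return (dataset, epoch) pairs whose trace line reports zero processed batches."""
    empty = []
    for match in EPOCH_LINE.finditer(log_text):
        if int(match.group("batches")) == 0:
            empty.append((match.group("dataset"), int(match.group("epoch"))))
    return empty

# Illustrative sample in the same format as the reported log (hypothetical third line).
sample = (
    "Phase: Training, Dataset used: Train, Batch Processed Count: 0, "
    "Epoch: 0, Accuracy: NaN, Cross-Entropy: NaN, Learning Rate: 0.01\n"
    "Phase: Training, Dataset used: Validation, Batch Processed Count: 0, "
    "Epoch: 0, Accuracy: NaN, Cross-Entropy: NaN\n"
    "Phase: Training, Dataset used: Train, Batch Processed Count: 12, "
    "Epoch: 1, Accuracy: 0.9, Cross-Entropy: 0.3, Learning Rate: 0.01\n"
)

print(find_empty_epochs(sample))
```

If every epoch is flagged, the input files are a more likely culprit than the training settings.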

gktval commented 8 months ago

I figured out what the problem was. I had accidentally saved the files with TIFF (.tif) encoding while giving them a .png extension. So even though the images loaded into the image classification scenario and displayed in the overview, they were not used when training the model. Once the files' actual encoding matched the .png extension, training ran properly.

It would be nice if .tif files were an allowed extension in the image classification...
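A mismatch like the one described above can be caught before training by comparing each file's leading bytes with its extension. The following Python sketch does this, assuming only the well-known PNG, JPEG, and TIFF signatures. The helper names (`sniff_format`, `find_mismatches`) are hypothetical, and this is not how ML.NET itself decides which files to load.

```python
from pathlib import Path

# Well-known leading bytes ("magic numbers") for each format.
SIGNATURES = {
    "png": [b"\x89PNG\r\n\x1a\n"],
    "jpeg": [b"\xff\xd8\xff"],
    "tiff": [b"II*\x00", b"MM\x00*"],  # little-endian and big-endian TIFF
}

# File extensions mapped to the format they claim to be.
EXTENSIONS = {
    ".png": "png",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".tif": "tiff",
    ".tiff": "tiff",
}

def sniff_format(header):
    """Return the format name implied by the leading bytes, or None if unrecognized."""
    for name, magics in SIGNATURES.items():
        if any(header.startswith(magic) for magic in magics):
            return name
    return None

def find_mismatches(folder):
    """Return (path, claimed_format, actual_format) for files whose bytes disagree with their extension."""
    mismatches = []
    for path in sorted(Path(folder).rglob("*")):
        claimed = EXTENSIONS.get(path.suffix.lower())
        if claimed is None or not path.is_file():
            continue  # skip directories and extensions this sketch does not cover
        with path.open("rb") as handle:
            actual = sniff_format(handle.read(8))
        if actual != claimed:
            mismatches.append((str(path), claimed, actual))
    return mismatches
```

Calling `find_mismatches` on the dataset folder returns one tuple per suspicious file; a TIFF-encoded image named with a .png extension would be reported with a claimed format of "png" and an actual format of "tiff".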