GPU Performance non functional

bmp02050 commented 1 year ago

System Information (please complete the following information):

OS & Version: Windows 11
ML.NET Version: 1.7.0
.NET Version: 6.0

Describe the bug Following all available documentation here (https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/image-classification-api-transfer-learning) and the other available resources, I am using SciSharp.TensorFlow.Redist-Windows-GPU v2.3.1 to train on a 2060 TI, with the concrete images from the documentation as the training set. The training metrics and the predictions that come back are useless.

To Reproduce Steps to reproduce the behavior:

Open the solution folder
Run the API
Send a POST request in Postman to http://localhost:5267/api/Trainer/buildModels with the path to the top-level folder to build the model
Send a POST request in Postman to http://localhost:5267/api/Trainer/predict with form-data: Key: file, Value: the image; Key: modelPath, Value: the path to the model folder (without models.zip)
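
For anyone reproducing this without Postman, the predict request above could be sketched with HttpClient roughly as follows. This is only an illustration: the form-data keys `file` and `modelPath` and the URL come from the steps above, while the class name and both file paths are hypothetical placeholders.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

// Sketch only: mirrors the Postman form-data request to the predict endpoint.
// The class name and the two paths below are hypothetical placeholders.
public static class PredictRequestSketch
{
    public static async Task Main()
    {
        const string imagePath = @"C:\path\to\image.jpg";   // hypothetical
        const string modelPath = @"C:\path\to\modelFolder"; // hypothetical

        using var client = new HttpClient();
        using var form = new MultipartFormDataContent();
        using var fileStream = File.OpenRead(imagePath);

        // "file" and "modelPath" are the form-data keys named in the steps above.
        form.Add(new StreamContent(fileStream), "file", Path.GetFileName(imagePath));
        form.Add(new StringContent(modelPath), "modelPath");

        using var response = await client.PostAsync(
            "http://localhost:5267/api/Trainer/predict", form);

        // Print the status code and the raw response body for inspection.
        Console.WriteLine((int)response.StatusCode);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```

Sending several different images this way and comparing the response bodies is one way to confirm that the prediction does not change with the input.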

Expected behavior Training results in the following:

Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 0, Accuracy: 0.071183994, Cross-Entropy: 18.97249
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 1, Accuracy: 0.071183994, Cross-Entropy: 15.170444
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 2, Accuracy: 0.071183994, Cross-Entropy: 30.240627
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 3, Accuracy: 0.38297707, Cross-Entropy: 26.751282
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 4, Accuracy: 0.20089968, Cross-Entropy: 34.458427
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 5, Accuracy: 0.38297707, Cross-Entropy: 26.768661
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 6, Accuracy: 0.071183994, Cross-Entropy: 31.264683
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 7, Accuracy: 0.38297707, Cross-Entropy: 16.774061
Phase: Training, Dataset used: Validation, Batch Processed Count: 1124, Epoch: 8, Accuracy: 0.20089968, Cross-Entropy: 23.46656

Predictions return a 0 or 1 in one of the 6 categories, and the result is always the same regardless of which image is sent.

Screenshots, Code, Sample Projects The entire program will be zipped and attached

Additional context Using NVIDIA CUDA 10.1 and cuDNN 7.6.4, as required

New Compressed (zipped) Folder.zip

JakeRadMSFT commented 1 year ago

Have you tried Model Builder or the CLI?

I think what's happening here is that the Image you're sending isn't in the right shape somehow. I'd give Model Builder a shot and see if that gets you in the right direction.

I noticed you have a byte[] for the image ... I'd try making it an MLImage.

JakeRadMSFT commented 1 year ago

@luisquintanilla - I suspect this tutorial is outdated after the changes to MLImage but I haven't double checked yet.

ghost commented 1 year ago

This issue has been marked needs-author-action and may be missing some important information.

luisquintanilla commented 1 year ago

Thanks for this issue @bmp02050. Let us know if @JakeRadMSFT's suggestion fixed this for you.

bmp02050 commented 1 year ago

I will give this a try and let you know. I'll have to reinstall cuDNN 7.6.4 and CUDA 10.1 because I was trying to use TensorFlow and Python.

bmp02050 commented 1 year ago

Training was mediocre

Metrics for TensorFlow DNN Transfer Learning multi-class classification model
*-----------------------------------------------------------
AccuracyMacro = 0.1668, a value between 0 and 1, the closer to 1, the better
AccuracyMicro = 0.2563, a value between 0 and 1, the closer to 1, the better
LogLoss = 9.4621, the closer to 0, the better
LogLoss for class 1 = 16.2276, the closer to 0, the better
LogLoss for class 2 = 15.4033, the closer to 0, the better
LogLoss for class 3 = 14.9147, the closer to 0, the better
LogLoss for class 4 = 12.9011, the closer to 0, the better
LogLoss for class 5 = 10.618, the closer to 0, the better
LogLoss for class 6 = 1.5642, the closer to 0, the better

Using MLImage with an IFormFile as such:

[HttpPost("predict")]
public async Task<IActionResult> Predict([FromForm(Name = "file")] IFormFile file,
    [FromForm(Name = "modelPath")] String modelPath)
{
    try
    {
        var image = new InMemoryImageData()
        {
            Image = MLImage.CreateFromStream(file.OpenReadStream()),
            Label = file.FileName
        };

        var prediction = await Trainer.ClassifySingleImage(image, modelPath);

        return Ok(prediction);
    }
    catch (Exception ex)
    {
        return BadRequest(ex);
    }
}

Throws an error:

{
    "ClassName": "System.ArgumentOutOfRangeException",
    "Message": "Could not determine an IDataView type and registered custom types for member Image",
    "Data": {
        "ML_IsMarked": 1
    },
    "InnerException": null,
    "HelpURL": null,
    "StackTraceString": "   at Microsoft.ML.Data.InternalSchemaDefinition.GetVectorAndItemType(String name, Type rawType, IEnumerable`1 attributes, Boolean& isVector, Type& itemType)\r\n   at Microsoft.ML.Data.InternalSchemaDefinition.GetVectorAndItemType(MemberInfo memberInfo, Boolean& isVector, Type& itemType)\r\n   at Microsoft.ML.Data.SchemaDefinition.Create(Type userType, Direction direction)\r\n   at Microsoft.ML.Data.InternalSchemaDefinition.Create(Type userType, Direction direction)\r\n   at Microsoft.ML.Data.DataViewConstructionUtils.CreateInputRow[TRow](IHostEnvironment env, SchemaDefinition schemaDefinition)\r\n   at Microsoft.ML.PredictionEngineBase`2..ctor(IHostEnvironment env, ITransformer transformer, Boolean ignoreMissingColumns, SchemaDefinition inputSchemaDefinition, SchemaDefinition outputSchemaDefinition, Boolean ownsTransformer)\r\n   at Microsoft.ML.PredictionEngine`2..ctor(IHostEnvironment env, ITransformer transformer, Boolean ignoreMissingColumns, SchemaDefinition inputSchemaDefinition, SchemaDefinition outputSchemaDefinition, Boolean ownsTransformer)\r\n   at Microsoft.ML.PredictionEngineExtensions.CreatePredictionEngine[TSrc,TDst](ITransformer transformer, IHostEnvironment env, Boolean ignoreMissingColumns, SchemaDefinition inputSchemaDefinition, SchemaDefinition outputSchemaDefinition, Boolean ownsTransformer)\r\n   at Microsoft.ML.ModelOperationsCatalog.CreatePredictionEngine[TSrc,TDst](ITransformer transformer, Boolean ignoreMissingColumns, SchemaDefinition inputSchemaDefinition, SchemaDefinition outputSchemaDefinition)\r\n   at CardAnalyzer.Trainer.Train.ClassifySingleImage(InMemoryImageData image, String modelPath) in C:\\Users\\bradl\\source\\repos\\CardAnalyzer\\CardAnalyzer.Trainer\\Train.cs:line 137\r\n   at CardAnalyzer.API.Controllers.TrainerController.Predict(IFormFile file, String modelPath) in C:\\Users\\bradl\\source\\repos\\CardAnalyzer\\CardAnalyzer.API\\Controllers\\TrainerController.cs:line 39",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "ExceptionMethod": null,
    "HResult": -2146233086,
    "Source": "Microsoft.ML.Data",
    "WatsonBuckets": null,
    "ParamName": "rawType",
    "ActualValue": null
}

bmp02050 commented 1 year ago

@luisquintanilla @JakeRadMSFT This suggestion didn't work.

bmp02050 commented 1 year ago

I'm assuming at this point then that I should move to python...

JakeRadMSFT commented 1 year ago

@LittleLittleCloud thoughts?

LittleLittleCloud commented 1 year ago

MLImage was introduced in v2.0.0, so @bmp02050 maybe you can try updating your ML.NET version and registering ImageType on the Image member of your InMemoryImageData?

Something like
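
The snippet posted with this comment is not preserved here, so what follows is a sketch under assumptions and not the commenter's code. It shows roughly what registering the image type on `InMemoryImageData` might look like, assuming ML.NET 2.0.0 or later with the Microsoft.ML.ImageAnalytics package referenced. The property names match the controller code earlier in the thread; the 224x224 dimensions are a hypothetical placeholder.

```csharp
using Microsoft.ML.Data;             // MLImage
using Microsoft.ML.Transforms.Image; // ImageTypeAttribute

// Sketch only: assumes ML.NET 2.0.0 or later with the
// Microsoft.ML.ImageAnalytics package referenced.
public class InMemoryImageData
{
    // The attribute tells ML.NET which IDataView type to use for this member.
    // 224x224 (height, width) is a hypothetical placeholder; use the size
    // your pipeline actually expects.
    [ImageType(224, 224)]
    public MLImage Image { get; set; }

    public string Label { get; set; }
}
```

Without a registered type for the `Image` member, schema creation cannot map it to an IDataView column, which is consistent with the "Could not determine an IDataView type" exception reported above.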

In the meantime, are you also using an RTX 2060 card? We have a known issue where the loss doesn't go down during GPU training on an RTX 3080 card. Maybe the RTX 2060 has the same problem?

bmp02050 commented 1 year ago

I'll give this a whirl!

dotnet / machinelearning

GPU Performance non functional #6657