dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

[Image Classification DNN based] Cannot use GPU with NuGet 0.16.0-preview2 #4325

Closed CESARDELATORRE closed 4 years ago

CESARDELATORRE commented 4 years ago

The NuGet package 'Microsoft.ML.Dnn 0.16.0-preview2' includes a dependency on the CPU-based SciSharp.TensorFlow.Redist package.

Therefore the user cannot reference and use the SciSharp.TensorFlow.Redist-Windows-GPU package, because the CPU version takes precedence, afaik.

The NuGet package Microsoft.ML.Dnn 0.16.0-preview2 must not reference either of those packages. That way, whichever SciSharp TensorFlow redist (CPU or GPU) the user references from their own project determines whether it runs on CPU or GPU.

I know this is being fixed in the ML.NET source code repo with this PR after my heads-up: https://github.com/dotnet/machinelearning/pull/4324

But users mostly use the NuGet packages, so we probably need to publish a patch release for that package, something like the following?

Microsoft.ML.Dnn 0.16.1-preview2 ?

Any other solution so users can try Image Classification with GPU when using the NuGet packages?

codemzs commented 4 years ago

@CESARDELATORRE @eerhardt and I spoke about this and we feel there might be a workaround that people can use. Basically, you also add the GPU TF redist and then manually delete the CPU binary. @bpstark was going to verify this works so he can update this thread. I also feel we should wait until someone hits this issue before we consider a patch release ... I want to see how many of the people actually trying this preview have GPUs. Just looking at the NuGet downloads, I don't see very many people trying out "preview" packages, so I want to be a little cautious about publishing a patch so quickly unless there is a real need for it and a workaround does not exist.

bpstark commented 4 years ago

Tested the workaround and confirmed we can run with GPU TF for now. Steps for the workaround:

  1. Add the TF GPU dependency to your project, just as you would have originally.
  2. Locate your NuGet package cache; it should be something like C:\Users\\.nuget\packages
  3. Inside the package cache, copy \scisharp.tensorflow.redist\1.14.0\runtimes\win-x64\native\tensorflow.dll to your desktop or somewhere else for safekeeping.
  4. Overwrite that CPU DLL with the GPU version by copying \scisharp.tensorflow.redist-windows-gpu\1.14.0\runtimes\win-x64\native\tensorflow.dll to \scisharp.tensorflow.redist\1.14.0\runtimes\win-x64\native\tensorflow.dll

Optional: If you want to switch back to the CPU version, move the DLL you saved in step 3 back to its original location.
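For anyone who would rather script steps 3 and 4 than copy files by hand, here is a minimal sketch in Python. The relative DLL paths come from the steps above. The function names, the backup file name `tensorflow.cpu.dll`, and the choice to pass the cache root in as an argument are illustrative assumptions, not part of ML.NET or NuGet.

```python
import shutil
from pathlib import Path

# Relative locations taken from the steps above; the cache root is passed in.
NATIVE = Path("1.14.0") / "runtimes" / "win-x64" / "native" / "tensorflow.dll"
CPU_DLL = Path("scisharp.tensorflow.redist") / NATIVE
GPU_DLL = Path("scisharp.tensorflow.redist-windows-gpu") / NATIVE


def swap_in_gpu_dll(cache_root, backup_dir):
    """Back up the CPU tensorflow.dll (step 3), then overwrite it with the GPU one (step 4)."""
    cache_root, backup_dir = Path(cache_root), Path(backup_dir)
    cpu, gpu = cache_root / CPU_DLL, cache_root / GPU_DLL
    if not cpu.is_file() or not gpu.is_file():
        raise FileNotFoundError("expected both CPU and GPU tensorflow.dll in the cache")
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / "tensorflow.cpu.dll"  # hypothetical backup name
    if not backup.exists():  # a second run must not replace the backup with a GPU copy
        shutil.copy2(cpu, backup)
    shutil.copy2(gpu, cpu)
    return backup


def restore_cpu_dll(cache_root, backup):
    """Optional step: put the saved CPU DLL back in its original location."""
    shutil.copy2(backup, Path(cache_root) / CPU_DLL)
```

A call such as `swap_in_gpu_dll(Path.home() / ".nuget" / "packages", Path.home() / "Desktop")` would target the default per-user cache location. Adjust the first argument if your cache lives elsewhere.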

CESARDELATORRE commented 4 years ago

Reopening as that workaround is not working for me. I might be missing something else?

Initially, I had my sample working properly with CPU, then I followed the procedure, and when I try to run it, I get this exception:

Unhandled Exception: System.FormatException: Tensorflow exception triggered while loading model. ---> System.DllNotFoundException: Unable to load DLL 'tensorflow' or one of its dependencies: The specified module could not be found. (Exception from HRESULT: 0x8007007E)
   at Tensorflow.c_api.TF_NewGraph()
   at Tensorflow.Graph..ctor()
   at Microsoft.ML.Transforms.Dnn.DnnUtils.LoadMetaGraph(String path)
   at Microsoft.ML.Transforms.Dnn.DnnUtils.LoadTFSessionByModelFilePath(IExceptionContext ectx, String modelFile, Boolean metaGraph)
   --- End of inner exception stack trace ---
   at Microsoft.ML.Transforms.Dnn.DnnUtils.LoadTFSessionByModelFilePath(IExceptionContext ectx, String modelFile, Boolean metaGraph)
   at Microsoft.ML.DnnCatalog.ImageClassification(ModelOperationsCatalog catalog, String featuresColumnName, String labelColumnName, String scoreColumnName, String predictedLabelColumnName, Architecture arch, Int32 epoch, Int32 batchSize, Single learningRate, Boolean disableEarlyStopping, EarlyStopping earlyStopping, ImageClassificationMetricsCallback metricsCallback, Int32 statisticFrequency, DnnFramework framework, String modelSavePath, String finalModelPrefix, IDataView validationSet, Boolean testOnTrainSet, Boolean reuseTrainSetBottleneckCachedValues, Boolean reuseValidationSetBottleneckCachedValues, String trainSetBottleneckCachedValuesFilePath, String validationSetBottleneckCachedValuesFilePath)
   at ImageClassification.Train.Program.Main(String[] args) in D:\GitRepos\machinelearning-samples-master\samples\csharp\getting-started\DeepLearning_ImageClassification_Training\ImageClassification.Train\Program.cs:line 56

C:\Program Files\dotnet\dotnet.exe (process 16404) exited with code -532462766.

Am I missing any additional step not provided in the procedure above?

codemzs commented 4 years ago

@CESARDELATORRE the failure happened because you had CUDA 10.1 installed; once you reverted to 10.0 you were able to run on GPU. Closing this issue. Feel free to add more details if needed.
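The "Unable to load DLL 'tensorflow' or one of its dependencies" error above can come from a missing dependency rather than from tensorflow.dll itself, which fits the CUDA version mismatch. As a rough diagnostic, the Python sketch below checks whether a few CUDA libraries can be found in the directories listed on PATH. The DLL names are assumptions based on the usual CUDA 10.0 and cuDNN 7 file naming on Windows; they are not stated in this thread. Windows also searches locations other than PATH, so treat a "missing" result as a hint, not proof.

```python
import os
from pathlib import Path

# Assumed DLL names for CUDA 10.0 / cuDNN 7 on Windows; not stated in the thread.
ASSUMED_GPU_DEPENDENCIES = ["cudart64_100.dll", "cublas64_100.dll", "cudnn64_7.dll"]


def find_on_path(dll_name, search_path=None):
    """Return the first directory in search_path that contains dll_name, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for entry in search_path.split(os.pathsep):
        if entry and (Path(entry) / dll_name).is_file():
            return Path(entry)
    return None


def missing_dependencies(names=ASSUMED_GPU_DEPENDENCIES, search_path=None):
    """List the names that were not found in any directory on search_path."""
    return [name for name in names if find_on_path(name, search_path) is None]


if __name__ == "__main__":
    for name in missing_dependencies():
        print(f"not found on PATH: {name}")
```

If `cudart64_100.dll` is reported missing while a differently versioned CUDA runtime is installed, that would match the situation described in this comment.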