dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

Training Failed choosing local CPU in Object detection scenario, 'TorchSharp.torch' threw an exception #2891

Open NickAtMixxus opened 7 months ago

NickAtMixxus commented 7 months ago

System Information (please complete the following information):

Describe the bug

Training fails when choosing Local (CPU) in the Object detection scenario.

To Reproduce

1. In the .mbconfig UI, under Scenario, choose Object detection.
2. Under Environment, choose Local (CPU), then go to the next step.
3. Under Data, add the .json file created by VoTT after labeling images. (The images are then visible in Data Preview.) Go to the next step.
4. Under Train, start training. Training starts but is soon interrupted with a Model Builder error: The type initializer for 'TorchSharp.torch' threw an exception.

Expected behavior

I expected training to complete, or to get some information if something is not installed or is missing.

Additional context

Something may be wrong in my setup; with an older version of Model Builder I could train successfully. I have tried new projects, a clean install of Visual Studio, and removing and reinstalling Model Builder (both automatic and manual installation from the Visual Studio Marketplace). I have also tried two different sets of images, tagged in VoTT.

I also noticed that under Advanced training options, the fields Score threshold and IOU threshold (default settings 0.35 and 0.5) have red text next to them: "Not a valid input. The input must be a valid float and m"... It's impossible to read the rest of the sentence, and you can't expand the window. Changing those values has so far made no difference.

Some more details from the log file

End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.d_21.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 212

vulvquang commented 7 months ago

I had the same error. However, you can try Object Detection on Azure Machine Learning instead.

NickAtMixxus commented 7 months ago

Thanks, good to know! I'll just have to wait for an update then. About Azure: I thought I'd test on CPU/GPU first, so as not to spend anything before I know I have some useful images. I did try to start the Azure alternative, but I'm not done yet, even after watching a bunch of quick start videos. Although I could create a free account, it is pretty hard to understand where to create a compute: in the Azure portal or in Azure Machine Learning Studio? The profile looks a bit different in the portal vs. the studio. (There are so many docs and tutorials with different dates, plus docs for AI Studio and AutoML Computer vision.) I get the impression that the Azure and Machine Learning area is frequently updated and changed, and it's a bit confusing to keep up right now.

LittleLittleCloud commented 7 months ago

@NickAtMixxus Can you share the complete Model Builder log? (You can find its URL in the Visual Studio Output window.) From the error you mention (The type initializer for 'TorchSharp.torch' threw an exception.), I wonder if it is caused by a failure to load torch.dll.

NickAtMixxus commented 7 months ago

@LittleLittleCloud

OK. Just a quick update: I have also tried training on GPU, and that training is also quickly aborted with an error message. (I have installed CUDA 10.1 since that was mentioned in an Image Classification tutorial. I tried previously with a recent CUDA version, but it did not seem to matter.)

Attaching files with the messages from the Output window for CPU training and GPU training, and the complete log files, also from both CPU and GPU training:

OutputWinObjDetTrainCPU.txt
OutputWinObjDetTrainGPU.txt
LOGcopyCPUtrain.txt
LOGcopyGPUtrain.txt

And here are the error messages for each training run:

Error Message CPU Training

The type initializer for 'TorchSharp.torch' threw an exception.

   at Microsoft.ML.TorchSharp.Utils.TorchUtils.InitializeDevice(IHostEnvironment env)
   at Microsoft.ML.TorchSharp.AutoFormerV2.ObjectDetectionTrainer.Trainer..ctor(ObjectDetectionTrainer parent, IChannel ch, IDataView input)
   at Microsoft.ML.TorchSharp.AutoFormerV2.ObjectDetectionTrainer.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.AutoML.SweepablePipelineRunner.Run(TrialSettings settings)
   at Microsoft.ML.AutoML.SweepablePipelineRunner.RunAsync(TrialSettings settings, CancellationToken ct)
   at Microsoft.ML.AutoML.AutoMLExperiment.d24.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalObjectDetectionExperiment.d13.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalObjectDetectionExperiment.cs:line 133
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.d_21.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 212

Error Message GPU Training

The type initializer for 'TorchSharp.torch' threw an exception.

   at TorchSharp.torch.TryInitializeDeviceType(DeviceType deviceType)
   at Microsoft.ML.ModelBuilder.AutoMLService.LocalObjectDetectionExperiment.d_13.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalObjectDetectionExperiment.cs:line 53
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.ML.ModelBuilder.AutoMLEngine.d_21.MoveNext() in //src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 212