dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.
Creative Commons Attribution 4.0 International

ML.NET CLI object detection failed in Linux container #2945

Open ytchen2833 opened 4 months ago


System Information (please complete the following information):

Describe the bug

Telemetry

The ML.NET CLI tool collects usage data in order to help us improve your experience. The data doesn't include personal information or data from your datasets. You can opt-out of telemetry by setting the MLDOTNET_CLI_TELEMETRY_OPTOUT environment variable to '1' or 'true' using your favorite shell.

Read more about ML.NET CLI telemetry: https://aka.ms/mlnet-cli-telemetry

Start Training
Image List:
Image: file:/app/Stop-Signs/yannis-h-Sqez8_QTi8o-unsplash.jpg
Image: file:/app/Stop-Signs/will-porada-ZaGcU6BxJEc-unsplash.jpg
Image: file:/app/Stop-Signs/untitled-photo-3d6zCZ4lpBE-unsplash.jpg
Image: file:/app/Stop-Signs/tyler-nix-ahee6DMcUcI-unsplash.jpg
Image: file:/app/Stop-Signs/tom-dillon-t9Eaei-jz7Y-unsplash.jpg
Image: file:/app/Stop-Signs/suad-kamardeen-EcQW_Caifz8-unsplash.jpg
Image: file:/app/Stop-Signs/sandy-ching-ixLUOtNSSHQ-unsplash.jpg
Image: file:/app/Stop-Signs/samuel-sng-Uj5tQyHS2d0-unsplash.jpg
Image: file:/app/Stop-Signs/sam-xu-FgY6bF6emj0-unsplash.jpg
Image: file:/app/Stop-Signs/ron-mcclenny-EpHH_NKwKkE-unsplash.jpg
Image: file:/app/Stop-Signs/renan-kamikoga-vxx6ilmR-W4-unsplash.jpg
Image: file:/app/Stop-Signs/phil-garrison-ezvpHWyqsYg-unsplash.jpg
Image: file:/app/Stop-Signs/pedro-da-silva-unEmGQqdO7Q-unsplash.jpg
Image: file:/app/Stop-Signs/olivia-connell-Tc9KWrlOL0E-unsplash.jpg
Image: file:/app/Stop-Signs/naina-vij--j35s3zjPKU-unsplash.jpg
Image: file:/app/Stop-Signs/melanie-these-mXIViwsTvIc-unsplash.jpg
Image: file:/app/Stop-Signs/mason-wilkes-q-nm36mpsDw-unsplash.jpg
Image: file:/app/Stop-Signs/marcos-mathias-Jd7jw1Vf_aI-unsplash.jpg
Image: file:/app/Stop-Signs/luke-van-zyl-rKSHh6nEG1g-unsplash.jpg
Image: file:/app/Stop-Signs/kevork-kurdoghlian-eB2YX2TzNIA-unsplash.jpg
Image: file:/app/Stop-Signs/kevin-lee-dU8dAD8KoOI-unsplash.jpg
Image: file:/app/Stop-Signs/kelly-sikkema-4KzwQGsDRvA-unsplash.jpg
Image: file:/app/Stop-Signs/juli-kosolapova-DmtblAatFtk-unsplash.jpg
Image: file:/app/Stop-Signs/joshua-hoehne-WPrTKRw8KRQ-unsplash.jpg
Image: file:/app/Stop-Signs/josh-wilburne-3Cs4mF7fL3w-unsplash.jpg
Image: file:/app/Stop-Signs/jose-alonso-fl9kHTSPSvk-unsplash.jpg
Image: file:/app/Stop-Signs/jon-tyson-QNp4m7gU7BA-unsplash.jpg
Image: file:/app/Stop-Signs/jon-tyson-1IqQDH6KgdU-unsplash.jpg
Image: file:/app/Stop-Signs/john-matychuk-dJdcb11aboQ-unsplash.jpg
Image: file:/app/Stop-Signs/joel-mott-9r9Ex5iEc5o-unsplash.jpg
Image: file:/app/Stop-Signs/jad-limcaco-Y_J0phaFy2g-unsplash.jpg
Image: file:/app/Stop-Signs/giorgio-trovato-7PUrk4B18tY-unsplash.jpg
Image: file:/app/Stop-Signs/free-to-use-sounds-Vkt3uDeDkdg-unsplash.jpg
Image: file:/app/Stop-Signs/emiel-van-betsbrugge-rogwZG1NfII-unsplash.jpg
Image: file:/app/Stop-Signs/eilis-garvey-rb_PpjzWKnU-unsplash.jpg
Image: file:/app/Stop-Signs/doyoun-seo-Xe1S9aq2fqg-unsplash.jpg
Image: file:/app/Stop-Signs/diego-lozano-AmZC7bCrsko-unsplash.jpg
Image: file:/app/Stop-Signs/david-preston-mW2NETqR49A-unsplash.jpg
Image: file:/app/Stop-Signs/david-preston--t7S0WPRr4E-unsplash.jpg
Image: file:/app/Stop-Signs/chris-benson-h0UG2Bd_Few-unsplash.jpg
Image: file:/app/Stop-Signs/chris-bair-PJLDC3tA0Sc-unsplash.jpg
Image: file:/app/Stop-Signs/brantley-neal-_CAvB1vYIlY-unsplash.jpg
Image: file:/app/Stop-Signs/branden-tate-XgEHOPn7h_E-unsplash.jpg
Image: file:/app/Stop-Signs/bogomil-mihaylov-OHxTNeAtNRs-unsplash.jpg
Image: file:/app/Stop-Signs/ben-mater-YO3iFGBN6TU-unsplash.jpg
Image: file:/app/Stop-Signs/arthur-osipyan-vLusIJAYy_Q-unsplash.jpg
Image: file:/app/Stop-Signs/anton-mishin-_AR3i6Gck0Q-unsplash.jpg
Image: file:/app/Stop-Signs/andrii-leonov-W_rQAwVRPgg-unsplash.jpg
Image: file:/app/Stop-Signs/alexandre-lecocq-ndBWgMLw6Bc-unsplash.jpg
Image: file:/app/Stop-Signs/ajda-atz-HEKgHLpNgGk-unsplash.jpg
start Object detection
try to load libtorch.so from /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5
env:path: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/.dotnet/tools
restore "/root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/RuntimeManager/torchsharp.cpu.csproj" --configfile "/root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/RuntimeManager/NuGet.config" -r linux-x64 /p:UsingToolXliff=false /p:TorchSharpVersion=0.101.5 /p:TorchSharpCudaRuntimeVersion=2.1.0.1 /p:TensorflowRuntimeVersion=2.3.1 /p:BaseIntermediateOutputPath="/root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5\obj"
publish "/root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/RuntimeManager/torchsharp.cpu.csproj" -r linux-x64 -c Release --no-self-contained -o "/root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5" --no-restore /p:UsingToolXliff=false /p:TorchSharpVersion=0.101.5 /p:TorchSharpCudaRuntimeVersion=2.1.0.1 /p:TensorflowRuntimeVersion=2.3.1 /p:BaseOutputPath="/root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5\bin\" /p:BaseIntermediateOutputPath="/root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5\obj\"
start installing runtime in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5
Determining projects to restore...
Restored /root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/RuntimeManager/torchsharp.cpu.csproj (in 39.78 sec).

torchsharp.cpu -> /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/bin/Release/netstandard2.0/linux-x64/torchsharp.cpu.dll
torchsharp.cpu -> /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/

install runtime successfully
try to load libtorch.so from /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5
load libtorch.so from /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5 success
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ObjectDetectionTrainer; TrainModel, Kind=Trace] Channel started
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ObjectDetectionTrainer; TrainModel, Kind=Trace] Channel finished. Elapsed 00:00:00.8757638.
[Source=AutoMLExperiment-ChildContext, Kind=Trace] [Source=ObjectDetectionTrainer; TrainModel, Kind=Trace] Channel disposed
System.Runtime.InteropServices.ExternalException (0x80004005): select(): index 0 out of range for tensor of size [0, 256, 3, 3] at dimension 0
Exception raised from select_symint at ../aten/src/ATen/native/TensorShape.cpp:1813 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x71cecc086047 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libc10.so)
frame #1: + 0x111a36a (0x718de24ec36a in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #2: + 0x2c01de3 (0x718de3fd3de3 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #3: + 0x2c01fa9 (0x718de3fd3fa9 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #4: at::_ops::select_int::redispatch(c10::DispatchKeySet, at::Tensor const&, long, c10::SymInt) + 0xc5 (0x718de3bcfee5 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #5: + 0x4a85709 (0x718de5e57709 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #6: + 0x4a85a3c (0x718de5e57a3c in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #7: at::_ops::select_int::redispatch(c10::DispatchKeySet, at::Tensor const&, long, c10::SymInt) + 0xc5 (0x718de3bcfee5 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #8: + 0x43931ca (0x718de57651ca in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #9: + 0x439397c (0x718de576597c in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #10: at::_ops::select_int::call(at::Tensor const&, long, c10::SymInt) + 0x1b0 (0x718de3c31670 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #11: + 0x5747b98 (0x718de6b19b98 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #12: + 0x57483c4 (0x718de6b1a3c4 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #13: torch::nn::init::kaiming_uniform_(at::Tensor, double, c10::variant<torch::enumtype::kFanIn, torch::enumtype::kFanOut>, c10::variant<torch::enumtype::kLinear, torch::enumtype::kConv1D, torch::enumtype::kConv2D, torch::enumtype::kConv3D, torch::enumtype::kConvTranspose1D, torch::enumtype::kConvTranspose2D, torch::enumtype::kConvTranspose3D, torch::enumtype::kSigmoid, torch::enumtype::kTanh, torch::enumtype::kReLU, torch::enumtype::kLeakyReLU>) + 0x64 (0x718de6b1aa64 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #14: + 0x57a94f4 (0x718de6b7b4f4 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #15: + 0x57af5db (0x718de6b815db in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #16: torch::nn::Conv2dImpl::Conv2dImpl(torch::nn::ConvOptions<2ul>) + 0x2a8 (0x718de6b77e98 in /root/.local/share/ModelBuilder/torchsharp-cpu-0.101.5/libtorch_cpu.so)
frame #17: + 0x163578 (0x718da3363578 in /root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/libLibTorchSharp.so)
frame #18: + 0x11028f (0x718da331028f in /root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/libLibTorchSharp.so)
frame #19: THSNN_Conv2d_ctor + 0xfc (0x718da330347c in /root/.dotnet/tools/.store/mlnet-linux-x64/16.18.2/mlnet-linux-x64/16.18.2/tools/net8.0/any/libLibTorchSharp.so)
frame #20: [0x71ced0d2dce3]

at TorchSharp.torch.CheckForErrors()
at TorchSharp.torch.nn.Conv2d(Int64 inputChannel, Int64 outputChannel, Int64 kernelSize, Int64 stride, Int64 padding, Int64 dilation, PaddingModes paddingMode, Int64 groups, Boolean bias, Device device, Nullable`1 dtype)
at Microsoft.ML.TorchSharp.AutoFormerV2.RetinaHead..ctor(Int32 numClasses, Int32 inChannels, Int32 stackedConvs, Int32 featChannels, Int32 numBasePriors)
at Microsoft.ML.TorchSharp.AutoFormerV2.AutoFormerV2..ctor(Int32 numClasses, List`1 embedChannels, List`1 depths, List`1 numHeads, Device device)
at Microsoft.ML.TorchSharp.AutoFormerV2.ObjectDetectionTrainer.Trainer..ctor(ObjectDetectionTrainer parent, IChannel ch, IDataView input)
at Microsoft.ML.TorchSharp.AutoFormerV2.ObjectDetectionTrainer.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.AutoML.SweepablePipelineRunner.Run(TrialSettings settings)
at Microsoft.ML.AutoML.SweepablePipelineRunner.RunAsync(TrialSettings settings, CancellationToken ct)
at Microsoft.ML.AutoML.AutoMLExperiment.RunAsync(CancellationToken ct)
at Microsoft.ML.ModelBuilder.AutoMLService.LocalObjectDetectionExperiment.ExecuteAsync(IDataView trainData, IDataView validateData, CancellationToken ct) in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/LocalObjectDetectionExperiment.cs:line 133
at Microsoft.ML.ModelBuilder.AutoMLEngine.StartTrainingAsync(ITrainingConfiguration config, PathConfiguration pathConfig, CancellationToken userCancellationToken) in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 178
at Microsoft.ML.CLI.Runners.AutoMLRunner.ExecuteAsync() in /_/src/mlnet/Runners/AutoMLRunner.cs:line 95
at Microsoft.ML.CLI.Program.TrainAsync(ITrainingConfiguration trainingConfiguration, PathConfiguration pathConfig, AutoMLServiceLogLevel logLevel) in /_/src/mlnet/Program.cs:line 428
at Microsoft.ML.CLI.Program.<>c.<b54>d.MoveNext() in /_/src/mlnet/Program.cs:line 183
--- End of stack trace from previous location ---
at System.CommandLine.Invocation.CommandHandler.GetExitCodeAsync(Object value, InvocationContext context)
at System.CommandLine.Invocation.ModelBindingCommandHandler.InvokeAsync(InvocationContext context)
at System.CommandLine.Invocation.InvocationPipeline.<>cDisplayClass4_0.<b0>d.MoveNext()
--- End of stack trace from previous location ---
at System.CommandLine.Builder.CommandLineBuilderExtensions.<>cDisplayClass23_0.<b0>d.MoveNext()
--- End of stack trace from previous location ---
at Microsoft.ML.CLI.Program.<>cDisplayClass5_0.<b11>d.MoveNext() in /_/src/mlnet/Program.cs:line 360
--- End of stack trace from previous location ---
at System.CommandLine.Builder.CommandLineBuilderExtensions.<>c.<b__24_0>d.MoveNext()
--- End of stack trace from previous location ---
at System.CommandLine.Builder.CommandLineBuilderExtensions.<>cDisplayClass22_0.<b0>d.MoveNext()
--- End of stack trace from previous location ---
at System.CommandLine.Builder.CommandLineBuilderExtensions.<>cDisplayClass11_0.<b0>d.MoveNext()
--- End of stack trace from previous location ---
at System.CommandLine.Builder.CommandLineBuilderExtensions.<>c.<b10_0>d.MoveNext()
--- End of stack trace from previous location ---
at System.CommandLine.Builder.CommandLineBuilderExtensions.<>c__DisplayClass14_0.<b__0>d.MoveNext()
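
A note on reading the error: the failing tensor has size `[0, 256, 3, 3]`, which is the shape a 2-D convolution weight would have with zero output channels, 256 input channels and a 3x3 kernel. The trace shows the failure inside `kaiming_uniform_`, called from the `Conv2d` constructor in `RetinaHead`, where a `select` on dimension 0 cannot succeed on an empty dimension. The minimal sketch below only reproduces that arithmetic in plain Python; the helper names are hypothetical, and it is an assumption (not confirmed by this report) that the zero comes from `numClasses` or `numBasePriors` being 0.

```python
def conv2d_weight_shape(out_channels, in_channels, kernel_size, groups=1):
    """Shape of a 2-D convolution weight: [out, in / groups, k, k]."""
    if in_channels % groups != 0:
        raise ValueError("in_channels must be divisible by groups")
    return [out_channels, in_channels // groups, kernel_size, kernel_size]


def select_first(shape):
    """Mimic taking element 0 along dimension 0; fails when that dimension is empty."""
    if shape[0] == 0:
        raise IndexError(
            f"select(): index 0 out of range for tensor of size {shape} at dimension 0"
        )
    return shape[1:]


# Zero output channels gives exactly the shape reported in the exception.
empty_weight = conv2d_weight_shape(out_channels=0, in_channels=256, kernel_size=3)
print(empty_weight)

try:
    select_first(empty_weight)
except IndexError as err:
    print(err)
```

If this reading is right, the number of classes seen by the trainer would be the first thing to check, but that remains a guess until someone inspects the trainer inputs.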



**To Reproduce**
Steps to reproduce the behavior:
1. Follow the steps in the [ML.NET object detection tutorial](https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/object-detection-model-builder#create-a-new-vott-project), create the project in `$HOME/docker/mlnet/Stop-Signs/`, and export the `$HOME/docker/mlnet/Stop-Signs/vott-json-export/StopSignObjDetection-export.json` file.
2. Move dataset to `$HOME/docker/mlnet`.
3. Create container `sudo docker run --name sdk8 --gpus all -it --rm --ipc=host -v $HOME/docker/mlnet:/app mcr.microsoft.com/dotnet/nightly/sdk:8.0 bash`
4. (Run the commands in the following steps inside the container.)
5. Install mlnet-cli `dotnet tool install --global mlnet-linux-x64`
6. `export PATH=$PATH:/root/.dotnet/tools`
7. `cd /app/Stop-Signs`
8. `mlnet object-detection --dataset ./vott-json-export/StopSignObjDetection-export.json`
9. See error
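
Before step 8, it may help to confirm that the exported JSON actually contains tagged regions, since an export with no tags would give the trainer no classes to learn. The sketch below summarizes an export under the assumption that it follows the usual VoTT JSON layout: a top-level `assets` map whose entries each hold an `asset` object with a `name` and a `regions` list with `tags`. The function name and the inline sample are hypothetical, and the sample is not data from this report.

```python
import json
import sys
from collections import Counter


def summarize_vott_export(export):
    """Count assets, per-tag regions, and assets without any region.

    Assumes the VoTT JSON layout: export["assets"][id]["regions"][i]["tags"].
    """
    assets = export.get("assets", {})
    tag_counts = Counter()
    untagged = []
    for entry in assets.values():
        regions = entry.get("regions", [])
        if not regions:
            untagged.append(entry.get("asset", {}).get("name", "<unnamed>"))
        for region in regions:
            tag_counts.update(region.get("tags", []))
    return {"assets": len(assets), "tags": dict(tag_counts), "untagged": untagged}


if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Summarize a real export file passed on the command line.
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(summarize_vott_export(json.load(handle)))
    else:
        # Hypothetical two-asset sample, only to show the output structure.
        sample = {
            "assets": {
                "a1": {"asset": {"name": "one.jpg"}, "regions": [{"tags": ["Stop-Sign"]}]},
                "a2": {"asset": {"name": "two.jpg"}, "regions": []},
            }
        }
        print(summarize_vott_export(sample))
```

Saved under a name of your choice (for example `check_export.py`, a hypothetical filename), it could be run from `/app/Stop-Signs` as `python3 check_export.py ./vott-json-export/StopSignObjDetection-export.json`, provided Python is available in the container. An empty `tags` map would point at the dataset rather than at the runtime.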

**Expected behavior**
Finish object detection training.

**Additional context**
Classification and image-classification performed very well. 👏👏👏