dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

ImageClassification training stops #4558

Closed mveeris closed 4 years ago

mveeris commented 4 years ago

System information

Issue

Source code / logs

These are the last log lines in the output window. After that, CPU activity remains high for a few seconds and then drops to 0.

Phase: Bottleneck Computation, Dataset used: Validation, Image Index: 405
Phase: Bottleneck Computation, Dataset used: Validation, Image Index: 406
Phase: Bottleneck Computation, Dataset used: Validation, Image Index: 407
[Source=RowToRowMapperTransform; Cursor, Kind=Trace] Channel finished. Elapsed 00:00:11.0930351.
[Source=RowToRowMapperTransform; Cursor, Kind=Trace] Channel disposed
[Source=GenerateNumber; Cursor, Kind=Trace] Channel finished. Elapsed 00:00:11.0635798.
[Source=GenerateNumber; Cursor, Kind=Trace] Channel disposed
[Source=RangeFilter; Cursor, Kind=Trace] Channel finished. Elapsed 00:00:11.0637819.
[Source=RangeFilter; Cursor, Kind=Trace] Channel disposed
[Source=TextLoader; Binding, Kind=Trace] Channel started
[Source=TextLoader; ParseStats, Kind=Trace] Channel started
[Source=TextLoader; ParseStats, Kind=Trace] Channel finished. Elapsed 00:00:00.0109063.
[Source=TextLoader; ParseStats, Kind=Trace] Channel disposed
[Source=TextLoader; Binding, Kind=Trace] Channel finished. Elapsed 00:00:00.0332705.
[Source=TextLoader; Binding, Kind=Trace] Channel disposed
[Source=TextLoader; Binding, Kind=Trace] Channel started
[Source=TextLoader; ParseStats, Kind=Trace] Channel started
[Source=TextLoader; ParseStats, Kind=Trace] Channel finished. Elapsed 00:00:00.0050095.
[Source=TextLoader; ParseStats, Kind=Trace] Channel disposed
[Source=TextLoader; Binding, Kind=Trace] Channel finished. Elapsed 00:00:00.0168461.
[Source=TextLoader; Binding, Kind=Trace] Channel disposed
'ImageClassificationNetCore.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Runtime.CompilerServices.Unsafe.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'ImageClassificationNetCore.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Text.RegularExpressions.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
The thread 0x11e8 has exited with code 0 (0x0).
[Source=TextLoader; ParseStats, Kind=Trace] Channel started
[Source=TextLoader; Cursor, Kind=Trace] Channel started
[Source=Shuffle; Cursor, Kind=Trace] Channel started

codemzs commented 4 years ago

@bpstark is investigating this. Do you observe this on CPU or GPU? I know you mentioned CPU but just wanted to be sure you weren't using GPU.

mveeris commented 4 years ago

I observed it on CPU, but I have tried training on both CPU and GPU, and the behaviour is exactly the same. I also tried on a separate computer and it is the same there too.

bpstark commented 4 years ago

@mveeris Can you please provide the dataset you are using, as well as the pipeline you created to train with? I tried a larger dataset (cifar-10) and could not reproduce the issue. I ran the following example with no issues: https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/MulticlassClassification/ImageClassification/LearningRateSchedulingCifarResnetTransferLearning.cs

mveeris commented 4 years ago

Here is my dataset: https://1drv.ms/u/s!AiEAI70qxDbMg-9r5-7imSSfdFnnAQ?e=Zw9esT
Here is the training code: https://1drv.ms/u/s!AiEAI70qxDbMhJBy1nVfdeNZimPvAw?e=61b2FF
I tried to run the sample you provided with my data, and it also stopped. I have now also run it with cifar-10, with the same result. Last lines of the log output:

Phase: Bottleneck Computation, Dataset used: Validation, Image Index: 9998
Phase: Bottleneck Computation, Dataset used: Validation, Image Index: 9999
Phase: Bottleneck Computation, Dataset used: Validation, Image Index: 10000
'ImageClassificationNetCore.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Runtime.CompilerServices.Unsafe.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
'ImageClassificationNetCore.exe' (CoreCLR: clrhost): Loaded 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.0.0\System.Text.RegularExpressions.dll'. Skipped loading symbols. Module is optimized and the debugger option 'Just My Code' is enabled.
The thread 0x64b8 has exited with code 0 (0x0).

bpstark commented 4 years ago

@mveeris Can you give the specs for the hardware you are running this on? I copied your training code, and was able to train without any issue. So far I am unable to reproduce the issue you are seeing.

mveeris commented 4 years ago

GPU: GTX 1060 Notebook
CPU: i7-6700HQ
CUDA Version: 10.0.130
RAM: 16 GB

bpstark commented 4 years ago

@mveeris Interesting. Odd request: have you tried hitting the Enter key a few times in the output console when it hangs? I have seen issues with the console not continuing until a newline gets flushed to it.
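One way to tell a real training hang apart from stalled console output is to send ML.NET's log messages somewhere other than the console. The following is a minimal sketch, assuming the `Microsoft.ML` NuGet package is referenced; the `TrainingLog` class and the log file path are hypothetical names chosen for illustration, not something used in this thread.

```csharp
using System;
using System.IO;
using Microsoft.ML;

// Hypothetical helper: not part of ML.NET or of the code discussed in this issue.
public static class TrainingLog
{
    // Log events may be raised from several threads, so file writes are serialized.
    private static readonly object Gate = new object();

    // Subscribes to MLContext.Log and appends every message to the given file.
    public static void WriteToFile(MLContext mlContext, string path)
    {
        mlContext.Log += (sender, e) =>
        {
            lock (Gate)
            {
                File.AppendAllText(
                    path,
                    $"{DateTime.Now:O} {e.Message}{Environment.NewLine}");
            }
        };
    }
}
```

Calling `TrainingLog.WriteToFile(mlContext, "training.log")` before `Fit` would let you watch the file: if new lines keep arriving while the console looks frozen, the trainer is still running and only the console output is blocked.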

mveeris commented 4 years ago

It works now. Thanks to your last comment I figured out that I don't have a console, so I created a console application and it works. For training I had initially made a .NET Core WinForms application. I didn't know there was any difference, since it is also .NET Core, but the problem seems to be related to this. I had seen the same behaviour on three different computers. In the initial solution it also took time to get as far as training at all. There was some kind of problem when I shuffled the input rows: I had to raise shufflePoolSize to a very high value, or it just stopped loading image files during the transform and hung the same way as in training.
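For reference, the shuffle workaround mentioned above could look roughly like the sketch below. It assumes the `Microsoft.ML` NuGet package; `ImageRow` and `ShuffleSketch` are hypothetical names, and sizing the pool to the whole dataset is one interpretation of "a very high value". The thread does not establish why a small pool stalled, so treat this as an illustration rather than a confirmed fix.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;

// Hypothetical input row: an image path and its label.
public class ImageRow
{
    public string ImagePath { get; set; }
    public string Label { get; set; }
}

// Hypothetical helper, not part of ML.NET.
public static class ShuffleSketch
{
    public static IDataView ShuffleAll(MLContext mlContext, IReadOnlyList<ImageRow> rows)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Use a pool at least as large as the dataset (never below the default of 1000),
        // so every row can sit in the shuffle pool at once.
        int poolSize = Math.Max(rows.Count, 1000);

        return mlContext.Data.ShuffleRows(data, seed: 1, shufflePoolSize: poolSize);
    }
}
```

A larger pool costs more memory, which is usually acceptable here because the rows hold only paths and labels, not the image bytes.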

Thanks a lot for thinking and helping :)

bpstark commented 4 years ago

No problem, glad we could get it working. I am going to close this issue since everything seems to be working. Feel free to open a new issue for any future problems.