dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Training with object detection, Loss: NaN #7027

Open Cilouche opened 5 months ago

Cilouche commented 5 months ago

Hello,

I'm working on object detection with COCO data (two datasets: 1,855 images and 58,661 images). After some time I receive Loss: NaN.

What could the problem be, given that I've tested different datasets?

Any suggestions, please? Thanks.

System: Windows. Training on CPU. Batch size: 5 to 10 images.

Update on 2024/02/23 (from @LittleLittleCloud)

@LittleLittleCloud reproduced this on the following dataset using mlnet 16.18.2 on CPU, with batch size 10 and epoch 1. On GPU the training succeeds.

First, install mlnet-win-x64 16.18.2:

dotnet tool install --global mlnet-win-x64 --version 16.18.2

Then, kick off the training:

mlnet object-detection --dataset /path/to/coco.json --device cpu --epoch 1
LittleLittleCloud commented 5 months ago

@michaelgsharp my best guess is that there's an overflow when calculating the focal loss?

https://github.com/dotnet/machinelearning/blob/902102e23d9bd825c44f203390801d7cc5d0275f/src/Microsoft.ML.TorchSharp/Loss/FocalLoss.cs#L37
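For context on why a focal loss can go NaN: in float arithmetic, a fully saturated prediction makes log(p) diverge, and once an inf reaches the optimizer state, operations like inf - inf degrade to NaN and poison every later batch. Here is a rough Python sketch of that failure mode (an illustration of the general mechanism, not the actual TorchSharp `FocalLoss` code; the `eps` clamp is a common guard, not something the linked implementation is confirmed to do or omit):

```python
import math

def focal_loss(p, gamma=2.0, eps=0.0):
    """Focal loss for a positive example: -(1 - p)^gamma * log(p).

    With eps=0, a saturated prediction (p == 0.0) diverges; clamping
    p into [eps, 1 - eps] is the usual guard against that.
    """
    if eps > 0:
        p = min(max(p, eps), 1.0 - eps)
    # Mimic IEEE float semantics, where log(0) evaluates to -inf:
    log_p = math.log(p) if p > 0.0 else float("-inf")
    return -((1.0 - p) ** gamma) * log_p

print(focal_loss(0.0))            # inf: the loss itself blows up
print(focal_loss(0.0, eps=1e-6))  # ~13.8: finite once p is clamped

# Once an inf reaches a weight, arithmetic like inf - inf yields NaN,
# and NaN then propagates through every subsequent loss computation:
w = float("inf")
print(w - w)  # nan
```

This also fits the reported symptom that larger datasets and more epochs trigger the issue: the more steps taken, the more likely some prediction saturates.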

Cilouche commented 4 months ago

Is there a solution or suggestion, please?

LittleLittleCloud commented 4 months ago

@Cilouche which COCO dataset are you using? Could you share a link?

Cilouche commented 4 months ago

Data: https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23

I used this site https://drive.google.com/drive/folders/1-dQPRdQ-MRp6mrPhnpng5pcgJsTZMg23 to convert them to COCO format.

LittleLittleCloud commented 4 months ago

@Cilouche which site? The site link seems to be the same as the data link you shared.

Cilouche commented 4 months ago

Yes, sorry. The site: https://roboflow.com/convert/pascal-voc-xml-to-coco-json?ref=blog.roboflow.com

I've also noticed that the NaN loss only appears once the dataset is large. For example: with 100 images and epoch = 8, everything is fine, except that the precision is low (~0.69).

But from roughly 1,200 images onward, at epochs 5, 8, and 11, the losses rapidly converge to NaN.

LittleLittleCloud commented 4 months ago

Update

I reproduced it on my second training run, thanks.

Original post

Hey @Cilouche, some updates here: I can't reproduce the NaN loss error using your dataset on the latest Model Builder main branch. Maybe it's already been fixed.

We haven't released Model Builder yet, but you can verify the latest bits in the mlnet CLI (> 16.18.2) by installing mlnet-win-x64 and trying object detection there. The mlnet CLI and Model Builder share the same AutoML service, so if you don't see the NaN error from the mlnet CLI, you probably won't see it from Model Builder either.

Steps to verify

Cilouche commented 4 months ago

Any suggestion or solution to work around this problem, please? Thanks.

LittleLittleCloud commented 4 months ago

Try a smaller batch size, maybe 1?

And since GPU training doesn't produce the NaN loss, is training on GPU also an option for you?

Cilouche commented 4 months ago

It works on GPU, thanks.