dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo

MobileNet-v1 transfer learning LOSS = INF at every save. #814

Closed: pradhunmya01 closed this issue 1 year ago

pradhunmya01 commented 3 years ago

Hi @dusty-nv, I am also getting this issue, but in my case it appears when saving the model: every checkpoint is saved as mb1-ssd-Epoch-0-Loss-inf.pth, mb1-ssd-Epoch-1-Loss-inf.pth, mb1-ssd-Epoch-2-Loss-inf.pth, and so on. [Screenshot from 2020-11-17 16-09-58]

dusty-nv commented 3 years ago

Hmm, I wonder if there are some corrupt or incorrect annotations in your dataset?

You may need to drill down and determine which image(s) are throwing it off. Here are some techniques which can help you narrow it down:

Setting --debug-steps=1 should let you see which batch is causing the INF overflow, and the debug print will tell you which images were loaded for that batch so you can check them. Setting shuffle=False will make the batches run in the same order each time.
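
With shuffling disabled the batches come straight from the image-set list in order, so a batch number maps directly to a slice of that list. As a rough sketch (this assumes the loader reads the IDs from ImageSets/Main/trainval.txt top to bottom, which may differ in your setup):

# batch_ids.py, sketch only: list the image IDs that would land in a given batch
# when the dataset is iterated in file order (i.e. shuffle=False).
# Usage: python3 batch_ids.py ImageSets/Main/trainval.txt <batch_index> <batch_size>
import sys

imageset_file = sys.argv[1]        # e.g. ImageSets/Main/trainval.txt (assumed file name)
batch_index   = int(sys.argv[2])   # batch number printed with --debug-steps=1 (assumed zero-based)
batch_size    = int(sys.argv[3])   # the batch size used for training

with open(imageset_file) as f:
    ids = [line.strip() for line in f if line.strip()]

start = batch_index * batch_size
for image_id in ids[start:start + batch_size]:
    print(image_id)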

dusty-nv commented 3 years ago

There are probably other images in that batch that cause the INF overflow. For example, if the batch size is 8, 8 images get loaded for each batch. You could set the batch size to 2 to narrow it down to 2 images.
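
For example, the debugging run could look something like this (paths are placeholders, and I am assuming the usual --batch-size flag):

python3 train_ssd.py --dataset-type=voc --data=/path/to/your/dataset --model-dir=/path/to/your/model --batch-size=2 --debug-steps=1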


From: pradhunmya01 notifications@github.com

Thank you @dusty-nv for the great suggestion. I tried the same process and got the annotation file, but I don't see any problem with the annotation file or the image, please check:

This is the annotation file, and the image looks fine as well.

<annotation>
  <folder>input-cropped-shelf-images-wba-03406-c003</folder>
  <filename>1470-wba03406000c003-1600206461263-middle-shelf-1.jpg</filename>
  <path>D:\Megha\Object Tagging_Megha\input-cropped-shelf-images-wba-03406-c003\1470-wba03406000c003-1600206461263-middle-shelf-1.jpg</path>
  <source><database>Unknown</database></source>
  <size><width>900</width><height>400</height><depth>3</depth></size>
  <segmented>0</segmented>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>23</xmin><ymin>148</ymin><xmax>146</xmax><ymax>380</ymax></bndbox>
  </object>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>173</xmin><ymin>148</ymin><xmax>296</xmax><ymax>377</ymax></bndbox>
  </object>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>300</xmin><ymin>144</ymin><xmax>412</xmax><ymax>379</ymax></bndbox>
  </object>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>412</xmin><ymin>148</ymin><xmax>523</xmax><ymax>379</ymax></bndbox>
  </object>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>528</xmin><ymin>148</ymin><xmax>641</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>671</xmin><ymin>150</ymin><xmax>780</xmax><ymax>364</ymax></bndbox>
  </object>
  <object>
    <name>bottle</name><pose>Unspecified</pose><truncated>0</truncated><difficult>0</difficult>
    <bndbox><xmin>788</xmin><ymin>150</ymin><xmax>895</xmax><ymax>370</ymax></bndbox>
  </object>
</annotation>


pradhunmya01 commented 3 years ago

Yes, I set the batch size to 1 and checked that annotation and image, and they seem fine. [Screenshot from 2020-11-25 21-14-55]

can you see anything wrong here? @dusty-nv

pradhunmya01 commented 3 years ago

Another issue is that it is showing this error, even though the bndbox values are present in the annotations. [Screenshot from 2020-11-25 21-19-37]

dusty-nv commented 3 years ago

can you see anything wrong here? @dusty-nv

I don't exactly see what is wrong, but can you try removing that ImageID from the ImageSet files under ImageSets/Main/*.txt? This will make it so that particular image isn't loaded. Do you still get the error?
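
For example, using the image from the annotation you posted above, something like this would drop its ID from the image-set lists (back up the files first, and adjust the path if your set files are named differently):

cd /path/to/your/dataset
sed -i '/1470-wba03406000c003-1600206461263-middle-shelf-1/d' ImageSets/Main/*.txt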

Which annotation tool did you use to create the dataset?

pradhunmya01 commented 3 years ago

can you see anything wrong here? @dusty-nv

I don't exactly see what is wrong, but can you try removing that ImageID from the ImageSet files under ImageSets/Main/*.txt? This will make it so that particular image isn't loaded. Do you still get the error?

Which annotation tool did you use to create the dataset?

I used LabelImg for the annotations, and yes, I am still getting the errors with other images. I removed that image ID from the ImageSet, but it keeps showing the error on other images.

dusty-nv commented 3 years ago

Is it the INF overflow problem you are still getting, or the too many indices error? Interesting that you were not getting the latter error before.

Can you send me your dataset so I can try it here? There is a holiday coming up here for the rest of this week, but I will try to look at it soon.

pradhunmya01 commented 3 years ago

Thank you so much @dusty-nv for taking the time. Actually, the dataset that gives the INF loss is a different one from this one, but the two datasets are quite similar. Let me share my dataset with you; I am uploading it to Google Drive.

dusty-nv commented 3 years ago

OK gotcha - typically the too many indices error means that there is a malformed object in the XML, or no objects in the XML. I had checked in a fix for it probably a month ago, but perhaps it didn't cover all the corner cases.

I would check the XML file that threw the error when it was being loaded and make sure that it is OK.
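
A quick way to scan the whole Annotations folder is a small script along these lines (just a sketch, not code from the repo; adjust ANNOTATIONS_DIR and EXPECTED_CLASSES for your dataset):

# check_annotations.py, sketch: flag VOC XML files with parse errors, no objects,
# missing or incomplete bndbox entries, degenerate boxes, or unexpected class names.
import glob
import xml.etree.ElementTree as ET

ANNOTATIONS_DIR = 'Annotations'        # adjust to your dataset path
EXPECTED_CLASSES = {'bottle'}          # adjust to the classes in your labels file

for xml_path in sorted(glob.glob(ANNOTATIONS_DIR + '/*.xml')):
    try:
        root = ET.parse(xml_path).getroot()
    except ET.ParseError as e:
        print(xml_path, '-> XML parse error:', e)
        continue
    objects = root.findall('object')
    if not objects:
        print(xml_path, '-> no objects')
    for obj in objects:
        name = obj.findtext('name', default='').strip()
        if name not in EXPECTED_CLASSES:
            print(xml_path, '-> unexpected class name:', repr(name))
        box = obj.find('bndbox')
        if box is None:
            print(xml_path, '-> object has no bndbox')
            continue
        try:
            xmin, ymin, xmax, ymax = (float(box.findtext(tag)) for tag in
                                      ('xmin', 'ymin', 'xmax', 'ymax'))
        except (TypeError, ValueError):
            print(xml_path, '-> bndbox has missing or non-numeric coordinates')
            continue
        if xmax <= xmin or ymax <= ymin:
            print(xml_path, '-> degenerate box:', (xmin, ymin, xmax, ymax))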

pradhunmya01 commented 3 years ago

Yeah, that would be great I guess; the upload is on the way.

pradhunmya01 commented 3 years ago

Hi @dusty-nv, here is the link to the dataset: Dataset Download. Please check it and let me know if you need anything else. Thank you in advance.

pradhunmya01 commented 3 years ago

@dusty-nv I gave you access, please check.

dusty-nv commented 3 years ago

Hi @pradhunmya01, sorry for the delay while I was catching up from the Thanksgiving holiday.

When I tried to run your dataset, I got a bunch of warnings like this:

warning - image 545-wba15196000c004-1600143490978-top-shelf-0 has object with unknown class 'bottlle'
warning - image 625-wba15196000c003-1600143407923-bottom-shelf-0 has object with unknown class 'bottlle'

These were then causing errors while trying to train the model. In your annotations, there were some typos in the class names. I was able to fix these by running the following commands (make a backup copy of your dataset before running these):

cd /path/to/your/dataset/Annotations
sed -i 's/bottlle/bottle/g' *.xml
sed -i 's/bottled/bottle/g' *.xml
sed -i 's/Bottlee/bottle/g' *.xml
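
To double-check afterwards, something like grep -liE 'bottlle|bottled|bottlee' *.xml should print no filenames once the class names are all consistent.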

The model is then able to train. I ran it with the default settings:

python3 train_ssd.py --dataset-type=voc --data=$DATA --model-dir=$MODEL | tee $MODEL/train_log_20201201.txt

The avg loss appears high (but not INF), but the avg classification loss is ok (which is the important one to watch). I will let it run here and see what happens. From browsing your images, it appears to be a challenging dataset.

dusty-nv commented 3 years ago

The avg loss appears high (but not INF)

The loss blew up on the second epoch (NaN). However, I am re-running it now with --learning-rate=0.005 (instead of the default of 0.01) and so far it is more stable (and the losses are all in the normal range now). It seems that, due to the difficulty of the dataset, it requires training with a reduced learning rate.
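
For reference, the re-run is the same command as before with the learning-rate flag added, something like this (the log file name is just a placeholder):

python3 train_ssd.py --dataset-type=voc --data=$DATA --model-dir=$MODEL --learning-rate=0.005 | tee $MODEL/train_log.txt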

pradhunmya01 commented 3 years ago

When I tried to run your dataset, I got a bunch of warnings like this: [...] In your annotations, there were some typos in the class names. I was able to fix these by running the following commands [...]

Thank you so much @dusty-nv, this is really helpful, thanks for your time. Training has started; I'll make sure of these things in the future.

pradhunmya01 commented 3 years ago

The avg loss appears high (but not INF)

The loss blew up on the second epoch (NaN). However, I am re-running it now with --learning-rate=0.005 (instead of the default of 0.01) and so far it is more stable (and the losses are all in the normal range now). It seems that, due to the difficulty of the dataset, it requires training with a reduced learning rate.

Other than the learning rate, what else do you think would decrease the loss with this dataset?