Hmm, I wonder if there are some corrupt or incorrect annotations in your dataset?
You may need to drill down and determine which image(s) are throwing it off. Here are some techniques which can help you narrow it down:
- shuffle=False in train_ssd.py:237
- print() in voc_dataset.py:76
- the --debug-steps=1 option

Setting --debug-steps=1 should let you see which batch is causing the INF overflow. Then that print statement will let you know which images were loaded for that batch, so you can check them. Setting shuffle=False will make it run in the same order each time.
There are probably other images in that batch that cause the INF overflow. For example, if the batch size is 8, there will be 8 images loaded for each batch. You could set the batch size to 2 to narrow it down to 2 images.
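For reference, here is a minimal sketch of how the image IDs map onto batches when shuffle=False, so a failing --debug-steps batch index can be traced back to its images. This is a standalone script, not part of jetson-inference; the dataset path, batch size, and the trainval.txt split name are assumptions you would adjust to your setup.

# debug_batches.py: hedged sketch, not part of jetson-inference.
# With shuffle=False the loader walks the ImageSet in file order, so printing
# the IDs in batch-sized groups maps a failing batch index back to its images.
from pathlib import Path

DATASET = Path("/path/to/your/dataset")              # adjust to your dataset root
BATCH_SIZE = 4                                       # match the --batch-size you train with
IMAGESET = DATASET / "ImageSets/Main/trainval.txt"   # adjust if your split uses train.txt

ids = [line.strip() for line in IMAGESET.read_text().splitlines() if line.strip()]

for start in range(0, len(ids), BATCH_SIZE):
    batch = ids[start:start + BATCH_SIZE]
    print(f"batch {start // BATCH_SIZE}: {batch}")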
Thank you @dusty-nv for the great suggestion. I tried the same process and got the annotation file, but I don't see any problem with the annotation file or the image. Please check; this is the annotation file, and the image is also fine:
<annotation>
	<folder>input-cropped-shelf-images-wba-03406-c003</folder>
	<filename>1470-wba03406000c003-1600206461263-middle-shelf-1.jpg</filename>
	<path>D:\Megha\Object Tagging_Megha\input-cropped-shelf-images-wba-03406-c003\1470-wba03406000c003-1600206461263-middle-shelf-1.jpg</path>
	<source>
		<database>Unknown</database>
	</source>
	<size>
		<width>900</width>
		<height>400</height>
		<depth>3</depth>
	</size>
	<segmented>0</segmented>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>23</xmin> <ymin>148</ymin> <xmax>146</xmax> <ymax>380</ymax> </bndbox>
	</object>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>173</xmin> <ymin>148</ymin> <xmax>296</xmax> <ymax>377</ymax> </bndbox>
	</object>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>300</xmin> <ymin>144</ymin> <xmax>412</xmax> <ymax>379</ymax> </bndbox>
	</object>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>412</xmin> <ymin>148</ymin> <xmax>523</xmax> <ymax>379</ymax> </bndbox>
	</object>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>528</xmin> <ymin>148</ymin> <xmax>641</xmax> <ymax>371</ymax> </bndbox>
	</object>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>671</xmin> <ymin>150</ymin> <xmax>780</xmax> <ymax>364</ymax> </bndbox>
	</object>
	<object>
		<name>bottle</name> <pose>Unspecified</pose> <truncated>0</truncated> <difficult>0</difficult>
		<bndbox> <xmin>788</xmin> <ymin>150</ymin> <xmax>895</xmax> <ymax>370</ymax> </bndbox>
	</object>
</annotation>
Yes, I set the batch size to 1 and checked that annotation and the images, and they look fine.
Can you see anything wrong here? @dusty-nv
Another issue is that it is showing this error, even though the bndbox values are still present in the annotations.
I don't exactly see what is wrong, but can you try removing that ImageID from the ImageSet files under ImageSets/Main/*.txt? This will make it so that particular image isn't loaded. Do you still get the error?
Which annotation tool did you use to create the dataset?
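If it helps, here is a small sketch of that removal step. It is not part of jetson-inference; the dataset path is an assumption, and the image ID below is only the example from the annotation above, so substitute the ID that actually failed.

# remove_image_id.py: hedged sketch.  Drops one image ID from every
# ImageSets/Main/*.txt so that image is no longer loaded.  Back the files up first.
from pathlib import Path

DATASET = Path("/path/to/your/dataset")                       # adjust to your dataset root
BAD_ID = "1470-wba03406000c003-1600206461263-middle-shelf-1"  # example ID only; use the one that failed

for txt in (DATASET / "ImageSets/Main").glob("*.txt"):
    lines = txt.read_text().splitlines()
    kept = [ln for ln in lines if ln.strip() != BAD_ID]
    if len(kept) != len(lines):
        txt.write_text("\n".join(kept) + "\n")
        print(f"removed {BAD_ID} from {txt.name}")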
I used LabelImg for the annotations, and yes, I am still getting the error with other images. I removed that ImageID from the ImageSet, but it keeps showing the error on other images.
Is it the INF overflow problem you are still getting, or the too many indices error? Interesting that you were not getting the latter error before.
Can you send me your dataset so I can try it here? There is a holiday coming up here for the rest of this week, but I will try to look at it soon.
Thank you so much @dusty-nv for your concern. Actually, the dataset with the INF issue is a different one from this one, but the two datasets are very similar. Let me share my dataset with you; I am uploading it to Google Drive.
OK gotcha - typically the too many indices error means that there is a malformed object in the XML, or no objects in the XML. I had checked in a fix for it probably a month ago, but perhaps it didn't cover all the corner cases.
I would check the XML file that threw the error when it tried to be loaded, and make sure that it is OK.
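A quick way to run that check across the whole dataset is something like the sketch below. It is a standalone script, not part of the repo, and the Annotations path is an assumption; it flags XML files with no object entries or with a bndbox missing one of its coordinates.

# check_annotations.py: hedged sketch.  Flags Pascal VOC XML files with no
# <object> entries or with a bndbox missing one of its coordinates.
import xml.etree.ElementTree as ET
from pathlib import Path

ANNOTATIONS = Path("/path/to/your/dataset/Annotations")   # adjust to your dataset

for xml_file in sorted(ANNOTATIONS.glob("*.xml")):
    try:
        root = ET.parse(xml_file).getroot()
    except ET.ParseError as err:
        print(f"{xml_file.name}: unparseable XML ({err})")
        continue
    objects = root.findall("object")
    if not objects:
        print(f"{xml_file.name}: no <object> entries")
    for obj in objects:
        box = obj.find("bndbox")
        if box is None or any(box.find(tag) is None for tag in ("xmin", "ymin", "xmax", "ymax")):
            print(f"{xml_file.name}: malformed bndbox for '{obj.findtext('name', default='?')}'")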
Yeah, that will be great I guess; the upload is on the way.
Hi @dusty-nv, here's the link to the dataset: Dataset Download. Please check and let me know if you need anything else. Thank you in advance.
@dusty-nv I gave you access, please check.
Hi @pradhunmya01 , sorry for the delay while I was catching up from Thanksgiving holiday.
When I tried to run your dataset, I got a bunch of warnings like this:
warning - image 545-wba15196000c004-1600143490978-top-shelf-0 has object with unknown class 'bottlle'
warning - image 625-wba15196000c003-1600143407923-bottom-shelf-0 has object with unknown class 'bottlle'
These were then causing errors while trying to train the model. In your annotations, there were some typos in the class names. I was able to fix these by running the following commands (make a backup copy of your dataset before running these):
cd /path/to/your/dataset/Annotations
sed -i 's/bottlle/bottle/g' *.xml
sed -i 's/bottled/bottle/g' *.xml
sed -i 's/Bottlee/bottle/g' *.xml
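As a follow-up sanity check, something like this sketch can list every class name that appears in the annotations, so remaining typos like 'bottlle' or 'Bottlee' stand out before training. It is standalone (not part of train_ssd.py), and the Annotations path is an assumption.

# list_classes.py: hedged sketch.  Counts the class names across the VOC
# annotations so misspelled classes are easy to spot.
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

ANNOTATIONS = Path("/path/to/your/dataset/Annotations")   # adjust to your dataset

counts = Counter()
for xml_file in ANNOTATIONS.glob("*.xml"):
    root = ET.parse(xml_file).getroot()
    for obj in root.findall("object"):
        counts[obj.findtext("name", default="?").strip()] += 1

for name, total in counts.most_common():
    print(f"{total:6d}  {name}")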
The model then is able to train. I ran it with the default settings:
python3 train_ssd.py --dataset-type=voc --data=$DATA --model-dir=$MODEL | tee $MODEL/train_log_20201201.txt
The avg loss appears high (but not INF), but the avg classification loss is ok (which is the important one to watch). I will let it run here and see what happens. From browsing your images, it appears to be a challenging dataset.
The loss blew up on the second epoch (NaN). However, I am re-running it now with --learning-rate=0.005 (instead of the default of 0.01), and so far it is more stable (and the losses are all in the normal range now). It seems that, due to the difficulty of the dataset, it requires training with a reduced learning rate.
Thank you so much @dusty-nv, this is really helpful, thanks for your time. Training has started; I'll make sure of these things in the future.
What do you think, other than the learning rate, will decrease the loss with this dataset?
Hi @dusty-nv, I am also getting this issue, but in my case it appears while saving the model; the checkpoints are saved like: mb1-ssd-Epoch-0-Loss-inf.pth, mb1-ssd-Epoch-1-Loss-inf.pth, mb1-ssd-Epoch-2-Loss-inf.pth