dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License

Custom object detection model training keeps failing #1806

Open olutsiv opened 8 months ago

olutsiv commented 8 months ago

Hello, I am working on my senior design project, which involves object detection, and I'm having some issues. I keep getting an error when I try to train an object detection model on custom data on my Jetson Orin Nano. I've used the webcam to create a mini test model, and I am able to train that with ~30 images. But when I try to train on a custom dataset, it doesn't work. I've tried what seems like everything and looked over all the posts regarding this, and nothing seems to be working. Here is what I have done and tried so far.

I am using the container, as I was having trouble getting torchvision to work. I have the Pascal VOC directory properly set up with Annotations, JPEGImages, ImageSets, and the labels.txt file. In the ImageSets folder I have a Main folder containing train.txt, test.txt, val.txt, and trainval.txt. I have around 1300 images and 1 class. I tried adjusting the workers and batch size because I thought maybe I was running low on memory, and I tried mounting a 4GB swap, but nothing seems to be working. This is the error I keep getting. I also don't know why it's not finding some images, even though I double- and triple-checked that they are there.

Any help would be very much appreciated. I've been troubleshooting for about 2 days straight now and don't know what to try next. Thank you in advance.

python3 train_ssd.py --dataset-type=voc --data=data/ambulance1 --model-dir=models/ambulance7 --batch-size=4 --workers=2 --epochs=10
2024-03-09 00:06:23 - Using CUDA...
2024-03-09 00:06:23 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/ambulance7', dataset_type='voc', datasets=['data/ambulance1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=10, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-09 00:06:35 - model resolution 300x300
2024-03-09 00:06:35 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-09 00:06:35 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-09 00:06:35 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-09 00:06:35 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-09 00:06:35 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-09 00:06:35 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-09 00:06:35 - Prepare training datasets.
warning - could not find image 3cfdd5e2-generic-ambulance-AMR-a-05142023 - ignoring from dataset
warning - could not find image 4f8f60a7-650a52603d850 - ignoring from dataset
warning - could not find image 4f9b6b9a-Untitled-design-19 - ignoring from dataset
warning - could not find image 5b5b65ce-amr-ambulance-races-past-siren-footage-068750631_iconl - ignoring from dataset
warning - could not find image 5d7070aa-boulder-co-usa-october-13-600nw-2391689537 - ignoring from dataset
warning - could not find image 5e76d71a-ambulance-nighttime-footage-000808944_iconl - ignoring from dataset
warning - could not find image 6a8d7053-7df4093f-cfc8-47b8-b49f-4434677d94fd-ambulance - ignoring from dataset
2024-03-09 00:06:36 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:06:36 - Stored labels into file models/ambulance7/labels.txt.
2024-03-09 00:06:36 - Train dataset size: 193
2024-03-09 00:06:36 - Prepare Validation datasets.
warning - could not find image 3cfdd5e2-generic-ambulance-AMR-a-05142023 - ignoring from dataset
warning - could not find image 4f8f60a7-650a52603d850 - ignoring from dataset
warning - could not find image 4f9b6b9a-Untitled-design-19 - ignoring from dataset
warning - could not find image 5b5b65ce-amr-ambulance-races-past-siren-footage-068750631_iconl - ignoring from dataset
warning - could not find image 5d7070aa-boulder-co-usa-october-13-600nw-2391689537 - ignoring from dataset
warning - could not find image 5e76d71a-ambulance-nighttime-footage-000808944_iconl - ignoring from dataset
warning - could not find image 6a8d7053-7df4093f-cfc8-47b8-b49f-4434677d94fd-ambulance - ignoring from dataset
2024-03-09 00:06:36 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:06:36 - Validation dataset size: 193
2024-03-09 00:06:36 - Build network.
2024-03-09 00:06:36 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-09 00:06:36 - Took 0.69 seconds to load the model.
2024-03-09 00:06:36 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-09 00:06:36 - Uses CosineAnnealingLR scheduler.
2024-03-09 00:06:36 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/Pillow-9.5.0-py3.8-linux-aarch64.egg/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Killed
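
The "could not find image" warnings above could be cross-checked with a small script. This is only a minimal sketch, not the loader's actual lookup logic: the function name `find_missing` and the `IMAGE_EXTENSIONS` list are assumptions made for illustration, and the extensions the training script really accepts may differ. It reports every ID in a split file that has no matching file in JPEGImages or no matching XML in Annotations.

```python
from pathlib import Path

# Assumption: a guess at the extensions a VOC loader might accept.
# Adjust this tuple to match what your training script actually looks for.
IMAGE_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")


def find_missing(dataset_dir, split="trainval"):
    """Return (ids_without_image, ids_without_annotation) for one split file."""
    root = Path(dataset_dir)
    split_file = root / "ImageSets" / "Main" / f"{split}.txt"

    # One image ID per line; strip stray whitespace and skip blank lines.
    ids = [line.strip() for line in split_file.read_text().splitlines() if line.strip()]

    missing_images = []
    missing_annotations = []
    for image_id in ids:
        # An ID counts as present if any accepted extension exists on disk.
        has_image = any(
            (root / "JPEGImages" / (image_id + ext)).is_file()
            for ext in IMAGE_EXTENSIONS
        )
        if not has_image:
            missing_images.append(image_id)
        if not (root / "Annotations" / (image_id + ".xml")).is_file():
            missing_annotations.append(image_id)
    return missing_images, missing_annotations
```

For example, `find_missing("data/ambulance1", "trainval")` would list the IDs whose files are absent under these assumptions. An ID that shows up here even though the picture "is there" may point to a different file extension or a slightly different file name than the ID in the split file.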

olutsiv commented 8 months ago

I get this error when I only included 60 pictures in the data. I put the same 60 picture IDs in train.txt, val.txt, test.txt, and trainval.txt just for testing; I know it should be split roughly 80/10/10. Not sure why I get a different error when there are fewer pictures, but my final project will have 4 classes and roughly 6-8k images. It's worrying me that I'm not able to get even 1 class with 1300 images to work.
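
For reference, the roughly 80/10/10 split mentioned above could be generated like this. It is a sketch under stated assumptions: `split_ids` and `write_splits` are hypothetical helper names, and treating trainval.txt as train plus val follows the usual Pascal VOC convention rather than anything confirmed in this thread.

```python
import random
from pathlib import Path


def split_ids(image_ids, train=0.8, val=0.1, seed=0):
    """Shuffle IDs and split them into train/val/test; trainval = train + val."""
    ids = sorted(set(image_ids))        # de-duplicate and fix the starting order
    random.Random(seed).shuffle(ids)    # seeded shuffle, so the split is repeatable
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    train_ids = ids[:n_train]
    val_ids = ids[n_train:n_train + n_val]
    test_ids = ids[n_train + n_val:]    # whatever is left goes to test
    return {
        "train": train_ids,
        "val": val_ids,
        "test": test_ids,
        "trainval": train_ids + val_ids,
    }


def write_splits(dataset_dir, splits):
    """Write each split to ImageSets/Main/<name>.txt, one ID per line."""
    main_dir = Path(dataset_dir) / "ImageSets" / "Main"
    main_dir.mkdir(parents=True, exist_ok=True)
    for name, ids in splits.items():
        (main_dir / f"{name}.txt").write_text("\n".join(ids) + "\n")
```

The IDs would typically be the annotation file names without the `.xml` extension, for example `[p.stem for p in Path("data/ambulance1/Annotations").glob("*.xml")]`.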

python3 train_ssd.py --dataset-type=voc --data=data/ambulance1 --model-dir=models/ambulance8 --batch-size=4 --workers=2 --epochs=1
2024-03-09 00:54:19 - Using CUDA...
2024-03-09 00:54:19 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/ambulance8', dataset_type='voc', datasets=['data/ambulance1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-09 00:54:31 - model resolution 300x300
2024-03-09 00:54:31 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-09 00:54:31 - Prepare training datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Stored labels into file models/ambulance8/labels.txt.
2024-03-09 00:54:31 - Train dataset size: 60
2024-03-09 00:54:31 - Prepare Validation datasets.
2024-03-09 00:54:31 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-09 00:54:31 - Validation dataset size: 60
2024-03-09 00:54:31 - Build network.
2024-03-09 00:54:31 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-09 00:54:32 - Took 0.69 seconds to load the model.
2024-03-09 00:54:32 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-09 00:54:32 - Uses CosineAnnealingLR scheduler.
2024-03-09 00:54:32 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/Pillow-9.5.0-py3.8-linux-aarch64.egg/PIL/Image.py:992: UserWarning: Palette images with Transparency expressed in bytes should be converted to RGBA images
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 252) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train_ssd.py", line 406, in <module>
    train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 139, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1329, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1295, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/dataloader.py", line 1146, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 252) exited unexpectedly

dusty-nv commented 7 months ago

@olutsiv "Killed" means that the board ran out of memory. Try decreasing to --batch-size 1 and --num-workers 1, mounting swap, etc.
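
To see how much RAM and swap are actually available before starting a run, something like the following could be used. It is a rough sketch that reads the standard Linux `/proc/meminfo` file; the helper names `parse_meminfo` and `memory_summary` are made up for this example, and inside a container the reported values may reflect the host rather than any container limit.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        # Lines look like "MemTotal:  7802864 kB"; keep the numeric part only.
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(meminfo_path="/proc/meminfo"):
    """Return total/available RAM and total/free swap in GiB."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())

    def to_gib(kib):
        return kib / (1024 * 1024)

    return {
        "ram_total_gib": to_gib(info.get("MemTotal", 0)),
        "ram_available_gib": to_gib(info.get("MemAvailable", 0)),
        "swap_total_gib": to_gib(info.get("SwapTotal", 0)),
        "swap_free_gib": to_gib(info.get("SwapFree", 0)),
    }
```

Calling `memory_summary()` on the Jetson right before training gives a quick sanity check that the swap file is really mounted (a `swap_total_gib` of 0 would mean it is not).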

olutsiv commented 7 months ago

I have tried doing all of that with no luck. Could it be that the annotation XML files are wrong? I'm using my own pictures labeled in XML format. At first the format was a little different from the XML files that the webcam labeling software produces, but I wrote a script to edit them and now they are exactly the same. Is it possible for the Nano to run out of memory with only 30 pictures and --batch-size 1, --num-workers 1, and --epochs 1?

I uploaded the XML files that I was using. Maybe you can take a look at them and see if you can spot anything. I would really appreciate any help I can get; my group is kinda stuck right now and we need to get this working in order to finish our senior project. https://drive.google.com/drive/folders/1PQCeKoK-mGdD49nlpCE4eyMbp6KsWg8e?usp=sharing

This is the error I'm getting with the updated XML files: only 28 pictures, and it says killed.
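
One way to rule the annotations in or out is a basic sanity check on each XML file. The sketch below assumes the usual Pascal VOC layout (`size/width`, `size/height`, and `object/name` with a `bndbox` of `xmin`, `ymin`, `xmax`, `ymax`); the function name `check_annotation` is hypothetical, and passing these checks does not prove that the training script will accept the file.

```python
import xml.etree.ElementTree as ET


def check_annotation(xml_text, class_names):
    """Return a list of human-readable problems found in one VOC-style annotation."""
    problems = []
    root = ET.fromstring(xml_text)

    # Declared image size; needed to check that boxes stay inside the image.
    size = root.find("size")
    try:
        width = int(size.findtext("width"))
        height = int(size.findtext("height"))
    except (AttributeError, TypeError, ValueError):
        width = height = None
        problems.append("missing or non-integer <size>")

    objects = root.findall("object")
    if not objects:
        problems.append("no <object> entries")

    for obj in objects:
        name = (obj.findtext("name") or "").strip()
        if name not in class_names:
            problems.append(f"unknown class {name!r}")

        box = obj.find("bndbox")
        try:
            xmin, ymin, xmax, ymax = (
                float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
            )
        except (AttributeError, TypeError, ValueError):
            problems.append(f"{name}: missing or non-numeric <bndbox>")
            continue

        if xmin >= xmax or ymin >= ymax:
            problems.append(f"{name}: empty or inverted box")
        if width is not None and (xmin < 0 or ymin < 0 or xmax > width or ymax > height):
            problems.append(f"{name}: box extends outside the declared image size")
    return problems
```

A possible use is to loop over `Annotations/*.xml`, call `check_annotation(path.read_text(), {"ambulance"})`, and print any file whose result is non-empty.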

python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:02:52 - Using CUDA...
2024-03-28 17:02:52 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:03:00 - model resolution 300x300
2024-03-28 17:03:00 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:03:00 - Prepare training datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Stored labels into file models/EMSdetect/labels.txt.
2024-03-28 17:03:00 - Train dataset size: 28
2024-03-28 17:03:00 - Prepare Validation datasets.
2024-03-28 17:03:00 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:03:00 - Validation dataset size: 28
2024-03-28 17:03:00 - Build network.
2024-03-28 17:03:00 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:03:01 - Took 0.68 seconds to load the model.
2024-03-28 17:03:01 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:03:01 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:03:01 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2024-03-28 17:03:22 - Epoch: 0, Step: 10/28, Avg Loss: 8.0487, Avg Regression Los
root@oleg:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/EMSdetect --model-dir=models/EMSdetect3 --batch-size=1 --workers=1 --epochs=1
2024-03-28 17:12:04 - Using CUDA...
2024-03-28 17:12:04 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/EMSdetect3', dataset_type='voc', datasets=['data/EMSdetect'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-03-28 17:12:15 - model resolution 300x300
2024-03-28 17:12:15 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-03-28 17:12:15 - Prepare training datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Stored labels into file models/EMSdetect3/labels.txt.
2024-03-28 17:12:15 - Train dataset size: 28
2024-03-28 17:12:15 - Prepare Validation datasets.
2024-03-28 17:12:15 - VOC Labels read from file: ('BACKGROUND', 'ambulance')
2024-03-28 17:12:15 - Validation dataset size: 28
2024-03-28 17:12:15 - Build network.
2024-03-28 17:12:15 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-03-28 17:12:16 - Took 0.68 seconds to load the model.
2024-03-28 17:12:16 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-03-28 17:12:16 - Uses CosineAnnealingLR scheduler.
2024-03-28 17:12:16 - Start training from epoch 0.
/usr/local/lib/python3.8/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
Killed

olutsiv commented 7 months ago

I tried to train it with 1 image and it completed successfully, then 2 and it also completed successfully, and 3 completed successfully as well, but once I got to 4 it either gave me the "killed" error or just froze the Jetson completely. It has to have something to do with the pictures or labeling, because I did a test with the webcam labeling software where I labeled and saved around 50 pictures, and it trained on those with no problems. But when I use my own pictures and data, it doesn't work.

dusty-nv commented 7 months ago

What is the resolution of your own pictures? Maybe they are really large and it is keeping them in memory? Did you mount enough swap?

You can also run these PyTorch training scripts on another Linux/GPU machine with more memory, or in Google Colab, I think.



olutsiv commented 7 months ago

Hmmm, yeah, maybe it is the pictures. The resolutions vary; they are not all the same. What's the recommended or max resolution of pictures I should be using?

I wasn't quite sure how much swap I can mount. What would you recommend for an Orin Nano?

dusty-nv commented 7 months ago

For swap, I typically mount the same amount as the board has RAM, so 8GB of swap for the Orin Nano.

I would probably keep the pictures to 1920x1080 resolution or similar. The camera-capture program captures them at 1280x720, and the model downsamples them to 300x300 anyway.
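
A quick way to find pictures above that resolution is to read the size recorded in each annotation. This is only a sketch: it trusts the `size` element in the XML, which may not match the real image file, and the names `fit_within` and `oversized_annotations` are invented for the example. The 1920x1080 default comes from the suggestion above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def fit_within(width, height, max_width=1920, max_height=1080):
    """Return (w, h) scaled down to fit the limit, keeping aspect ratio; never upscales."""
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


def oversized_annotations(annotations_dir, max_width=1920, max_height=1080):
    """List (file name, declared size, suggested size) for annotations above the limit."""
    results = []
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        size = ET.parse(xml_path).getroot().find("size")
        try:
            width = int(size.findtext("width"))
            height = int(size.findtext("height"))
        except (AttributeError, TypeError, ValueError):
            continue  # no usable <size>; skipped here rather than guessed
        if width > max_width or height > max_height:
            results.append(
                (xml_path.name, (width, height), fit_within(width, height, max_width, max_height))
            )
    return results
```

If any images are actually resized to the suggested size, the bounding box coordinates in the matching XML would need to be scaled by the same factor, otherwise the labels no longer line up with the pictures.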

olutsiv commented 7 months ago

Ok thank you so much for that information. I believe the pictures we are using are all around that size or even smaller. I was mounting 4GB but I will try to mount 8 and see what happens. I’m just confused why it wasn’t even able to train 4 pictures.

dusty-nv commented 7 months ago

I'm not sure either since you said it trained fine on what you captured with camera-capture, which would lead one to believe it is related to the dataset

olutsiv commented 7 months ago

Yeah that’s the conclusion I came to too. I think we will try and relabel some of our images with CVAT and run a test model with those and see if it trains properly. If it does then we will just have to relabel all of our images with CVAT.

dusty-nv commented 7 months ago

OK gotcha - I have used CVAT in the past for this. If you have another machine with more memory capable of running PyTorch, you can do the training there too.