Hi @dusty-nv, I'm trying to run train_ssd.py on a dataset downloaded from Open Images (python3 open_images_downloader.py --max-images=500 --class-names "Apple,Orange,Banana,Strawberry,Grape,Pear,Pineapple,Watermelon" --data=data/fruit).
This is the output I get. Can you tell what's wrong with it? Thanks in advance.
python3 train_ssd.py --data=data/fruit --model-dir=models/fruit --batch-size=1 --num-workers=1 --epochs=1
2024-04-15 10:05:38 - Using CUDA...
2024-04-15 10:05:38 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/fruit', dataset_type='open_images', datasets=['data/fruit'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-04-15 10:06:45 - model resolution 300x300
2024-04-15 10:06:45 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-04-15 10:06:45 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-04-15 10:06:45 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-04-15 10:06:45 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-04-15 10:06:45 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-04-15 10:06:45 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-04-15 10:06:51 - Prepare training datasets.
2024-04-15 10:06:51 - loading annotations from: data/fruit/sub-train-annotations-bbox.csv
2024-04-15 10:06:52 - annotations loaded from: data/fruit/sub-train-annotations-bbox.csv
num images: 404
2024-04-15 10:06:54 - Dataset Summary:Number of Images: 404
Minimum Number of Images for a Class: -1
Label Distribution:
Apple: 261
Banana: 113
Grape: 136
Orange: 599
Pear: 191
Pineapple: 47
Strawberry: 550
Watermelon: 50
2024-04-15 10:06:54 - Stored labels into file models/fruit/labels.txt.
2024-04-15 10:06:54 - Train dataset size: 404
2024-04-15 10:06:54 - Prepare Validation datasets.
2024-04-15 10:06:54 - loading annotations from: data/fruit/sub-test-annotations-bbox.csv
2024-04-15 10:06:54 - annotations loaded from: data/fruit/sub-test-annotations-bbox.csv
num images: 73
2024-04-15 10:06:55 - Dataset Summary:Number of Images: 73
Minimum Number of Images for a Class: -1
Label Distribution:
Apple: 11
Banana: 9
Grape: 21
Orange: 62
Pear: 6
Pineapple: 10
Strawberry: 73
Watermelon: 11
2024-04-15 10:06:55 - Validation dataset size: 73
2024-04-15 10:06:55 - Build network.
2024-04-15 10:06:58 - Init from pretrained SSD models/mobilenet-v1-ssd-mp-0_675.pth
2024-04-15 10:07:01 - Took 2.97 seconds to load the model.
2024-04-15 10:07:02 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2024-04-15 10:07:02 - Uses CosineAnnealingLR scheduler.
2024-04-15 10:07:02 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
warnings.warn(warning.format(ret))
Traceback (most recent call last):
File "train_ssd.py", line 406, in <module>
train(train_loader, net, criterion, optimizer, device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
File "train_ssd.py", line 149, in train
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
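The last line of the error suggests rerunning with CUDA_LAUNCH_BLOCKING=1, which makes kernel launches synchronous so the traceback points at the call that actually failed. A minimal sketch of that rerun, assuming the same command as above; the helper names (blocking_env, build_args, rerun_with_blocking) are made up for illustration and are not part of train_ssd.py:

```python
import os
import shlex
import subprocess

# Same training command as in the post above.
TRAIN_CMD = (
    "python3 train_ssd.py --data=data/fruit --model-dir=models/fruit "
    "--batch-size=1 --num-workers=1 --epochs=1"
)


def blocking_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def build_args(cmd=TRAIN_CMD):
    """Split the command string into an argument list for subprocess."""
    return shlex.split(cmd)


def rerun_with_blocking(cmd=TRAIN_CMD):
    """Run the training command with synchronous CUDA launches.

    Hypothetical helper: it only wraps subprocess.run and returns
    the exit code of the training script.
    """
    result = subprocess.run(build_args(cmd), env=blocking_env(), check=False)
    return result.returncode
```

Calling rerun_with_blocking() from the directory that contains train_ssd.py should be equivalent to prefixing the original command with CUDA_LAUNCH_BLOCKING=1 in the shell.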