Open abhinavrawat27 opened 1 year ago
Hi @abhinavrawat27, you have 16GB of GPU memory. Have you tried increasing the batch size?
Also, I don't typically train for more than 100 epochs with SSD-Mobilenet on datasets of that size, because of the risk of overfitting and the overall diminishing returns. The learning rate scheduler and related settings may need to be adjusted, but I'm not sure.
Hi @dusty-nv, I will try increasing the batch size, but can you explain a bit more about the second point you made? How can we adjust the learning rate scheduler, and what values should we use? Thanks.
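One thing worth checking in the log below is that the run uses scheduler='cosine' with t_max=100 while num_epochs=300. The following is a minimal sketch of the closed-form cosine annealing schedule described in the PyTorch CosineAnnealingLR documentation, assuming the scheduler is stepped once per epoch and eta_min is left at 0 (neither is confirmed by this thread). The function name cosine_annealing_lr is made up for illustration.

```python
import math

def cosine_annealing_lr(epoch, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing: learning rate at a given epoch.

    Assumes one scheduler step per epoch and eta_min=0 by default;
    both are assumptions, not taken from the training script.
    """
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max)) / 2

base_lr = 0.01     # lr from the Namespace line in the log
t_max = 100        # t_max from the same line
num_epochs = 300   # num_epochs from the same line

# The schedule reaches its minimum at epoch t_max, then climbs back up,
# because the cosine has a period of 2 * t_max epochs.
for epoch in range(0, num_epochs + 1, 50):
    print(f"epoch {epoch:3d}: lr = {cosine_annealing_lr(epoch, base_lr, t_max):.6f}")
```

Under these assumptions the learning rate falls to its minimum around epoch 100 and returns to 0.01 around epoch 200, so setting t_max equal to num_epochs (or training for only t_max epochs) would give a single smooth decay instead.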
Hi @dusty-nv, I used a batch size of 32 and it still feels very slow.
2023-03-26 08:10:34 - Using CUDA...
2023-03-26 08:10:34 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=32, checkpoint_folder='models/pincher', dataset_type='open_images', datasets=['data/pincher240323'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=300, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2023-03-26 08:10:42 - model resolution 300x300
2023-03-26 08:10:42 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - Prepare training datasets.
2023-03-26 08:10:42 - loading annotations from: data/pincher240323/sub-train-annotations-bbox.csv
2023-03-26 08:11:22 - Using CUDA...
2023-03-26 08:11:22 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=32, checkpoint_folder='models/pincher', dataset_type='voc', datasets=['data/pincher240323'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=300, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2023-03-26 08:11:26 - model resolution 300x300
2023-03-26 08:11:26 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - Prepare training datasets.
2023-03-26 08:11:26 - VOC Labels read from file: ('BACKGROUND', 'pincher')
2023-03-26 08:11:26 - Stored labels into file models/pincher/labels.txt.
2023-03-26 08:11:26 - Train dataset size: 614
2023-03-26 08:11:26 - Prepare Validation datasets.
2023-03-26 08:11:26 - VOC Labels read from file: ('BACKGROUND', 'pincher')
2023-03-26 08:11:26 - Validation dataset size: 108
2023-03-26 08:11:26 - Build network.
2023-03-26 08:11:26 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2023-03-26 08:11:26 - Took 0.12 seconds to load the model.
2023-03-26 08:11:26 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2023-03-26 08:11:26 - Uses CosineAnnealingLR scheduler.
2023-03-26 08:11:26 - Start training from epoch 0.
2023-03-26 08:12:02 - Epoch: 0, Step: 10/20, Avg Loss: 12.8133, Avg Regression Loss 8.0553, Avg Classification Loss: 4.7580
2023-03-26 08:12:26 - Epoch: 0, Training Loss: 10.2453, Training Regression Loss 6.4478, Training Classification Loss: 3.7974
2023-03-26 08:12:30 - Epoch: 0, Validation Loss: 7.7618, Validation Regression Loss 4.6914, Validation Classification Loss: 3.0704
2023-03-26 08:12:30 - Saved model models/pincher/mb1-ssd-Epoch-0-Loss-7.761813521385193.pth
2023-03-26 08:13:01 - Epoch: 1, Step: 10/20, Avg Loss: 7.8498, Avg Regression Loss 4.3744, Avg Classification Loss: 3.4754
2023-03-26 08:13:24 - Epoch: 1, Training Loss: 6.9818, Training Regression Loss 3.8846, Training Classification Loss: 3.0972
2023-03-26 08:13:27 - Epoch: 1, Validation Loss: 7.2093, Validation Regression Loss 4.0637, Validation Classification Loss: 3.1456
2023-03-26 08:13:28 - Saved model models/pincher/mb1-ssd-Epoch-1-Loss-7.209309101104736.pth
2023-03-26 08:13:59 - Epoch: 2, Step: 10/20, Avg Loss: 6.7346, Avg Regression Loss 3.5297, Avg Classification Loss: 3.2049
2023-03-26 08:14:23 - Epoch: 2, Training Loss: 5.9695, Training Regression Loss 3.1388, Training Classification Loss: 2.8307
2023-03-26 08:14:26 - Epoch: 2, Validation Loss: 5.3872, Validation Regression Loss 2.8208, Validation Classification Loss: 2.5664
2023-03-26 08:14:26 - Saved model models/pincher/mb1-ssd-Epoch-2-Loss-5.3872374296188354.pth
2023-03-26 08:14:55 - Epoch: 3, Step: 10/20, Avg Loss: 6.1297, Avg Regression Loss 3.2159, Avg Classification Loss: 2.9138
2023-03-26 08:15:21 - Epoch: 3, Training Loss: 5.4523, Training Regression Loss 2.8301, Training Classification Loss: 2.6222
2023-03-26 08:15:24 - Epoch: 3, Validation Loss: 5.7545, Validation Regression Loss 2.8110, Validation Classification Loss: 2.9436
2023-03-26 08:15:24 - Saved model models/pincher/mb1-ssd-Epoch-3-Loss-5.754525423049927.pth
2023-03-26 08:15:59 - Epoch: 4, Step: 10/20, Avg Loss: 5.8826, Avg Regression Loss 2.9114, Avg Classification Loss: 2.9712
2023-03-26 08:16:23 - Epoch: 4, Training Loss: 5.2831, Training Regression Loss 2.6439, Training Classification Loss: 2.6392
2023-03-26 08:16:27 - Epoch: 4, Validation Loss: 4.7360, Validation Regression Loss 2.2415, Validation Classification Loss: 2.4945
2023-03-26 08:16:27 - Saved model models/pincher/mb1-ssd-Epoch-4-Loss-4.73599910736084.pth
2023-03-26 08:16:58 - Epoch: 5, Step: 10/20, Avg Loss: 5.4179, Avg Regression Loss 2.6786, Avg Classification Loss: 2.7393
2023-03-26 08:17:19 - Epoch: 5, Training Loss: 4.9277, Training Regression Loss 2.4470, Training Classification Loss: 2.4807
You will notice that epoch 0 finishes in 30 sec, but it then takes another 30 sec to start epoch 1. I am now wondering whether something is wrong with this AWS VM.
"but it then takes another 30sec to start epoch 1"
Is it taking 30 seconds to save the model? Normally it doesn't take that long.
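The timestamps in the log above can answer this directly. Here is a small sketch that computes the gap between consecutive log events for epoch 0 and the start of epoch 1. The timestamps are copied from the log; the short labels and the helper name gaps_in_seconds are made up for illustration.

```python
from datetime import datetime

# (timestamp, label) pairs; timestamps copied from the training log,
# labels shortened by hand for readability.
events = [
    ("2023-03-26 08:11:26", "start training from epoch 0"),
    ("2023-03-26 08:12:02", "epoch 0, step 10/20"),
    ("2023-03-26 08:12:26", "epoch 0, training loss reported"),
    ("2023-03-26 08:12:30", "epoch 0, validation loss reported"),
    ("2023-03-26 08:12:30", "epoch 0, model saved"),
    ("2023-03-26 08:13:01", "epoch 1, step 10/20"),
    ("2023-03-26 08:13:24", "epoch 1, training loss reported"),
]

def gaps_in_seconds(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    fmt = "%Y-%m-%d %H:%M:%S"
    times = [datetime.strptime(ts, fmt) for ts, _ in events]
    return [
        (events[i][1], int((times[i] - times[i - 1]).total_seconds()))
        for i in range(1, len(events))
    ]

for label, seconds in gaps_in_seconds(events):
    print(f"{seconds:3d} s  ->  {label}")
```

Read this way, the "Step: 10/20" line marks the halfway point of an epoch rather than its end: the 20 training steps take roughly a minute in total, validation takes about 4 seconds, and the checkpoint is written within the same second. The log timestamps have one-second resolution, so these figures are approximate.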
Hi,
I have set up and installed CUDA 11.6 on a Tesla T4 GPU on an AWS machine. Below is the screenshot of nvidia-smi.
I am training a model on around 700 images with a batch size of 2 and 16 workers. One epoch takes around a minute to complete. Here is the log from the training.
At this speed, completing 500 epochs would take more than 8 hours, which feels a bit slow given that we have a very good GPU. Is there any way to speed up the training, or are we missing something in our setup? Please suggest.
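The estimate above can be checked with a few lines of arithmetic. This is a rough sketch, assuming the last partial batch is kept (not dropped) and that the per-epoch time stays constant; the helper names are made up for illustration.

```python
import math

def steps_per_epoch(num_images, batch_size):
    """Number of optimizer steps per epoch, assuming the last partial batch is kept."""
    return math.ceil(num_images / batch_size)

def total_hours(num_epochs, seconds_per_epoch):
    """Total wall-clock training time in hours at a constant per-epoch time."""
    return num_epochs * seconds_per_epoch / 3600

# "Around 700 images" with a batch size of 2, as described in this post.
print(steps_per_epoch(700, 2))

# 614 training images with a batch size of 32, as in the later log,
# which reports "Step: 10/20".
print(steps_per_epoch(614, 32))

# 500 epochs at roughly one minute each.
print(round(total_hours(500, 60), 2))
```

With a batch size of 2, each epoch consists of hundreds of very small steps, which tends to leave a GPU underused. The number of steps drops sharply with a larger batch, although the later log in this thread suggests the wall-clock time per epoch did not drop with it.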
Here is the output of gpustat. GPU utilization is always below 20%, so it looks like we are not fully utilizing the hardware.
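Low GPU utilization often means the training loop spends most of its time waiting for data rather than computing, though that is only a hypothesis here. One way to test it is to time the two phases separately. Below is a minimal, framework-agnostic sketch; profile_loop, slow_loader, and fast_step are hypothetical names, and the sleeps merely simulate a slow input pipeline and a fast training step.

```python
import time

def profile_loop(batches, step_fn):
    """Return (seconds spent waiting for batches, seconds spent inside step_fn)."""
    wait = compute = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)   # time spent here is data loading / preprocessing
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)               # time spent here is the training step
        t2 = time.perf_counter()
        wait += t1 - t0
        compute += t2 - t1
    return wait, compute

def slow_loader(num_batches, delay=0.02):
    """Simulated data loader that takes `delay` seconds to produce each batch."""
    for i in range(num_batches):
        time.sleep(delay)
        yield i

def fast_step(batch, delay=0.005):
    """Simulated training step that takes `delay` seconds."""
    time.sleep(delay)

wait, compute = profile_loop(slow_loader(5), fast_step)
print(f"waiting for data: {wait:.3f} s, computing: {compute:.3f} s")
```

To use this with a real PyTorch loop, pass the DataLoader as batches and wrap the forward, backward, and optimizer step in step_fn. Because CUDA kernels run asynchronously, step_fn should call torch.cuda.synchronize() before returning, otherwise the compute time will be underreported. If waiting dominates, the input pipeline (image decoding, augmentation, disk reads, number of workers) is the likely bottleneck rather than the GPU.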