dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo

SSD Mobilenet training slow on Tesla T4 GPU #1588

Open abhinavrawat27 opened 1 year ago

abhinavrawat27 commented 1 year ago

Hi

I have set up and installed CUDA 11.6 on a Tesla T4 GPU on an AWS machine. Below is a screenshot of nvidia-smi.

[nvidia-smi screenshot]

I am training a model on around 700 images with a batch size of 2 and 16 workers. One epoch takes around a minute to complete. Here is the training log:

2023-03-24 09:37:15 - Using CUDA...
2023-03-24 09:37:15 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/pincher240323', dataset_type='voc', datasets=['data/pincher240323'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1000, num_workers=16, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2023-03-24 09:37:23 - model resolution 300x300
2023-03-24 09:37:23 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2023-03-24 09:37:23 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2023-03-24 09:37:23 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2023-03-24 09:37:23 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2023-03-24 09:37:23 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2023-03-24 09:37:23 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2023-03-24 09:37:23 - Prepare training datasets.
2023-03-24 09:37:23 - VOC Labels read from file: ('BACKGROUND', 'pincher')
2023-03-24 09:37:23 - Stored labels into file models/pincher240323/labels.txt.
2023-03-24 09:37:23 - Train dataset size: 614
2023-03-24 09:37:23 - Prepare Validation datasets.
2023-03-24 09:37:23 - VOC Labels read from file: ('BACKGROUND', 'pincher')
2023-03-24 09:37:23 - Validation dataset size: 108
2023-03-24 09:37:23 - Build network.
2023-03-24 09:37:23 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2023-03-24 09:37:23 - Took 0.12 seconds to load the model.
2023-03-24 09:37:23 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2023-03-24 09:37:23 - Uses CosineAnnealingLR scheduler.
2023-03-24 09:37:23 - Start training from epoch 0.
2023-03-24 09:37:30 - Epoch: 0, Step: 10/307, Avg Loss: 17.9656, Avg Regression Loss 10.6628, Avg Classification Loss: 7.3028
2023-03-24 09:37:31 - Epoch: 0, Step: 20/307, Avg Loss: 16.6615, Avg Regression Loss 7.2192, Avg Classification Loss: 9.4423
2023-03-24 09:37:32 - Epoch: 0, Step: 30/307, Avg Loss: 14.7038, Avg Regression Loss 8.0701, Avg Classification Loss: 6.6337
2023-03-24 09:37:33 - Epoch: 0, Step: 40/307, Avg Loss: 11.5384, Avg Regression Loss 6.7426, Avg Classification Loss: 4.7958
2023-03-24 09:37:34 - Epoch: 0, Step: 50/307, Avg Loss: 10.4441, Avg Regression Loss 6.5290, Avg Classification Loss: 3.9151
2023-03-24 09:37:37 - Epoch: 0, Step: 60/307, Avg Loss: 8.3874, Avg Regression Loss 4.7442, Avg Classification Loss: 3.6432
2023-03-24 09:37:39 - Epoch: 0, Step: 70/307, Avg Loss: 9.9555, Avg Regression Loss 5.4930, Avg Classification Loss: 4.4624
2023-03-24 09:37:40 - Epoch: 0, Step: 80/307, Avg Loss: 12.3795, Avg Regression Loss 7.5978, Avg Classification Loss: 4.7817
2023-03-24 09:37:42 - Epoch: 0, Step: 90/307, Avg Loss: 15.1853, Avg Regression Loss 8.5252, Avg Classification Loss: 6.6601
2023-03-24 09:37:44 - Epoch: 0, Step: 100/307, Avg Loss: 12.5090, Avg Regression Loss 7.8703, Avg Classification Loss: 4.6387
2023-03-24 09:37:45 - Epoch: 0, Step: 110/307, Avg Loss: 9.1335, Avg Regression Loss 5.7876, Avg Classification Loss: 3.3460
2023-03-24 09:37:47 - Epoch: 0, Step: 120/307, Avg Loss: 9.6612, Avg Regression Loss 6.5748, Avg Classification Loss: 3.0864
2023-03-24 09:37:48 - Epoch: 0, Step: 130/307, Avg Loss: 8.4827, Avg Regression Loss 5.4834, Avg Classification Loss: 2.9993
2023-03-24 09:37:49 - Epoch: 0, Step: 140/307, Avg Loss: 10.2324, Avg Regression Loss 6.7242, Avg Classification Loss: 3.5083
2023-03-24 09:37:52 - Epoch: 0, Step: 150/307, Avg Loss: 7.5305, Avg Regression Loss 4.7914, Avg Classification Loss: 2.7391
2023-03-24 09:37:53 - Epoch: 0, Step: 160/307, Avg Loss: 8.8777, Avg Regression Loss 6.2293, Avg Classification Loss: 2.6484
2023-03-24 09:37:54 - Epoch: 0, Step: 170/307, Avg Loss: 6.9029, Avg Regression Loss 4.2739, Avg Classification Loss: 2.6290
2023-03-24 09:37:55 - Epoch: 0, Step: 180/307, Avg Loss: 8.1169, Avg Regression Loss 5.5383, Avg Classification Loss: 2.5786
2023-03-24 09:37:59 - Epoch: 0, Step: 190/307, Avg Loss: 8.2192, Avg Regression Loss 5.4100, Avg Classification Loss: 2.8092
2023-03-24 09:38:00 - Epoch: 0, Step: 200/307, Avg Loss: 9.2362, Avg Regression Loss 6.1468, Avg Classification Loss: 3.0894
2023-03-24 09:38:02 - Epoch: 0, Step: 210/307, Avg Loss: 8.0101, Avg Regression Loss 5.0695, Avg Classification Loss: 2.9406
2023-03-24 09:38:02 - Epoch: 0, Step: 220/307, Avg Loss: 7.9031, Avg Regression Loss 4.7527, Avg Classification Loss: 3.1504
2023-03-24 09:38:04 - Epoch: 0, Step: 230/307, Avg Loss: 7.8145, Avg Regression Loss 5.0483, Avg Classification Loss: 2.7662
2023-03-24 09:38:05 - Epoch: 0, Step: 240/307, Avg Loss: 6.9392, Avg Regression Loss 3.9858, Avg Classification Loss: 2.9534
2023-03-24 09:38:06 - Epoch: 0, Step: 250/307, Avg Loss: 8.5067, Avg Regression Loss 5.7810, Avg Classification Loss: 2.7257
2023-03-24 09:38:07 - Epoch: 0, Step: 260/307, Avg Loss: 7.0704, Avg Regression Loss 4.5292, Avg Classification Loss: 2.5412
2023-03-24 09:38:08 - Epoch: 0, Step: 270/307, Avg Loss: 7.6444, Avg Regression Loss 5.0744, Avg Classification Loss: 2.5700
2023-03-24 09:38:12 - Epoch: 0, Step: 280/307, Avg Loss: 7.3607, Avg Regression Loss 4.7707, Avg Classification Loss: 2.5900
2023-03-24 09:38:13 - Epoch: 0, Step: 290/307, Avg Loss: 8.7418, Avg Regression Loss 5.9623, Avg Classification Loss: 2.7795
2023-03-24 09:38:14 - Epoch: 0, Step: 300/307, Avg Loss: 7.3707, Avg Regression Loss 4.7500, Avg Classification Loss: 2.6207
2023-03-24 09:38:14 - Epoch: 0, Training Loss: 9.7103, Training Regression Loss 5.9685, Training Classification Loss: 3.7418
2023-03-24 09:38:17 - Epoch: 0, Validation Loss: 7.7446, Validation Regression Loss 5.1529, Validation Classification Loss: 2.5917
2023-03-24 09:38:17 - Saved model models/pincher240323/mb1-ssd-Epoch-0-Loss-7.7445596677285655.pth
2023-03-24 09:38:23 - Epoch: 1, Step: 10/307, Avg Loss: 8.4229, Avg Regression Loss 5.5593, Avg Classification Loss: 2.8636
2023-03-24 09:38:24 - Epoch: 1, Step: 20/307, Avg Loss: 6.9926, Avg Regression Loss 4.4504, Avg Classification Loss: 2.5422
2023-03-24 09:38:25 - Epoch: 1, Step: 30/307, Avg Loss: 7.4343, Avg Regression Loss 4.8034, Avg Classification Loss: 2.6310
2023-03-24 09:38:28 - Epoch: 1, Step: 40/307, Avg Loss: 8.6699, Avg Regression Loss 6.0319, Avg Classification Loss: 2.6380
2023-03-24 09:38:29 - Epoch: 1, Step: 50/307, Avg Loss: 6.1574, Avg Regression Loss 3.7030, Avg Classification Loss: 2.4544
2023-03-24 09:38:30 - Epoch: 1, Step: 60/307, Avg Loss: 6.5575, Avg Regression Loss 4.0316, Avg Classification Loss: 2.5260
2023-03-24 09:38:31 - Epoch: 1, Step: 70/307, Avg Loss: 6.9045, Avg Regression Loss 4.3308, Avg Classification Loss: 2.5737
2023-03-24 09:38:35 - Epoch: 1, Step: 80/307, Avg Loss: 7.3461, Avg Regression Loss 4.7252, Avg Classification Loss: 2.6210
2023-03-24 09:38:35 - Epoch: 1, Step: 90/307, Avg Loss: 7.3962, Avg Regression Loss 4.8551, Avg Classification Loss: 2.5411
2023-03-24 09:38:36 - Epoch: 1, Step: 100/307, Avg Loss: 6.9389, Avg Regression Loss 4.3607, Avg Classification Loss: 2.5781
2023-03-24 09:38:37 - Epoch: 1, Step: 110/307, Avg Loss: 7.1740, Avg Regression Loss 4.4376, Avg Classification Loss: 2.7364
2023-03-24 09:38:39 - Epoch: 1, Step: 120/307, Avg Loss: 7.3468, Avg Regression Loss 4.7516, Avg Classification Loss: 2.5953
2023-03-24 09:38:41 - Epoch: 1, Step: 130/307, Avg Loss: 7.1751, Avg Regression Loss 4.4897, Avg Classification Loss: 2.6854
2023-03-24 09:38:42 - Epoch: 1, Step: 140/307, Avg Loss: 6.6076, Avg Regression Loss 4.0488, Avg Classification Loss: 2.5589
2023-03-24 09:38:43 - Epoch: 1, Step: 150/307, Avg Loss: 6.7330, Avg Regression Loss 4.0907, Avg Classification Loss: 2.6423
2023-03-24 09:38:45 - Epoch: 1, Step: 160/307, Avg Loss: 6.6290, Avg Regression Loss 4.1192, Avg Classification Loss: 2.5098
2023-03-24 09:38:46 - Epoch: 1, Step: 170/307, Avg Loss: 7.1640, Avg Regression Loss 4.5678, Avg Classification Loss: 2.5963
2023-03-24 09:38:49 - Epoch: 1, Step: 180/307, Avg Loss: 7.0627, Avg Regression Loss 4.4285, Avg Classification Loss: 2.6341
2023-03-24 09:38:50 - Epoch: 1, Step: 190/307, Avg Loss: 6.9038, Avg Regression Loss 4.3230, Avg Classification Loss: 2.5808
2023-03-24 09:38:51 - Epoch: 1, Step: 200/307, Avg Loss: 8.2522, Avg Regression Loss 4.8174, Avg Classification Loss: 3.4348
2023-03-24 09:38:52 - Epoch: 1, Step: 210/307, Avg Loss: 7.0344, Avg Regression Loss 4.3906, Avg Classification Loss: 2.6437
2023-03-24 09:38:55 - Epoch: 1, Step: 220/307, Avg Loss: 6.1428, Avg Regression Loss 3.6393, Avg Classification Loss: 2.5036
2023-03-24 09:38:57 - Epoch: 1, Step: 230/307, Avg Loss: 8.4899, Avg Regression Loss 5.6340, Avg Classification Loss: 2.8559
2023-03-24 09:38:58 - Epoch: 1, Step: 240/307, Avg Loss: 7.3100, Avg Regression Loss 4.6599, Avg Classification Loss: 2.6502
2023-03-24 09:38:59 - Epoch: 1, Step: 250/307, Avg Loss: 7.3448, Avg Regression Loss 4.1527, Avg Classification Loss: 3.1921
2023-03-24 09:39:00 - Epoch: 1, Step: 260/307, Avg Loss: 6.5643, Avg Regression Loss 3.9491, Avg Classification Loss: 2.6152
2023-03-24 09:39:02 - Epoch: 1, Step: 270/307, Avg Loss: 7.0943, Avg Regression Loss 4.4077, Avg Classification Loss: 2.6866
2023-03-24 09:39:05 - Epoch: 1, Step: 280/307, Avg Loss: 6.5084, Avg Regression Loss 3.9660, Avg Classification Loss: 2.5425
2023-03-24 09:39:05 - Epoch: 1, Step: 290/307, Avg Loss: 7.2297, Avg Regression Loss 4.6611, Avg Classification Loss: 2.5686
2023-03-24 09:39:06 - Epoch: 1, Step: 300/307, Avg Loss: 7.4744, Avg Regression Loss 4.7703, Avg Classification Loss: 2.7041
2023-03-24 09:39:06 - Epoch: 1, Training Loss: 7.1441, Training Regression Loss 4.4870, Training Classification Loss: 2.6571
2023-03-24 09:39:09 - Epoch: 1, Validation Loss: 6.1573, Validation Regression Loss 3.4564, Validation Classification Loss: 2.7009
2023-03-24 09:39:09 - Saved model models/pincher240323/mb1-ssd-Epoch-1-Loss-6.157338831159803.pth

At this speed, completing 500 epochs would take more than 8 hours, which feels slow given that we have a very capable GPU. Is there any way to speed up the training, or are we perhaps missing something in our setup? Please suggest.

Here is the output of gpustat. GPU utilization is always below 20%, so it looks like we are not fully using the hardware (see the timing sketch after the gpustat output).

[0] Tesla T4         | 47'C,   9 % |  2869 / 15360 MB | ubuntu(2417M)
[0] Tesla T4         | 47'C,  19 % |  2869 / 15360 MB | ubuntu(2417M)
[0] Tesla T4         | 47'C,   0 % |  2869 / 15360 MB | ubuntu(2417M)
[0] Tesla T4         | 48'C,  15 % |  2869 / 15360 MB | ubuntu(2417M)
[0] Tesla T4         | 48'C,  14 % |  2869 / 15360 MB | ubuntu(2417M)
[0] Tesla T4         | 48'C,  20 % |  2869 / 15360 MB | ubuntu(2417M)
[0] Tesla T4         | 48'C,   0 % |  2869 / 15360 MB | ubuntu(2417M)
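
GPU utilization that low usually means the GPU is waiting on the CPU-side data pipeline (disk I/O, JPEG decoding, augmentation) rather than being the bottleneck itself. One way to confirm is to time how long each iteration waits on the DataLoader versus how long the forward/backward pass takes. A minimal sketch of that measurement, using a dummy model and synthetic data rather than the actual train_ssd.py loop:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins, just to illustrate the measurement pattern:
# a tiny conv net and random 300x300 "images".
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

dataset = TensorDataset(torch.randn(320, 3, 300, 300), torch.randint(0, 2, (320,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

wait_time, compute_time = 0.0, 0.0
t_end = time.time()
for images, labels in loader:
    t_data = time.time()
    wait_time += t_data - t_end            # time spent waiting on the DataLoader
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()               # make the GPU work visible to the CPU timer
    t_end = time.time()
    compute_time += t_end - t_data

print(f"data wait: {wait_time:.1f}s   forward/backward: {compute_time:.1f}s")
```

If the data-wait time dominates when the same pattern is applied to the real training loop, a faster GPU will not help; more DataLoader workers, pinned memory, or keeping the dataset on fast local storage will.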
dusty-nv commented 1 year ago

Hi @abhinavrawat27, you have 16GB of GPU memory; have you tried increasing the batch size?

Also, I don't typically train SSD-Mobilenet for more than 100 epochs on datasets of that size, because of the risk of overfitting and overall diminishing returns. The learning rate scheduler and related settings may need to be adjusted, I'm not sure.
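
For context on the scheduler settings: the scheduler, t_max, milestones, and gamma values in the Namespace output above appear to correspond to the standard PyTorch schedulers, so their rough behaviour can be sketched as follows (an illustration of the schedule shapes only, not the exact train_ssd.py wiring):

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, MultiStepLR

def lr_curve(make_scheduler, epochs=120):
    """Record the learning rate per epoch for a given scheduler factory."""
    opt = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.01, momentum=0.9)
    sched = make_scheduler(opt)
    lrs = []
    for _ in range(epochs):
        lrs.append(opt.param_groups[0]["lr"])
        sched.step()
    return lrs

# scheduler='cosine': the LR follows a cosine curve from 0.01 down toward 0
# over t_max epochs (t_max=100 in the log), then climbs back up.
cosine_lrs = lr_curve(lambda opt: CosineAnnealingLR(opt, T_max=100))

# the multi-step option: the LR is multiplied by gamma (0.1) at each
# milestone epoch (80 and 100 in the log).
multistep_lrs = lr_curve(lambda opt: MultiStepLR(opt, milestones=[80, 100], gamma=0.1))

print(cosine_lrs[0], cosine_lrs[50], cosine_lrs[100])            # 0.01, ~0.005, ~0.0
print(multistep_lrs[79], multistep_lrs[80], multistep_lrs[100])  # 0.01, 0.001, 0.0001
```

If the number of epochs is reduced as suggested, it may make sense to keep t_max roughly in line with it so the cosine schedule spans the whole run.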

abhinavrawat27 commented 1 year ago

Hi @dusty-nv, I will try increasing the batch size, but can you explain a bit more about the second point you made? How can we adjust the learning rate scheduler, and to what values? Thanks

abhinavrawat27 commented 1 year ago

Hi @dusty-nv, I used a batch size of 32 and it still feels very slow.

2023-03-26 08:10:34 - Using CUDA...
2023-03-26 08:10:34 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=32, checkpoint_folder='models/pincher', dataset_type='open_images', datasets=['data/pincher240323'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=300, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2023-03-26 08:10:42 - model resolution 300x300
2023-03-26 08:10:42 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2023-03-26 08:10:42 - Prepare training datasets.
2023-03-26 08:10:42 - loading annotations from: data/pincher240323/sub-train-annotations-bbox.csv
2023-03-26 08:11:22 - Using CUDA...
2023-03-26 08:11:22 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=32, checkpoint_folder='models/pincher', dataset_type='voc', datasets=['data/pincher240323'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=300, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2023-03-26 08:11:26 - model resolution 300x300
2023-03-26 08:11:26 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2023-03-26 08:11:26 - Prepare training datasets.
2023-03-26 08:11:26 - VOC Labels read from file: ('BACKGROUND', 'pincher')
2023-03-26 08:11:26 - Stored labels into file models/pincher/labels.txt.
2023-03-26 08:11:26 - Train dataset size: 614
2023-03-26 08:11:26 - Prepare Validation datasets.
2023-03-26 08:11:26 - VOC Labels read from file: ('BACKGROUND', 'pincher')
2023-03-26 08:11:26 - Validation dataset size: 108
2023-03-26 08:11:26 - Build network.
2023-03-26 08:11:26 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2023-03-26 08:11:26 - Took 0.12 seconds to load the model.
2023-03-26 08:11:26 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2023-03-26 08:11:26 - Uses CosineAnnealingLR scheduler.
2023-03-26 08:11:26 - Start training from epoch 0.
2023-03-26 08:12:02 - Epoch: 0, Step: 10/20, Avg Loss: 12.8133, Avg Regression Loss 8.0553, Avg Classification Loss: 4.7580
2023-03-26 08:12:26 - Epoch: 0, Training Loss: 10.2453, Training Regression Loss 6.4478, Training Classification Loss: 3.7974
2023-03-26 08:12:30 - Epoch: 0, Validation Loss: 7.7618, Validation Regression Loss 4.6914, Validation Classification Loss: 3.0704
2023-03-26 08:12:30 - Saved model models/pincher/mb1-ssd-Epoch-0-Loss-7.761813521385193.pth
2023-03-26 08:13:01 - Epoch: 1, Step: 10/20, Avg Loss: 7.8498, Avg Regression Loss 4.3744, Avg Classification Loss: 3.4754
2023-03-26 08:13:24 - Epoch: 1, Training Loss: 6.9818, Training Regression Loss 3.8846, Training Classification Loss: 3.0972
2023-03-26 08:13:27 - Epoch: 1, Validation Loss: 7.2093, Validation Regression Loss 4.0637, Validation Classification Loss: 3.1456
2023-03-26 08:13:28 - Saved model models/pincher/mb1-ssd-Epoch-1-Loss-7.209309101104736.pth
2023-03-26 08:13:59 - Epoch: 2, Step: 10/20, Avg Loss: 6.7346, Avg Regression Loss 3.5297, Avg Classification Loss: 3.2049
2023-03-26 08:14:23 - Epoch: 2, Training Loss: 5.9695, Training Regression Loss 3.1388, Training Classification Loss: 2.8307
2023-03-26 08:14:26 - Epoch: 2, Validation Loss: 5.3872, Validation Regression Loss 2.8208, Validation Classification Loss: 2.5664
2023-03-26 08:14:26 - Saved model models/pincher/mb1-ssd-Epoch-2-Loss-5.3872374296188354.pth
2023-03-26 08:14:55 - Epoch: 3, Step: 10/20, Avg Loss: 6.1297, Avg Regression Loss 3.2159, Avg Classification Loss: 2.9138
2023-03-26 08:15:21 - Epoch: 3, Training Loss: 5.4523, Training Regression Loss 2.8301, Training Classification Loss: 2.6222
2023-03-26 08:15:24 - Epoch: 3, Validation Loss: 5.7545, Validation Regression Loss 2.8110, Validation Classification Loss: 2.9436
2023-03-26 08:15:24 - Saved model models/pincher/mb1-ssd-Epoch-3-Loss-5.754525423049927.pth
2023-03-26 08:15:59 - Epoch: 4, Step: 10/20, Avg Loss: 5.8826, Avg Regression Loss 2.9114, Avg Classification Loss: 2.9712
2023-03-26 08:16:23 - Epoch: 4, Training Loss: 5.2831, Training Regression Loss 2.6439, Training Classification Loss: 2.6392
2023-03-26 08:16:27 - Epoch: 4, Validation Loss: 4.7360, Validation Regression Loss 2.2415, Validation Classification Loss: 2.4945
2023-03-26 08:16:27 - Saved model models/pincher/mb1-ssd-Epoch-4-Loss-4.73599910736084.pth
2023-03-26 08:16:58 - Epoch: 5, Step: 10/20, Avg Loss: 5.4179, Avg Regression Loss 2.6786, Avg Classification Loss: 2.7393
2023-03-26 08:17:19 - Epoch: 5, Training Loss: 4.9277, Training Regression Loss 2.4470, Training Classification Loss: 2.4807

You will notice that epoch 0 finishes in about 30 seconds, but it then takes another 30 seconds before epoch 1 starts. I am now wondering whether something is wrong with this AWS VM.
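
Part of that gap is expected: with debug_steps=10, the first log line of an epoch only appears after 10 batches, so some of the 30 seconds is ordinary training plus the validation pass and checkpoint save. Another possible contributor is the DataLoader tearing down and re-spawning its worker processes at every epoch boundary. A quick experiment, sketched here with synthetic data rather than the real train_ssd.py loop, is to time consecutive epochs with persistent_workers=True and see whether the per-epoch overhead shrinks:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the 614-image training set.
dataset = TensorDataset(torch.randn(614, 3, 300, 300), torch.randint(0, 2, (614,)))

# persistent_workers=True keeps the worker processes alive between epochs,
# so each new epoch does not pay the worker startup cost again.
loader = DataLoader(dataset, batch_size=32, num_workers=2,
                    pin_memory=True, persistent_workers=True)

for epoch in range(3):
    t0 = time.time()
    for images, labels in loader:
        pass                              # the training step would go here
    print(f"epoch {epoch}: {time.time() - t0:.1f}s")
```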

dusty-nv commented 1 year ago

> but it then takes another 30sec to start epoch 1

Is it taking 30 seconds to save the model? Normally it doesn't take that long.
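
A quick way to check is to time a torch.save of a similarly sized state dict onto the same disk the checkpoint folder lives on. A rough sketch, using a stock torchvision MobileNetV2 as a stand-in for the SSD checkpoint (which is also in the tens-of-megabytes range):

```python
import time
import torch
import torchvision

# Stand-in model of roughly comparable checkpoint size.
model = torchvision.models.mobilenet_v2()

t0 = time.time()
torch.save(model.state_dict(), "save-timing-test.pth")  # write to the same volume as the checkpoints
print(f"torch.save took {time.time() - t0:.2f}s")
```

If this takes multiple seconds, slow instance storage is the likely culprit; if it is nearly instant, the pause between epochs is probably the validation pass and the DataLoader restarting instead.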