dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License

Training SSD Mobilenet model outside and importing to run in Jetson Nano 2GB #1343

Closed: Nikil-Shyamsunder closed this issue 1 year ago

Nikil-Shyamsunder commented 2 years ago

@Dustin,

I downloaded a subset of classes and am training on my Jetson Nano. There are about 17000 images across 6 classes. It is taking about 50 minutes per epoch with a batch size of 4. That means it will take me about 30 hours. That's a long time.
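The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the image count, batch size and time per epoch come from the post, while the epoch count of 30 is an assumption borrowed from the --num_epochs=30 used later in this thread, and the function name is hypothetical.

```python
import math

def estimate_training_hours(num_images, batch_size, minutes_per_epoch, num_epochs):
    """Return (iterations per epoch, seconds per iteration, total hours)."""
    iters_per_epoch = math.ceil(num_images / batch_size)
    sec_per_iter = minutes_per_epoch * 60 / iters_per_epoch
    total_hours = minutes_per_epoch * num_epochs / 60
    return iters_per_epoch, sec_per_iter, total_hours

# 17000 images, batch size 4 and 50 min/epoch are from the post;
# 30 epochs is an ASSUMPTION (not stated in the post itself).
iters, sec_per_iter, hours = estimate_training_hours(17000, 4, 50, 30)
print(f"{iters} iterations/epoch, {sec_per_iter:.2f} s/iteration, {hours:.1f} h total")
```

With these numbers the per-iteration time is well under a second, so the total is dominated by the number of iterations rather than by any single slow step.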

My questions are:

1) Is there anything else I can do to speed up training on my Nano?

2) Is there any way to train the SSD MobileNet model elsewhere, say in AWS, on a bigger and faster computer?

3) I tried doing this on Ubuntu in AWS. I tried running the container, and I also tried building the environment from scratch. Both fail because of the architecture. Any ideas? Is this not compatible?

4) If we do train somewhere else, how do I do it with the current scripts?

5) Once I train elsewhere, can I simply copy the model directory over and run the command $ python3 onnx_export.py --model-dir=models/mymodel ?

Please advise.

dusty-nv commented 2 years ago

Hi @Nikil-Shyamsunder, you can read some of my response to your question in this thread here: https://forums.developer.nvidia.com/t/please-help-nvidia-jetson-2gb-training-fails-typeerror-init-missing-1-required-positional-argument-dtype/201674/7?u=dusty_nv

It should be possible to run this on an AWS GPU instance, but I haven't done it myself with AWS. I have an Ubuntu laptop here that I use.

You would not use the jetson-inference container on x86; rather, you would use a container like the NGC PyTorch container (which is built for x86). Then mount/run the pytorch-ssd repo and run train_ssd.py like you normally would.

Regarding the ONNX export, you can do that either on the x86 machine or on your Jetson.

Nikil-Shyamsunder commented 2 years ago

Dustin,

I launched a VM in AWS and installed boto3, PyTorch, OpenCV, etc. I downloaded the pytorch-ssd repo and ran train_ssd.py as usual. The parameters have changed, so I had to adjust the arguments; this is the best I could do:


$ python ./open_images_downloader.py --class_names "Apple,Orange,Banana,Strawberry,Grape,Pear,Pineapple,Watermelon" --root=data/fruit

{ WORKED -- made changes to arguments }

$ python ./train_ssd.py --datasets=data/fruit --checkpoint_folder=models/fruit --batch_size=4 --num_epochs=30

2022-01-27 19:13:01,332 - root - INFO - Namespace(dataset_type='voc', datasets=['data/fruit'], validation_dataset=None, balance_data=False, net='vgg16-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.001, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=None, extra_layers_lr=None, base_net=None, pretrained_ssd=None, resume=None, scheduler='multi-step', milestones='80,100', t_max=120, batch_size=4, num_epochs=30, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/fruit')
2022-01-27 19:13:01,333 - root - INFO - Prepare training datasets.
Traceback (most recent call last):
  File "/home/ec2-user/pytorch-ssd/./train_ssd.py", line 210, in <module>
    dataset = VOCDataset(dataset_path, transform=train_transform,
  File "/home/ec2-user/pytorch-ssd/vision/datasets/voc_dataset.py", line 24, in __init__
    self.ids = VOCDataset._read_image_ids(image_sets_file)
  File "/home/ec2-user/pytorch-ssd/vision/datasets/voc_dataset.py", line 87, in _read_image_ids
    with open(image_sets_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/fruit/ImageSets/Main/trainval.txt'

Thoughts? I know I am not using VOC.
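The traceback above shows the VOC loader looking for ImageSets/Main/trainval.txt, which an Open Images download does not contain. A minimal sketch of a pre-flight check for the dataset layout follows. The VOC path comes from the traceback; the Open Images file name sub-train-annotations-bbox.csv is an assumption about what the downloader writes, and guess_dataset_type is a hypothetical helper, not part of pytorch-ssd.

```python
import os

def guess_dataset_type(root):
    """Guess the dataset layout from marker files under `root`.

    Returns "voc", "open_images", or None if neither marker is found.
    """
    # Path taken from the traceback: the VOC loader opens this file first.
    voc_marker = os.path.join(root, "ImageSets", "Main", "trainval.txt")
    # ASSUMPTION: the Open Images downloader writes an annotation CSV
    # with this name; adjust it to match the files actually on disk.
    open_images_marker = os.path.join(root, "sub-train-annotations-bbox.csv")

    if os.path.isfile(voc_marker):
        return "voc"
    if os.path.isfile(open_images_marker):
        return "open_images"
    return None

if __name__ == "__main__":
    print(guess_dataset_type("data/fruit"))
```

Running a check like this before training makes it obvious when the dataset type passed to the script does not match the files in the dataset directory.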

dusty-nv commented 2 years ago

> FileNotFoundError: [Errno 2] No such file or directory: 'data/fruit/ImageSets/Main/trainval.txt'
>
> Thoughts? I know I am not using VOC.

Hmm, I'm not sure why it would default to VOC when the default dataset type is open_images: https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/train_ssd.py#L33

Can you try specifying --dataset-type=open_images?

Nikil-Shyamsunder commented 2 years ago

Dustin,

I tried --dataset_type=open_images, as shown below:

$ ./train_ssd.py --datasets=data/fruit --checkpoint_folder=models/fruit --batch_size=4 --num_epochs=30  --dataset_type=open_images
2022-01-27 20:10:40,070 - root - INFO - Namespace(dataset_type='open_images', datasets=['data/fruit'], validation_dataset=None, balance_data=False, net='vgg16-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.001, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=None, extra_layers_lr=None, base_net=None, pretrained_ssd=None, resume=None, scheduler='multi-step', milestones='80,100', t_max=120, batch_size=4, num_epochs=30, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/fruit')
2022-01-27 20:10:40,072 - root - INFO - Prepare training datasets.
Traceback (most recent call last):
  File "/home/ec2-user/pytorch-ssd/./train_ssd.py", line 220, in <module>
    store_labels(label_file, dataset.class_names)
  File "/home/ec2-user/pytorch-ssd/vision/utils/misc.py", line 44, in store_labels
    with open(path, "w") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/fruit/open-images-model-labels.txt'
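A note on this traceback: open(path, "w") creates the file but not its parent directories, so the most likely cause is that models/fruit did not exist yet when the labels file was written. A minimal sketch of a label writer that creates the directory first is shown below. It mirrors the name store_labels from the traceback for readability but is not the repository's code, and the example labels are illustrative.

```python
import os

def store_labels(path, labels):
    """Write one label per line, creating the parent directory if needed."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the folder is already there.
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        f.write("\n".join(labels))

if __name__ == "__main__":
    # Illustrative labels only; the real list comes from the dataset.
    store_labels("models/fruit/open-images-model-labels.txt",
                 ["BACKGROUND", "Apple", "Orange"])
```

Creating the checkpoint folder by hand before starting the training run should have the same effect.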

I noticed the default net is vgg16-ssd. Should it be something else?

parser.add_argument('--net', default="vgg16-ssd",
                    help="The network architecture, it can be mb1-ssd, mb1-lite-ssd, mb2-ssd-lite, mb3-large-ssd-lite, mb3-small-ssd-lite or vgg16-ssd.")

I even tried --net=mb1-lite-ssd, as shown below:

python ./train_ssd.py --datasets=data/fruit --checkpoint_folder=models/fruit --batch_size=4 --num_epochs=30  --dataset_type=open_images --net=mb1-lite-ssd
2022-01-27 20:14:46,280 - root - INFO - Namespace(dataset_type='open_images', datasets=['data/fruit'], validation_dataset=None, balance_data=False, net='mb1-lite-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.001, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=None, extra_layers_lr=None, base_net=None, pretrained_ssd=None, resume=None, scheduler='multi-step', milestones='80,100', t_max=120, batch_size=4, num_epochs=30, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/fruit')
2022-01-27 20:14:46,281 - root - CRITICAL - The net type is wrong.
usage: train_ssd.py [-h] [--dataset_type DATASET_TYPE] [--datasets DATASETS [DATASETS ...]] [--validation_dataset VALIDATION_DATASET] [--balance_data] [--net NET] [--freeze_base_net] [--freeze_net] [--mb2_width_mult MB2_WIDTH_MULT] [--lr LR]
                    [--momentum MOMENTUM] [--weight_decay WEIGHT_DECAY] [--gamma GAMMA] [--base_net_lr BASE_NET_LR] [--extra_layers_lr EXTRA_LAYERS_LR] [--base_net BASE_NET] [--pretrained_ssd PRETRAINED_SSD] [--resume RESUME] [--scheduler SCHEDULER]
                    [--milestones MILESTONES] [--t_max T_MAX] [--batch_size BATCH_SIZE] [--num_epochs NUM_EPOCHS] [--num_workers NUM_WORKERS] [--validation_epochs VALIDATION_EPOCHS] [--debug_steps DEBUG_STEPS] [--use_cuda USE_CUDA]
                    [--checkpoint_folder CHECKPOINT_FOLDER]

Single Shot MultiBox Detector Training With Pytorch

optional arguments:
  -h, --help            show this help message and exit
  --dataset_type DATASET_TYPE
                        Specify dataset type. Currently support voc and open_images.
  --datasets DATASETS [DATASETS ...]
                        Dataset directory path
  --validation_dataset VALIDATION_DATASET
                        Dataset directory path
  --balance_data        Balance training data by down-sampling more frequent labels.
  --net NET             The network architecture, it can be mb1-ssd, mb1-lite-ssd, mb2-ssd-lite, mb3-large-ssd-lite, mb3-small-ssd-lite or vgg16-ssd.
  --freeze_base_net     Freeze base net layers.
  --freeze_net          Freeze all the layers except the prediction head.
  --mb2_width_mult MB2_WIDTH_MULT
                        Width Multiplifier for MobilenetV2
  --lr LR, --learning-rate LR
                        initial learning rate
  --momentum MOMENTUM   Momentum value for optim
  --weight_decay WEIGHT_DECAY
                        Weight decay for SGD
  --gamma GAMMA         Gamma update for SGD
  --base_net_lr BASE_NET_LR
                        initial learning rate for base net.
  --extra_layers_lr EXTRA_LAYERS_LR
                        initial learning rate for the layers not in base net and prediction heads.
  --base_net BASE_NET   Pretrained base model
  --pretrained_ssd PRETRAINED_SSD
                        Pre-trained base model
  --resume RESUME       Checkpoint state_dict file to resume training from
  --scheduler SCHEDULER
                        Scheduler for SGD. It can one of multi-step and cosine
  --milestones MILESTONES
                        milestones for MultiStepLR
  --t_max T_MAX         T_max value for Cosine Annealing Scheduler.
  --batch_size BATCH_SIZE
                        Batch size for training
  --num_epochs NUM_EPOCHS
                        the number epochs
  --num_workers NUM_WORKERS
                        Number of workers used in dataloading
  --validation_epochs VALIDATION_EPOCHS
                        the number epochs
  --debug_steps DEBUG_STEPS
                        Set the debug log output frequency.
  --use_cuda USE_CUDA   Use CUDA to train model
  --checkpoint_folder CHECKPOINT_FOLDER
                        Directory for saving checkpoint models

dusty-nv commented 2 years ago

The default --net should be mb1-ssd: https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/train_ssd.py#L40

Does yours show differently?
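One way to see how two copies of a training script can disagree is to compare their argparse defaults. The sketch below builds two stand-in parsers using only the values quoted in this thread: --dataset-type with defaults open_images and mb1-ssd on one side, and --dataset_type with defaults voc and vgg16-ssd (as printed in the logs above) on the other. The names build_parser, linked_repo and other_clone are hypothetical, and the real scripts define many more options.

```python
import argparse

def build_parser(dataset_flag, dataset_default, net_default):
    """Build a stand-in parser with just the two options discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(dataset_flag, default=dataset_default)
    parser.add_argument("--net", default=net_default)
    return parser

# Values quoted in this thread; the variable names are hypothetical.
linked_repo = build_parser("--dataset-type", "open_images", "mb1-ssd")
other_clone = build_parser("--dataset_type", "voc", "vgg16-ssd")

for name, parser in [("linked_repo", linked_repo), ("other_clone", other_clone)]:
    print(name, parser.get_default("dataset_type"), parser.get_default("net"))

# argparse stores "--dataset-type" under the attribute name dataset_type.
args, extras = linked_repo.parse_known_args(["--dataset-type=voc"])

# A parser that only declares "--dataset_type" does not recognise the
# hyphenated spelling and leaves it in the list of unknown arguments.
args_other, extras_other = other_clone.parse_known_args(
    ["--dataset-type=open_images"])
print(args_other.dataset_type, extras_other)
```

If the defaults printed by a local copy differ from the ones in the linked source file, that is a strong hint that a different fork or revision was cloned.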

Nikil-Shyamsunder commented 2 years ago

Dustin,

You caught the error. I had cloned the other repo, where someone was doing more advanced work. I should have cloned yours. Once I cloned yours, it worked like a charm.

THANK YOU. I am loving the jetson nano.

RESOLVED.

dusty-nv commented 2 years ago

Aha, ok gotcha - great to hear! Glad you got it working :)