Closed Nikil-Shyamsunder closed 1 year ago
Hi @Nikil-Shyamsunder, you can read some of my response to your question in this thread here: https://forums.developer.nvidia.com/t/please-help-nvidia-jetson-2gb-training-fails-typeerror-init-missing-1-required-positional-argument-dtype/201674/7?u=dusty_nv
It should be possible to run this on an AWS GPU instance, but I haven't done it myself with AWS. I have an Ubuntu laptop here that I use.
You would not use the jetson-inference container on x86; rather, you would use a container like the NGC PyTorch container (which is built for x86). Then mount the pytorch-ssd repo in it and run train_ssd.py like you normally would.
Regarding the ONNX export, you can do that either on the x86 machine or on your Jetson.
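To make the architecture point above concrete, here is a minimal sketch of how one might check which kind of machine a script is running on before choosing a container. The helper name suggest_training_environment is hypothetical and not part of jetson-inference or pytorch-ssd. It also assumes that an ARM64 result means a Jetson-class board, which is not guaranteed (other ARM64 machines exist).

```python
import platform


def suggest_training_environment(machine: str) -> str:
    """Hypothetical helper: map a CPU architecture string to the kind of
    container discussed in this thread."""
    arch = machine.lower()
    if arch in ("aarch64", "arm64"):
        # Jetson boards are 64-bit ARM. This check alone does not prove
        # the machine is a Jetson; it only rules out x86.
        return "ARM64: the jetson-inference container may apply"
    if arch in ("x86_64", "amd64"):
        # Typical for laptops, desktops, and most cloud GPU instances.
        return "x86: use an x86 PyTorch container, such as one from NGC"
    return "unrecognized architecture: " + machine


if __name__ == "__main__":
    # platform.machine() returns e.g. 'x86_64' or 'aarch64'.
    print(suggest_training_environment(platform.machine()))
```

An image built for one architecture generally will not start on the other, which is one plausible reason a Jetson container would fail on an x86 cloud VM.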
Dustin,
I launched a VM in AWS, installed boto3, PyTorch, OpenCV, etc., downloaded the pytorch-ssd repo, and ran train_ssd.py as you suggested. The parameters have changed, so I had to adjust some arguments; this is the best I could do:
$ python ./open_images_downloader.py --class_names "Apple,Orange,Banana,Strawberry,Grape,Pear,Pineapple,Watermelon" --root=data/fruit
{ WORKED -- made changes to arguments }
$ python ./train_ssd.py --datasets=data/fruit --checkpoint_folder=models/fruit --batch_size=4 --num_epochs=30
2022-01-27 19:13:01,332 - root - INFO - Namespace(dataset_type='voc', datasets=['data/fruit'], validation_dataset=None, balance_data=False, net='vgg16-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.001, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=None, extra_layers_lr=None, base_net=None, pretrained_ssd=None, resume=None, scheduler='multi-step', milestones='80,100', t_max=120, batch_size=4, num_epochs=30, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/fruit')
2022-01-27 19:13:01,333 - root - INFO - Prepare training datasets.
Traceback (most recent call last):
File "/home/ec2-user/pytorch-ssd/./train_ssd.py", line 210, in <module>
dataset = VOCDataset(dataset_path, transform=train_transform,
File "/home/ec2-user/pytorch-ssd/vision/datasets/voc_dataset.py", line 24, in __init__
self.ids = VOCDataset._read_image_ids(image_sets_file)
File "/home/ec2-user/pytorch-ssd/vision/datasets/voc_dataset.py", line 87, in _read_image_ids
with open(image_sets_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'data/fruit/ImageSets/Main/trainval.txt'
Thoughts? I know I am not using VOC.
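The traceback above shows the VOC loader looking for ImageSets/Main/trainval.txt, which a dataset fetched with open_images_downloader.py would not be expected to contain. As a rough sketch, the layout of a dataset folder can be inspected before training to see which loader it matches. The function name guess_dataset_type is hypothetical, and the Open Images marker file name below is an assumption about what the downloader writes, so adjust it to whatever is actually in your data/fruit folder.

```python
from pathlib import Path

# Path taken from the traceback in this thread.
VOC_MARKER = Path("ImageSets") / "Main" / "trainval.txt"
# ASSUMPTION: annotation file name produced by open_images_downloader.py.
OPEN_IMAGES_MARKER = "sub-train-annotations-bbox.csv"


def guess_dataset_type(root):
    """Return 'voc', 'open_images', or None based on marker files in root."""
    root = Path(root)
    if (root / VOC_MARKER).is_file():
        return "voc"
    if (root / OPEN_IMAGES_MARKER).is_file():
        return "open_images"
    return None


if __name__ == "__main__":
    print(guess_dataset_type("data/fruit"))
```

If this prints something other than the dataset type shown in the Namespace(...) log line, the type passed to train_ssd.py and the data on disk disagree.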
Hmm I'm not sure why it would default to VOC when the default dataset type is open_images: https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/train_ssd.py#L33
Can you try specifying --dataset-type=open_images?
Dustin,
$ ./train_ssd.py --datasets=data/fruit --checkpoint_folder=models/fruit --batch_size=4 --num_epochs=30 --dataset_type=open_images
2022-01-27 20:10:40,070 - root - INFO - Namespace(dataset_type='open_images', datasets=['data/fruit'], validation_dataset=None, balance_data=False, net='vgg16-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.001, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=None, extra_layers_lr=None, base_net=None, pretrained_ssd=None, resume=None, scheduler='multi-step', milestones='80,100', t_max=120, batch_size=4, num_epochs=30, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/fruit')
2022-01-27 20:10:40,072 - root - INFO - Prepare training datasets.
Traceback (most recent call last):
File "/home/ec2-user/pytorch-ssd/./train_ssd.py", line 220, in <module>
store_labels(label_file, dataset.class_names)
File "/home/ec2-user/pytorch-ssd/vision/utils/misc.py", line 44, in store_labels
with open(path, "w") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'models/fruit/open-images-model-labels.txt'
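A note on this second traceback: in Python, open(path, "w") creates the file but not any missing parent directories, so one likely cause is that models/fruit did not exist yet. Creating it by hand (mkdir -p models/fruit) should be enough. The sketch below shows the same idea in code; store_labels_safely is a hypothetical stand-in, not the repo's store_labels, and the one-label-per-line format is an assumption.

```python
import os


def store_labels_safely(path, labels):
    """Hypothetical variant of a label writer that creates the parent
    folder first, then writes one label per line."""
    parent = os.path.dirname(path)
    if parent:
        # exist_ok=True makes this a no-op when the folder already exists.
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        f.write("\n".join(labels))


if __name__ == "__main__":
    store_labels_safely("models/fruit/open-images-model-labels.txt",
                        ["BACKGROUND", "Apple", "Orange"])
```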
parser.add_argument('--net', default="vgg16-ssd",
help="The network architecture, it can be mb1-ssd, mb1-lite-ssd, mb2-ssd-lite, mb3-large-ssd-lite, mb3-small-ssd-lite or vgg16-ssd.")
python ./train_ssd.py --datasets=data/fruit --checkpoint_folder=models/fruit --batch_size=4 --num_epochs=30 --dataset_type=open_images --net=mb1-lite-ssd
2022-01-27 20:14:46,280 - root - INFO - Namespace(dataset_type='open_images', datasets=['data/fruit'], validation_dataset=None, balance_data=False, net='mb1-lite-ssd', freeze_base_net=False, freeze_net=False, mb2_width_mult=1.0, lr=0.001, momentum=0.9, weight_decay=0.0005, gamma=0.1, base_net_lr=None, extra_layers_lr=None, base_net=None, pretrained_ssd=None, resume=None, scheduler='multi-step', milestones='80,100', t_max=120, batch_size=4, num_epochs=30, num_workers=4, validation_epochs=5, debug_steps=100, use_cuda=True, checkpoint_folder='models/fruit')
2022-01-27 20:14:46,281 - root - CRITICAL - The net type is wrong.
usage: train_ssd.py [-h] [--dataset_type DATASET_TYPE] [--datasets DATASETS [DATASETS ...]] [--validation_dataset VALIDATION_DATASET] [--balance_data] [--net NET] [--freeze_base_net] [--freeze_net] [--mb2_width_mult MB2_WIDTH_MULT] [--lr LR]
[--momentum MOMENTUM] [--weight_decay WEIGHT_DECAY] [--gamma GAMMA] [--base_net_lr BASE_NET_LR] [--extra_layers_lr EXTRA_LAYERS_LR] [--base_net BASE_NET] [--pretrained_ssd PRETRAINED_SSD] [--resume RESUME] [--scheduler SCHEDULER]
[--milestones MILESTONES] [--t_max T_MAX] [--batch_size BATCH_SIZE] [--num_epochs NUM_EPOCHS] [--num_workers NUM_WORKERS] [--validation_epochs VALIDATION_EPOCHS] [--debug_steps DEBUG_STEPS] [--use_cuda USE_CUDA]
[--checkpoint_folder CHECKPOINT_FOLDER]
Single Shot MultiBox Detector Training With Pytorch
optional arguments:
-h, --help show this help message and exit
--dataset_type DATASET_TYPE
Specify dataset type. Currently support voc and open_images.
--datasets DATASETS [DATASETS ...]
Dataset directory path
--validation_dataset VALIDATION_DATASET
Dataset directory path
--balance_data Balance training data by down-sampling more frequent labels.
--net NET The network architecture, it can be mb1-ssd, mb1-lite-ssd, mb2-ssd-lite, mb3-large-ssd-lite, mb3-small-ssd-lite or vgg16-ssd.
--freeze_base_net Freeze base net layers.
--freeze_net Freeze all the layers except the prediction head.
--mb2_width_mult MB2_WIDTH_MULT
Width Multiplifier for MobilenetV2
--lr LR, --learning-rate LR
initial learning rate
--momentum MOMENTUM Momentum value for optim
--weight_decay WEIGHT_DECAY
Weight decay for SGD
--gamma GAMMA Gamma update for SGD
--base_net_lr BASE_NET_LR
initial learning rate for base net.
--extra_layers_lr EXTRA_LAYERS_LR
initial learning rate for the layers not in base net and prediction heads.
--base_net BASE_NET Pretrained base model
--pretrained_ssd PRETRAINED_SSD
Pre-trained base model
--resume RESUME Checkpoint state_dict file to resume training from
--scheduler SCHEDULER
Scheduler for SGD. It can one of multi-step and cosine
--milestones MILESTONES
milestones for MultiStepLR
--t_max T_MAX T_max value for Cosine Annealing Scheduler.
--batch_size BATCH_SIZE
Batch size for training
--num_epochs NUM_EPOCHS
the number epochs
--num_workers NUM_WORKERS
Number of workers used in dataloading
--validation_epochs VALIDATION_EPOCHS
the number epochs
--debug_steps DEBUG_STEPS
Set the debug log output frequency.
--use_cuda USE_CUDA Use CUDA to train model
--checkpoint_folder CHECKPOINT_FOLDER
Directory for saving checkpoint models
The default --net should be mb1-ssd: https://github.com/dusty-nv/pytorch-ssd/blob/8ed842a408f8c4a8812f430cf8063e0b93a56803/train_ssd.py#L40
Does yours show differently?
Dustin,
You caught the error. I had cloned the other repo, where someone was doing more advanced work. I should have cloned yours. Once I cloned yours, it worked like a charm.
THANK YOU. I am loving the jetson nano.
RESOLVED.
Aha, ok gotcha - great to hear! Glad you got it working :)
@Dustin,
I downloaded a subset of classes and am training on my Jetson Nano. There are about 17000 images across 6 classes. It is taking about 50 minutes per epoch with a batch size of 4. That means it will take me about 30 hours. That's a long time.
My questions are:
1) Is there anything else I can do to speed up training on my Nano?
2) Is there any way to train the SSD MobileNet model elsewhere, say in AWS, on a bigger and faster computer?
3) I tried doing this on Ubuntu in AWS. I tried running the container and I also tried building the environment from scratch. It fails due to the architecture. Any ideas? Is this not compatible?
4) If we do train somewhere else, how do I do it with the current scripts?
5) Once I train elsewhere, can I simply copy the model directory over and run the following command? $ python3 onnx_export.py --model-dir=models/mymodel
Please advise.
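For reference, the timing figures in this question can be worked through with a short calculation. This is only a back-of-the-envelope sketch: it assumes all 17000 images are used for training, that the run is 30 epochs (the value passed to --num_epochs in the commands elsewhere in this thread), and it ignores validation and checkpointing time. The function name estimate_training is hypothetical.

```python
import math


def estimate_training(num_images, batch_size, minutes_per_epoch, num_epochs):
    """Return (iterations per epoch, seconds per iteration, total hours)."""
    iterations = math.ceil(num_images / batch_size)
    seconds_per_iteration = minutes_per_epoch * 60 / iterations
    total_hours = minutes_per_epoch * num_epochs / 60
    return iterations, seconds_per_iteration, total_hours


if __name__ == "__main__":
    # Figures from the post; 30 epochs is an assumption.
    its, sec, hours = estimate_training(17000, 4, 50, 30)
    print(f"{its} iterations/epoch, {sec:.2f} s/iteration, {hours:.1f} h total")
    # prints: 4250 iterations/epoch, 0.71 s/iteration, 25.0 h total
```

Under these assumptions the pure training time comes to about 25 hours, the same order as the roughly 30 hours quoted above.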