dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License

Unable to train ssd mobilenet model #1099

Closed gitgkk closed 3 years ago

gitgkk commented 3 years ago

Hi @dusty-nv

As per your suggestion, I downloaded the Docker container and am now using it. I tried running train_ssd.py, but the command fails.

One difference I noticed is in the files in the Main folder: I only have default.txt, which lists the image names.

I have my own custom dataset of various helipad images. I used the CVAT tool for annotation.

What should the files in the Main folder contain, and how do I resolve this error? Some images had multiple annotations, which I thought might be the problem, so I reduced the dataset to 10 images with a single annotation each. Are there any restrictions on how many annotation boxes an image can contain?

root@leapfrog:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/helipad --model-dir=models/helipad --batch-size=1 --workers=1 --epochs=1
2021-06-11 13:06:47 - Using CUDA...
2021-06-11 13:06:47 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/helipad', dataset_type='voc', datasets=['data/helipad'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-06-11 13:06:47 - Prepare training datasets.
2021-06-11 13:06:47 - VOC Labels read from file: ('BACKGROUND', '# label:color_rgb:parts:actions', 'Helipad:128,0,0::')
2021-06-11 13:06:47 - Stored labels into file models/helipad/labels.txt.
2021-06-11 13:06:47 - Train dataset size: 10
2021-06-11 13:06:47 - Prepare Validation datasets.
2021-06-11 13:06:47 - VOC Labels read from file: ('BACKGROUND', '# label:color_rgb:parts:actions', 'Helipad:128,0,0::')
2021-06-11 13:06:47 - Validation dataset size: 10
2021-06-11 13:06:47 - Build network.
2021-06-11 13:06:47 - Init from pretrained ssd models/mobilenet-v1-ssd-mp-0_675.pth
2021-06-11 13:06:48 - Took 0.51 seconds to load the model.
2021-06-11 13:07:39 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-06-11 13:07:39 - Uses CosineAnnealingLR scheduler.
2021-06-11 13:07:39 - Start training from epoch 0.
/usr/local/lib/python3.6/dist-packages/torch/optim/lr_scheduler.py:123: UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
warning - image 10 has object with unknown class 'Helipad'
Traceback (most recent call last):
  File "train_ssd.py", line 343, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 113, in train
    for i, data in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 363, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 989, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.6/dist-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataset.py", line 207, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 81, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/ssd/data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "/jetson-inference/python/training/detection/ssd/vision/transforms/transforms.py", line 345, in __call__
    boxes[:, :2] += (int(left), int(top))
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Thanks, Kashyap

dusty-nv commented 3 years ago

It appears that you need to make a new labels.txt in your dataset. By the looks of things it should have one line: Helipad
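
The log above shows the label file being read as 'BACKGROUND', '# label:color_rgb:parts:actions' and 'Helipad:128,0,0::', which looks like a labelmap in `label:color_rgb:parts:actions` form rather than a list of plain class names. The following is a minimal sketch, assuming that layout, of how the plain names for labels.txt could be extracted. The function name `labelmap_to_labels` is hypothetical and not part of jetson-inference, and dropping a `background` entry is an assumption based on the log, where BACKGROUND appears to be added by the loader.

```python
def labelmap_to_labels(labelmap_text):
    """Extract plain class names from 'label:color_rgb:parts:actions' lines."""
    names = []
    for line in labelmap_text.splitlines():
        line = line.strip()
        # Skip blank lines and the '# label:color_rgb:parts:actions' header.
        if not line or line.startswith("#"):
            continue
        # The class name is everything before the first colon.
        name = line.split(":", 1)[0].strip()
        # Assumption: the training script adds BACKGROUND itself (as the log
        # above suggests), so an explicit background entry is dropped here.
        if name and name.lower() != "background":
            names.append(name)
    return names


if __name__ == "__main__":
    sample = "# label:color_rgb:parts:actions\nHelipad:128,0,0::\n"
    # One class name per line, ready to be saved as labels.txt.
    print("\n".join(labelmap_to_labels(sample)))
```

For the helipad dataset in this thread, the result is the single line `Helipad`, matching the suggestion above.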

gitgkk commented 3 years ago

Hello @dusty-nv et al.,

After failing to configure and build jetson-inference on my Jetson Nano, I switched to the Docker version and installed it on my system. I now have a working detection model trained with PyTorch on my custom image dataset.

For readers who land here one day searching for a solution, here are the steps you need to perform to train on custom data in Pascal VOC format. I also drew on https://github.com/dusty-nv/jetson-inference/issues/789 and other issues.

  1. Create the following directory structure in /jetson-inference/python/training/detection/ssd/data/data-name, where "data-name" is the name of your class or class group, e.g. fruits for a set of fruits.
    • Annotations/
      • Keep the *.xml annotation files in this directory.
    • ImageSets/
      • Main/
        • test.txt (List the test jpg file names in this file, without the .jpg extension.)
        • train.txt (List the training jpg file names in this file, without the .jpg extension.)
        • trainval.txt (List the training and validation jpg file names in this file, without the .jpg extension.)
        • val.txt (List the validation jpg file names in this file, without the .jpg extension.)
    • JPEGImages/
      • *.jpg
    • labels.txt

More about this structure here https://programmer.help/blogs/tenor-flow-2.0-note-10-pascal-voc-data-set-introduction.html
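
The four split files under ImageSets/Main can be generated instead of written by hand. The sketch below assumes the directory layout from step 1 and derives the image IDs from the JPEGImages folder. The function name `write_image_sets`, the split fractions and the fixed seed are illustrative choices, not something prescribed by train_ssd.py.

```python
import os
import random


def write_image_sets(dataset_dir, val_fraction=0.2, test_fraction=0.1, seed=0):
    """Write ImageSets/Main/{train,val,trainval,test}.txt from JPEGImages/*.jpg."""
    jpeg_dir = os.path.join(dataset_dir, "JPEGImages")
    # Image IDs are the file names without the .jpg extension.
    ids = sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(jpeg_dir)
        if name.lower().endswith(".jpg")
    )
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(ids)

    n_test = int(len(ids) * test_fraction)
    n_val = int(len(ids) * val_fraction)
    splits = {
        "test": ids[:n_test],
        "val": ids[n_test:n_test + n_val],
        "train": ids[n_test + n_val:],
    }
    # trainval is the union of the training and validation IDs.
    splits["trainval"] = splits["train"] + splits["val"]

    main_dir = os.path.join(dataset_dir, "ImageSets", "Main")
    os.makedirs(main_dir, exist_ok=True)
    for split_name, members in splits.items():
        with open(os.path.join(main_dir, split_name + ".txt"), "w") as fh:
            fh.write("\n".join(sorted(members)) + ("\n" if members else ""))
    return splits
```

Calling `write_image_sets("data/data-name")` from the ssd directory would write one image ID per line into each file. With very small datasets the test or validation split can end up empty, so the fractions may need adjusting.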

  2. Use only the labelImg tool to annotate your images. Some fields in the .xml annotation file must reflect your directory structure and hold specific values, otherwise training will fail (see the link under step 1 above). I had to reverse the angle brackets in the XML notation below, because otherwise the editor would not let me post the tags with their values visible:
    • ">folder/folder<"
    • ">truncated<0>/truncated<"
    • ">difficult<0>/difficult<"
    • ">path<data/data-name/JPEGImages/1.jpg>/path<" (This must be the path from the data directory to your image file, including the file extension, .jpg in my case.)
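
The annotation fields mentioned in this step can be checked or patched with the Python standard library. This is only a sketch under assumptions: it takes the usual Pascal VOC layout in which `path` sits directly under `annotation` and `truncated` and `difficult` sit inside each `object`, and it leaves `folder` alone because the post does not show its value. The helper name `normalize_voc_annotation` is hypothetical.

```python
import xml.etree.ElementTree as ET


def normalize_voc_annotation(xml_text, image_rel_path):
    """Return the annotation XML with path set and truncated/difficult filled in."""
    root = ET.fromstring(xml_text)

    # Set <path> to the image location relative to the ssd directory,
    # e.g. data/data-name/JPEGImages/1.jpg, creating the element if needed.
    path_el = root.find("path")
    if path_el is None:
        path_el = ET.SubElement(root, "path")
    path_el.text = image_rel_path

    # Give every object a truncated and difficult value of 0 when the
    # element is missing or empty; existing values are left untouched.
    for obj in root.findall("object"):
        for tag in ("truncated", "difficult"):
            el = obj.find(tag)
            if el is None:
                el = ET.SubElement(obj, tag)
            if not (el.text or "").strip():
                el.text = "0"

    return ET.tostring(root, encoding="unicode")
```

A wrapper would read each file in Annotations/, pass its text and the matching image path to this function, and write the result back. Keeping a backup of the original annotations first is advisable.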

  3. Your data is now ready for training:
     python3 train_ssd.py --dataset-type=voc --model-dir=models/data-name --data=data/data-name --pretrained-ssd=models/mobilenet-v1-ssd-mp-0_675.pth --batch-size=2 --num-epochs=400 --num-workers=2

You can adjust batch-size, num-epochs and num-workers to suit your hardware. On the Nano, the values above ran comfortably.

Pay attention to the number of epochs: for SSD-Mobilenet, training writes one .pth file of about 25 MB per epoch, so 400 epochs produced about 10 GB of .pth files. If you have more free space, you can increase the number of epochs, which may improve the accuracy of the trained model. After you run the detection program for the first time, an optimized network is generated; if you are short on disk space, you can then delete the .pth files.
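
As a rough check, 400 epochs at about 25 MB each comes to 400 × 25 MB = 10,000 MB, or roughly 10 GB. If disk space is tight, old .pth files can be pruned. The sketch below keeps the most recently modified ones. The helper name `prune_checkpoints` and the keep-by-recency policy are assumptions for illustration: the newest checkpoint is not necessarily the best one, so it is safer to export the model before pruning.

```python
import os


def prune_checkpoints(model_dir, keep=5, dry_run=True):
    """Return the .pth files outside the `keep` newest; delete them unless dry_run."""
    paths = [
        os.path.join(model_dir, name)
        for name in os.listdir(model_dir)
        if name.endswith(".pth")
    ]
    # Newest first, by modification time.
    paths.sort(key=os.path.getmtime, reverse=True)
    stale = paths[keep:]
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale


if __name__ == "__main__":
    # Dry run: only list what would be removed from models/data-name.
    if os.path.isdir("models/data-name"):
        for path in prune_checkpoints("models/data-name", keep=5):
            print("would delete", path)
```

The default `dry_run=True` only reports the candidates, so nothing is deleted until the list has been reviewed.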

  4. Export the model: python3 onnx_export.py --model-dir=models/data-name

  5. Test with a USB camera: detectnet --model=models/data-name/ssd-mobilenet.onnx --labels=models/data-name/labels.txt --input-blob=input_0 --output-cvg=scores --output-bbox=boxes /dev/video0
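
The model and labels paths in the test command have to point at the same model directory. A small sketch that assembles the command from a single directory argument avoids a mismatch. The helper name `detectnet_command` is hypothetical; the flags are exactly those from the step above.

```python
import shlex


def detectnet_command(model_dir, source="/dev/video0"):
    """Build the detectnet command line for an exported ssd-mobilenet.onnx."""
    args = [
        "detectnet",
        "--model=%s/ssd-mobilenet.onnx" % model_dir,
        "--labels=%s/labels.txt" % model_dir,
        "--input-blob=input_0",
        "--output-cvg=scores",
        "--output-bbox=boxes",
        source,
    ]
    # Quote each argument so paths containing spaces survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(detectnet_command("models/helipad"))
```

The printed string can be pasted into the container shell, or the same argument list can be passed to `subprocess.run` without joining.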

I hope this helps.

Thanks, Kashyap