dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License
7.86k stars, 2.98k forks

training ssd-mobilenet from custom dataset #789

Closed (Hasandaoud closed this issue 1 year ago)

Hasandaoud commented 4 years ago

Hi @dusty-nv, I want to train on my own dataset to detect emotions using ssd-mobilenet. I am using labelImg to label my images (Pascal VOC), and I put my dataset in ssd/data. Command to run training:

python3 train_ssd.py --dataset-type=voc --data=data/14emotions --model-dir=models/14emotion --epochs=10

Output:

2020-11-06 18:40:47 - Using CUDA...
2020-11-06 18:40:47 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/14emotion', dataset_type='voc', datasets=['data/14emotions'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=10, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2020-11-06 18:40:47 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 214, in <module>
    target_transform=target_transform)
  File "/home/daoud/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 33, in __init__
    raise IOError("missing ImageSet file {:s}".format(image_sets_file))
TypeError: unsupported format string passed to PosixPath.__format__

what should i do to solve it? thank you.
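A note on the traceback above: the TypeError hides the real problem. The script was trying to raise an IOError about a missing ImageSet file, but formatting a PosixPath with the `{:s}` spec fails first. The sketch below reproduces that behaviour using only the standard library; the path and the helper names (`format_fails`, `describe_missing`) are illustrative assumptions, not code from the repository.

```python
from pathlib import Path

# Illustrative path only; it does not need to exist on disk.
image_sets_file = Path("data/14emotions/ImageSets/Main/trainval.txt")


def format_fails(value):
    """Return True if the '{:s}' formatting used in the traceback raises TypeError."""
    try:
        "missing ImageSet file {:s}".format(value)
    except TypeError:
        return True
    return False


def describe_missing(path):
    """Build the intended message by converting the path to str before formatting."""
    return "missing ImageSet file {:s}".format(str(path))
```

With a Path argument `format_fails` returns True, while a plain string formats fine. The later log in this thread shows the script using an f-string instead, which reports the missing file directly.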

dusty-nv commented 4 years ago

Hi @abu3ali, you will need to make sure that your dataset follows the Pascal VOC format and directory structure:

- Annotations/
      - *.xml
- ImageSets/
      - Main
            - test.txt
            - train.txt
            - trainval.txt
            - val.txt
- JPEGImages/
      - *.jpg
- labels.txt

It looks like right now you are missing the test.txt, train.txt, trainval.txt, val.txt under ImageSets/Main.
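As a rough way to check a dataset folder against the layout listed above, a small script can report which entries are absent. This is only a sketch: the function name `missing_voc_entries` is hypothetical, and it checks for the presence of the directories and files, not their contents.

```python
from pathlib import Path

# Split files named in the layout above.
SPLIT_FILES = ("test.txt", "train.txt", "trainval.txt", "val.txt")


def missing_voc_entries(dataset_root):
    """Return the relative paths from the layout above that do not exist under dataset_root."""
    root = Path(dataset_root)
    expected = [Path("Annotations"), Path("JPEGImages"), Path("labels.txt")]
    expected += [Path("ImageSets") / "Main" / name for name in SPLIT_FILES]
    # as_posix() keeps the output readable and identical across platforms.
    return [p.as_posix() for p in expected if not (root / p).exists()]
```

Calling `missing_voc_entries("data/14emotions")` would return an empty list for a complete dataset. The sketch does not account for the default.txt shortcut mentioned later in this thread.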

You can create a text file with your list of ImageIDs and copy it to those text files. It should look something like this, where each line is an ImageID (without the .jpg extension):

20200917-162237
20200917-162252
20200917-162314
20200917-162332
20200917-162351
20200917-162414
20200917-162429
20200917-162447
...
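The ImageID list described above can be generated from the image filenames instead of being typed by hand. The following is a minimal sketch, assuming the images live in JPEGImages with a lowercase .jpg extension; the function name `write_image_id_lists` is made up for illustration, and it writes the same full list to every split file, as suggested above.

```python
from pathlib import Path


def write_image_id_lists(dataset_root, splits=("test", "train", "trainval", "val")):
    """Write one ImageID per line to ImageSets/Main/<split>.txt and return the IDs."""
    root = Path(dataset_root)
    # An ImageID is the image filename without the .jpg extension.
    image_ids = sorted(p.stem for p in (root / "JPEGImages").glob("*.jpg"))
    out_dir = root / "ImageSets" / "Main"
    out_dir.mkdir(parents=True, exist_ok=True)
    text = "".join(f"{image_id}\n" for image_id in image_ids)
    for split in splits:
        (out_dir / f"{split}.txt").write_text(text)
    return image_ids
```

For example, `write_image_id_lists("data/14emotions")` would create the four text files for the dataset from the first post. Images with other extensions would need an extra glob pattern.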

You also need to create your labels.txt file.

It seems that the LabelImg tool doesn't create the full structure, but the CVAT tool gets closer, so I have changed the docs to recommend CVAT going forward. However, if you move your annotations/images around and re-create the VOC directory structure, LabelImg works fine too.

When in doubt, you can download the original Pascal VOC dataset and look at how it is structured.

silent-code commented 3 years ago

Indeed, the labelImg tool does not create the full file structure. You need to do the following:

  1. First create a labels.txt file in the jetson-inference/python/training/detection/ssd directory (this labels.txt file should NOT have the BACKGROUND class listed, just the classes you want to train on, e.g., dog, cat, rhino)

  2. Place the following sub-directories, populated with the labelImg tool outputs, in the same directory: Annotations, JPEGImages

  3. Then enter the following command from the ssd directory: python vision/datasets/generate_vocdata.py ./labels.txt

Now you are ready to train with the mb1 pretrained network: python3 train_ssd.py --dataset-type=voc --model-dir=models/my-models-voc --data=./ --pretrained-ssd='models/mobilenet-v1-ssd-mp-0_675.pth' --batch-size=4 --num-epochs=50

Then convert your trained model to ONNX (delete the labels.txt file in the ssd directory first, since the above step creates a labels.txt file for you in the directory specified by --input): python3 onnx_export.py --input="./models/my-models-voc/name-of-model-you-want-to-convert.pth" --model-dir=models/my-models-voc

Good luck!

07hokage commented 3 years ago

Hello @dusty-nv and @silent-code, thanks for the inputs. I was able to train the model and save it too (I resized all the training images to 300x300 before training). But while converting to ONNX, I'm getting the following error: [image]

The size mismatch occurs during conversion. In the code, dummy_input is created using dummy_input = torch.randn(args.batch_size, 3, args.height, args.width).cuda(). The default height and width are 300, 300, so it doesn't make sense that it would try to convert a model with a different tensor size. Any thoughts on this?

silent-code commented 3 years ago

Try deleting the labels.txt file created for training in the jetson-inference/python/training/detection/ssd directory. The onnx_export script will look in the models/your-model folder for the correct labels.txt file containing the BACKGROUND class listing.

07hokage commented 3 years ago

@silent-code Thank you very much. That worked !!!!!

silent-code commented 3 years ago

Glad to help!

Also remember, if you retrain afterward with more or different data for the same model, delete or rename the existing 'mb2-ssd-lite.onnx' and the ONNX engine file 'mb2-ssd-lite.onnx.1.1.7100.GPU.FP16.engine' in the models/your-model directory before the onnx_export step. This will force detectnet to recompile the model at runtime with the new retrained weights.
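The cleanup described above can be scripted so it is not forgotten between retraining runs. This is a hedged sketch: `remove_stale_exports` is a hypothetical helper, and the `*.onnx` / `*.engine` patterns are an assumption based on the two filenames quoted in the comment. It deletes files, so renaming by hand may be the safer choice if you want to keep older exports.

```python
from pathlib import Path


def remove_stale_exports(model_dir, patterns=("*.onnx", "*.engine")):
    """Delete exported model files in model_dir and return the names that were removed."""
    removed = []
    for pattern in patterns:
        for path in sorted(Path(model_dir).glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return removed
```

Checkpoints (.pth) and labels.txt are left alone because they do not match either pattern.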

07hokage commented 3 years ago

Sure. Will keep that in mind.

dusty-nv commented 3 years ago

Can you try setting IMAGES to this instead:

IMAGES=/home/jetson/jetson-inference/data/images

If there is still an error, please post the terminal log here.

07hokage commented 3 years ago

@dusty-nv, I'm having a problem running inference on a camera stream. I'm running this command:

python3 detectnet.py --model=models/mymodel/ssd-mobilenet.onnx --input-blob=input_0 --output-cvg=scores --output-bbox=boxes csi://0

Sometimes the script just ends with Segmentation fault (core dumped) right at this line. Here is the console log: log.txt. Other times the stream opens, and as soon as any detection is found, it stops with a segmentation fault.

Inference runs well when an image or a set of images is given. Is this related to the image format coming from the camera? I tried a USB camera too, but it exhibits the same behaviour as the CSI camera. Is there any way to resolve this?

mjack3 commented 3 years ago

@07hokage i have the same issue. Did you solve it? Thanks

07hokage commented 3 years ago

@mjack3 is it regarding the inference on the live video stream ?

goodmorningcoffee commented 3 years ago

hi! I get the same error. after following help from here, I still have issues.

I'm using a custom data set, and had same issue as OP where my directory was not proper format.

I followed @dusty-nv fix, made proper directories, added .txt files.

however, I cannot figure out how to populate the labels.txt with the list of ImageIDs ?? I'm super noob :\
is there a command from terminal I can run to do this?

once done, I copy and paste this list into the other .txt docs in ImageSets/Main ?

thank you!!!


*I used RectLabel to create my data set, not LabelImg

07hokage commented 3 years ago

labels.txt is supposed to contain the names of the classes that you are training the model to recognise, not the image IDs.

goodmorningcoffee commented 3 years ago

yes, you're right.

I put my classes in labels.txt.

however, aren't I supposed to create a list of the ImageIDs and copy that to the .txt files in ImageSets/Main ? Right now, my test.txt etc. are empty.

I get this error:

user@jetson: /Downloads/jetson-inference/python/training/detection/ssd$ python3 train_ssd.py --dataset-type=voc --data=/Downloads/jetson-inference/python/training/detection/ssd/data/shapes --model-dir=models/shapes

2021-07-21 21:33:07 - Using CUDA...
2021-07-21 21:33:07 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/', dataset_type='voc', datasets=['/Downloads/jetson-inference/python/training/detection/ssd/data/shapes'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=30, num_workers=2, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2021-07-21 21:33:07 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 214, in <module>
    target_transform=target_transform)
  File "/home/avocado/Downloads/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 33, in __init__
    raise IOError("missing ImageSet file {:s}".format(image_sets_file))
TypeError: unsupported format string passed to PosixPath.__format__

dusty-nv commented 3 years ago

@goodmorningcoffee yes you need a file containing the list of imageIDs under ImageSets/Main

As a shortcut, you can create just ImageSets/Main/default.txt with all the imageIDs, and this will be used for train/test/etc.

If your image filenames / IDs are consistent, you can probably make a bash script that creates the imageIDs files for you. Typically the imageIDs are the image filenames without the file extension.

When in doubt, download the original Pascal VOC dataset and inspect its structure and how it's laid out.

camilofernandez9405 commented 1 year ago

@dusty-nv hi brother, how are you? I'm trying to run the training and I get this error, and I don't know why. Could you guide me on how to resolve it? [Screenshot from 2022-12-15 16-41-22]

dusty-nv commented 1 year ago

Can you try running export OPENBLAS_CORETYPE=ARMV8 first?

or change the numpy version

camilofernandez9405 commented 1 year ago

@dusty-nv I don't understand where I should run this line of code: export OPENBLAS_CORETYPE=ARMV8

dusty-nv commented 1 year ago

Hi @camilofernandez9405, you should run export OPENBLAS_CORETYPE=ARMV8 in your terminal before you run your python3 train_ssd.py command
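If the training is launched from a Python wrapper instead of directly from the terminal, the same variable can be passed through the child process environment. This is only a sketch of that idea: `training_env` is a hypothetical helper, and running `export OPENBLAS_CORETYPE=ARMV8` in the shell as described above remains the simpler route.

```python
import os


def training_env(base_env=None):
    """Return a copy of the environment with OPENBLAS_CORETYPE set to ARMV8."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_CORETYPE"] = "ARMV8"
    return env


# Usage (not executed here); append your usual train_ssd.py arguments to the list:
#   import subprocess
#   subprocess.run(["python3", "train_ssd.py"], env=training_env(), check=True)
```

The helper copies the environment instead of modifying `os.environ`, so the setting only affects the launched process.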

camilofernandez9405 commented 1 year ago

@dusty-nv I was able to fix that error, but now I have this new error and I don't know what it could be. [Screenshot from 2022-12-16 09-35-22]

dusty-nv commented 1 year ago

The path to your dataset is incorrect, or your dataset is missing the image list files under ImageSets/Main.

camilofernandez9405 commented 1 year ago

[Screenshot from 2022-12-16 09-35-22] @dusty-nv I don't understand, since I downloaded the images with this command: python3 open_images_downloader.py --max-images=5000 --class-name "insect" --data=data/prueba. I'm attaching an image of the files that get downloaded, and of where I then run the model and it throws the error. [Screenshot from 2022-12-16 10-06-57]

PARPedraza commented 1 year ago

Hi, can you help me please? I ran jetson-train-main: !python train_ssd.py --dataset-type=voc --data=data --model-dir=data --batch-size=32 --workers=2 --epochs=100

The bounding boxes are in Pascal VOC format, or all coordinates are in their fractional form.

And I have this error:

image

dusty-nv commented 1 year ago

@PARPedraza it appears that somehow it is loading the labels.txt that was already exported from train_ssd.py, not your original labels.txt from the dataset:

VOC Labels read from file: (`BACKGROUND`, `0`, `1`, ...etc)

The BACKGROUND should not be in the labels.txt that is in your dataset's folder. That BACKGROUND class gets added by train_ssd.py and saved along with your model's folder. It should not be in the labels.txt that gets loaded by train_ssd.py

PARPedraza commented 1 year ago

I deleted the labels.txt files that were exported in other trainings, but the same error persists and I can't find the solution.

I changed the batch size, and the process now lets me get through at least one epoch.

image

dusty-nv commented 1 year ago

Is your --model-dir the same as your --data folder? It should be in a different folder. The labels.txt still shows multiple BACKGROUND classes in it. Your original data/labels.txt file shouldn't have BACKGROUND in it.
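The two conditions discussed above (separate --data and --model-dir folders, and no BACKGROUND entry in the dataset's labels.txt) can be checked before starting a run. The sketch below is an assumption-laden illustration: `check_training_inputs` is a hypothetical helper, and it assumes labels.txt holds one class name per line.

```python
from pathlib import Path


def check_training_inputs(data_dir, model_dir):
    """Return a list of problems found with the dataset and model folders."""
    problems = []
    data = Path(data_dir).resolve()
    model = Path(model_dir).resolve()
    if data == model:
        problems.append("--model-dir and --data point to the same folder")
    labels_file = data / "labels.txt"
    if not labels_file.is_file():
        problems.append("labels.txt not found in the dataset folder")
    else:
        # Assumes one class name per line; blank lines are ignored.
        labels = [ln.strip() for ln in labels_file.read_text().splitlines() if ln.strip()]
        if "BACKGROUND" in labels:
            problems.append("labels.txt in the dataset folder lists BACKGROUND")
    return problems
```

An empty list means neither problem was detected. For the command quoted earlier (--data=data --model-dir=data), the first check would fire.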

yuwanzi123 commented 6 months ago

Hi Dusty, thanks for your reply! I'm using CVAT to annotate the images, and I exported them in VOC format. I do see a "default.txt" file under /ImageSets/Main, but I still have this issue. Here is the log:

wzy@wzy-desktop:~/jetson-inference/python/training/detection/ssd$ python3 train_ssd.py --dataset-type=voc --data=data/food_bin1 --model-dir=models/food_bin --batch-size=1 --num-workers=1 --num-epochs=1
2024-04-24 15:00:51 - Using CUDA...
2024-04-24 15:00:51 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=1, checkpoint_folder='models/food_bin', dataset_type='voc', datasets=['data/food_bin1'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, log_level='info', lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resolution=300, resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, validation_mean_ap=False, weight_decay=0.0005)
2024-04-24 15:02:04 - model resolution 300x300
2024-04-24 15:02:04 - SSDSpec(feature_map_size=19, shrinkage=16, box_sizes=SSDBoxSizes(min=60, max=105), aspect_ratios=[2, 3])
2024-04-24 15:02:04 - SSDSpec(feature_map_size=10, shrinkage=32, box_sizes=SSDBoxSizes(min=105, max=150), aspect_ratios=[2, 3])
2024-04-24 15:02:04 - SSDSpec(feature_map_size=5, shrinkage=64, box_sizes=SSDBoxSizes(min=150, max=195), aspect_ratios=[2, 3])
2024-04-24 15:02:04 - SSDSpec(feature_map_size=3, shrinkage=100, box_sizes=SSDBoxSizes(min=195, max=240), aspect_ratios=[2, 3])
2024-04-24 15:02:04 - SSDSpec(feature_map_size=2, shrinkage=150, box_sizes=SSDBoxSizes(min=240, max=285), aspect_ratios=[2, 3])
2024-04-24 15:02:04 - SSDSpec(feature_map_size=1, shrinkage=300, box_sizes=SSDBoxSizes(min=285, max=330), aspect_ratios=[2, 3])
2024-04-24 15:02:04 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 263, in <module>
    target_transform=target_transform)
  File "/home/wzy/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 44, in __init__
    raise IOError(f"missing ImageSet file {image_sets_file}")
OSError: missing ImageSet file data/food_bin1/ImageSets/Main/trainval.txt

Then I manually created four .txt files (test.txt, train.txt, trainval.txt, val.txt) and copied the image IDs from default.txt into all four new files. But I still got the same error. Can you please help me out? Thank you so much in advance!

EDIT: I found my mistake: the path to the data was wrong. There is another directory between food_bin1 and ImageSets. But I still have a question. Can I manually create the four .txt files? If yes, should I copy all 65 image IDs to each of those four .txt files, or do I have to split them, maybe 20+20+20+5, across the different .txt files? Thanks!

dusty-nv commented 6 months ago

Hi @yuwanzi123, glad you got it working - yes, you can create the 4 separate files and split the dataset between them (well, except that trainval.txt is just train+val splits). Normally the split is like 70% train, 15% val, 15% test. However your dataset is very small so I probably wouldn't split it up, until you collect more data.
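For a larger dataset, the 70/15/15 split described above could be produced along these lines. This is a sketch under stated assumptions: `split_image_ids` is a hypothetical helper, the fixed seed is only there to make the shuffle repeatable, and fractional counts are rounded down with the remainder going to the test split.

```python
import random


def split_image_ids(image_ids, train=0.70, val=0.15, seed=0):
    """Shuffle the IDs and return train/val/test lists plus trainval (train + val)."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    splits = {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        # Whatever is left after train and val becomes the test split.
        "test": ids[n_train + n_val:],
    }
    # trainval.txt is just the train and val splits combined.
    splits["trainval"] = splits["train"] + splits["val"]
    return splits
```

Each list can then be written, one ID per line, to the matching file under ImageSets/Main. With 65 image IDs this gives 45 train, 9 val, and 11 test.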

yuwanzi123 commented 6 months ago

Got it, thank you so much Dusty!