dusty-nv / jetson-inference

Hello AI World guide to deploying deep-learning inference networks and deep vision primitives with TensorRT and NVIDIA Jetson.
https://developer.nvidia.com/embedded/twodaystoademo
MIT License
7.89k stars 2.99k forks

Error training ssd-mobilenet from custom dataset #1370

Closed e-mily closed 1 year ago

e-mily commented 2 years ago

@dusty-nv I followed the tutorial and created train.txt, test.txt, val.txt and trainval.txt in ImageSets/Main. I even switched to having just default.txt in ImageSets/Main, and I'm still getting the following error. Can you help me?

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/total-5 --model=models/total-5 --batch-size=2 --workers=1 --epochs=1
2022-02-23 09:22:37 - Using CUDA...
2022-02-23 09:22:37 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=2, checkpoint_folder='models/total-5', dataset_type='voc', datasets=['data/total-5'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=1, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-02-23 09:22:37 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 214, in <module>
    target_transform=target_transform)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 33, in __init__
    raise IOError("missing ImageSet file {:s}".format(image_sets_file))
TypeError: unsupported format string passed to PosixPath.__format__
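
For anyone landing here with the same traceback: the TypeError appears to be raised by the error-reporting line itself. Formatting a pathlib path with the `{:s}` spec fails, so the intended IOError ("missing ImageSet file ...") never gets shown. In other words, the loader most likely could not find the split file under the dataset path that was passed in. The sketch below reproduces this in isolation; the path and the `describe_missing_file` helper are illustrative and not part of train_ssd.py.

```python
from pathlib import Path


def describe_missing_file(image_sets_file):
    """Build the 'missing file' message without the ':s' format spec.

    Hypothetical helper for illustration only.
    """
    # pathlib paths reject any non-empty format spec, so "{:s}" raises
    # TypeError. A plain "{}" (or str(path)) formats the path as expected.
    return "missing ImageSet file {}".format(image_sets_file)


# Illustrative path; substitute your own dataset directory.
path = Path("data/total-5/ImageSets/Main/trainval.txt")

try:
    "missing ImageSet file {:s}".format(path)
except TypeError as err:
    # This is the error that hides the real "missing file" problem.
    print("masked error:", err)

print(describe_missing_file(path))
```

So when this TypeError shows up, the first thing to check is whether the ImageSets/Main file the loader is looking for actually exists under the directory given to --data.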

dusty-nv commented 2 years ago

Hmm... hi @e-mily, can you share the output of ls /jetson-inference/python/training/detection/ssd/data/total-5/ImageSets/Main with me?

e-mily commented 2 years ago

2

e-mily commented 2 years ago

2

Sorry @dusty-nv, I am able to train now. The problem was that I had put my dataset in the wrong folder.

e-mily commented 2 years ago

I have other questions to ask:

  1. How do I do image augmentation using the tutorial?
  2. Is there a way to check for underfitting and overfitting?
  3. How can I change the number of layers being trained by detectnet?
  4. If I only have ImageSets/Main/default.txt, how does the code divide the dataset into train, test, and validation? Is there a certain percentage? (I was only able to train with ImageSets/Main/default.txt.)

dusty-nv commented 2 years ago

  1. How do I do image augmentation using the tutorial?

Image augmentation is already done automatically by the TrainAugmentation transforms: https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/ssd/data_preprocessing.py#L4

So if you want, you can add to them there.
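
As a rough illustration of what "adding to them" could look like, here is a minimal sketch of an extra augmentation step. It assumes the transforms in that file are callables that take and return `(image, boxes, labels)` with the image as a NumPy array in the 0-255 range. The class name `RandomGaussianNoise` and its parameters are hypothetical and do not exist in pytorch-ssd.

```python
import numpy as np


class RandomGaussianNoise:
    """Hypothetical augmentation: add pixel noise with probability p.

    Boxes and labels are passed through untouched, because noise does not
    move any object in the image.
    """

    def __init__(self, sigma=8.0, p=0.5, seed=None):
        self.sigma = sigma  # standard deviation of the noise, in pixel units
        self.p = p          # probability of applying the noise
        self.rng = np.random.default_rng(seed)

    def __call__(self, image, boxes=None, labels=None):
        if self.rng.random() < self.p:
            noise = self.rng.normal(0.0, self.sigma, size=image.shape)
            # Keep values inside the valid 0-255 pixel range.
            image = np.clip(image.astype(np.float32) + noise, 0.0, 255.0)
            image = image.astype(np.float32)
        return image, boxes, labels
```

If the assumption about the call signature holds, an instance of this class could be placed in the list of training transforms alongside the existing ones. Note that augmentation varies the images seen during training; it is not a substitute for collecting more labeled images.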

3. How can I change the number of layers being trained by detectnet?

You would need to change the SSD network definitions under https://github.com/dusty-nv/pytorch-ssd/tree/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/ssd (I have not attempted this)

4. If I only have ImageSets/Main/default.txt, how does the code divide the dataset into train, test, and validation? Is there a certain percentage? (I was only able to train with ImageSets/Main/default.txt.)

default.txt uses the same dataset for both train and test, so the data is not split. If you want it split, you should have separate trainval.txt and test.txt files under ImageSets/Main.
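
One way to produce those two files is a small script like the sketch below. It assumes a Pascal VOC layout with one XML file per image under Annotations/, and the 80/20 ratio is only an example value. The function name `write_voc_split` is made up for this illustration.

```python
import random
from pathlib import Path


def write_voc_split(dataset_dir, test_fraction=0.2, seed=0):
    """Write ImageSets/Main/trainval.txt and test.txt from Annotations/*.xml.

    Hypothetical helper; the split ratio is an assumption, not a rule.
    """
    dataset_dir = Path(dataset_dir)
    # One image ID per annotation file: Annotations/img_001.xml -> "img_001".
    image_ids = sorted(p.stem for p in (dataset_dir / "Annotations").glob("*.xml"))
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(image_ids)

    n_test = int(round(len(image_ids) * test_fraction))
    splits = {"test": image_ids[:n_test], "trainval": image_ids[n_test:]}

    out_dir = dataset_dir / "ImageSets" / "Main"
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, ids in splits.items():
        # Plain UTF-8 without a byte-order mark, one ID per line.
        with open(out_dir / (name + ".txt"), "w", encoding="utf-8", newline="\n") as f:
            f.write("\n".join(sorted(ids)) + "\n")
    return splits
```

For example, `write_voc_split("data/total-5")` would write both files under data/total-5/ImageSets/Main, with every image ID appearing in exactly one of them.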

e-mily commented 2 years ago

Thank you @dusty-nv. That was really helpful.

But then when I tried to put them into trainval.txt, test.txt, val.txt etc I received the error as stated above.

e-mily commented 2 years ago

When I tried to run the live stream after building the model, I realized my camera feed is flipped. Is there any way to flip it back? I'm using a Jetson TX2.

dusty-nv commented 2 years ago

But then when I tried to put them into trainval.txt, test.txt, val.txt etc I received the error as stated above.

So do you have the file: total-5/ImageSets/Main/trainval.txt and total-5/ImageSets/Main/test.txt ? Does your user have permissions to read them?

They are looked for in the code here: https://github.com/dusty-nv/pytorch-ssd/blob/3f9ba554e33260c8c493a927d7c4fdaa3f388e72/vision/datasets/voc_dataset.py#L22

When I tried to run the live stream after building the model, I realized my camera feed is flipped. Is there any way to flip it back? I'm using a Jetson TX2.

Yes, try running it with --input-flip=rotate-180

For more info, see here: https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md#input-options

e-mily commented 2 years ago

So do you have the file: total-5/ImageSets/Main/trainval.txt and total-5/ImageSets/Main/test.txt ? Does your user have permissions to read them?

I did, but it gives TypeError: unsupported format string passed to PosixPath.__format__
But if I change it to total-5/ImageSets/Main/default.txt then it works! How do I know if my user has permission to read them?
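
On the permissions question, a quick check from Python could look like the sketch below. It only reports whether each file exists and is readable by the user running the script. The function name `check_readable` and the default file names are assumptions for this example.

```python
import os
from pathlib import Path


def check_readable(dataset_dir, names=("trainval.txt", "test.txt")):
    """Report 'ok', 'missing' or 'not readable' for each ImageSets/Main file.

    Hypothetical helper for illustration.
    """
    main_dir = Path(dataset_dir) / "ImageSets" / "Main"
    report = {}
    for name in names:
        path = main_dir / name
        if not path.is_file():
            report[name] = "missing"
        elif not os.access(path, os.R_OK):
            # The file exists but the current user cannot read it.
            report[name] = "not readable"
        else:
            report[name] = "ok"
    return report
```

Calling `check_readable("data/total-5")` from the ssd directory returns a small dictionary such as `{"trainval.txt": "ok", "test.txt": "missing"}`, which separates a path problem from a permissions problem.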

e-mily commented 2 years ago

@dusty-nv I realized the models are write-protected. How do I remove that so that I can delete them? I want to change the parameters and train the model again.

Btw I was able to train with trainval.txt and val.txt! Thank you!

e-mily commented 2 years ago

I get this error when I try to train the same model with an increased number of epochs.

e-mily commented 2 years ago

If I decrease to workers=0 I still get the same error. I also tried to set up swap memory (I don't know if I did it correctly; I don't really understand what I'm looking at). I have an SD card attached to the Jetson TX2. Will it help?

dusty-nv commented 2 years ago

I realized the models are write-protected. How do I remove that so that I can delete them?

You can use a command like sudo chown -R <your-user> <path-to-model-dir>

If I decrease to workers=0 I still get the same error. I also tried to set up swap memory (I don't know if I did it correctly; I don't really understand what I'm looking at).

The "Killed" message you are getting normally means the board has run out of memory. I recommend running with --batch-size=1 and --workers=0 to decrease the memory usage. Also here are the instructions for mounting swap, disabling ZRAM, and disabling the desktop GUI:
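
Separately from those instructions, a small script can confirm how much memory and swap the board actually has before training starts. The sketch below parses text in the format of Linux's /proc/meminfo, where values are reported in kB. The helper names are made up for this example.

```python
def parse_meminfo(text):
    """Return /proc/meminfo-style values converted to MiB, keyed by field name."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0]) / 1024  # kB -> MiB
    return values


def memory_summary(text):
    """Pick out the fields that matter when a process gets killed for memory."""
    info = parse_meminfo(text)
    keys = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {key: info.get(key, 0.0) for key in keys}
```

On the Jetson this could be used as `memory_summary(open("/proc/meminfo").read())`. A SwapTotal of 0 would suggest that the swap file was not actually mounted.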

e-mily commented 2 years ago

Thank you @dusty-nv! I was training my model with an increasing number of epochs, and I found that the more epochs I train for, the worse it gets: when I test my model with test images, I don't see any bounding boxes at all. I don't see any confidence level displayed in the terminal either. What do I do?

e-mily commented 2 years ago

Like this one. I'm supposed to have 3 attributes but it can only detect 1. I don't know why the bounding box is so small.

detectnet --model=models/5-imagesa/ssd-mobilenet.onnx --labels=models/5-images/labels.txt --input-blob=input_0 --output-cvg=scores --output-bbox=boxes "/jetson-inference/data/imagess/traffic_*.jpeg" /jetson-inference/data/imagess/test2/traffic_%i.jpeg

This is the command I ran.

e-mily commented 2 years ago

It's either that or I'm not getting any results at all with increasing epochs.

dusty-nv commented 2 years ago

Can you try deleting the *.engine file from your model's folder and try running detectnet program again?

How many epochs did you train it for? Normally at least 30 is needed for good results. You can run the pytorch-ssd code on a Linux/Ubuntu PC for faster training (you will need to install PyTorch on it and such)

Also, you can use the run_ssd_example.py script to test one of your PyTorch .pth model checkpoints before it gets exported to ONNX. This will help you to confirm if the model is in fact trained to your liking first.
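
For the first suggestion above (deleting the cached *.engine file so that detectnet rebuilds it from the re-exported ONNX model), a cautious way to do it from Python is sketched below. The function name and the dry-run default are choices made for this example.

```python
from pathlib import Path


def remove_cached_engines(model_dir, dry_run=True):
    """List (and optionally delete) *.engine files in a model folder.

    Hypothetical helper. With dry_run=True nothing is deleted, so the
    result can be reviewed first.
    """
    matched = []
    for engine in sorted(Path(model_dir).glob("*.engine")):
        if not dry_run:
            engine.unlink()
        matched.append(engine.name)
    return matched
```

Running `remove_cached_engines("models/5-images")` first shows what would be removed. Passing `dry_run=False` then deletes those files and leaves the .onnx model and labels.txt in place.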

e-mily commented 2 years ago

Can you give me the full command to run run_ssd_example.py? I tried training from 5 epochs up to 50. It only shows accuracy for 5 epochs and 10 epochs. After that it seems like it couldn't detect anything, as it wasn't showing any accuracy figure.

dusty-nv commented 2 years ago

Can you give me the full command to run run_ssd_example.py?

python3 run_ssd_example.py mb1-ssd <path-to-pth-checkpoint> <path-to-labels.txt> <path-to-test-image>

e-mily commented 2 years ago

python3 run_ssd_example.py mb1-ssd

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 run_ssd_example.py mb1-ssd models/20-imagesa/mb1-ssd-Epoch-9-Loss-7.462369181893089.pth models/20-imagesa/labels.txt /jetson-inference/data/imagess/test/traffic_%i.jpeg

Traceback (most recent call last):
  File "run_ssd_example.py", line 50, in <module>
    image = cv2.cvtColor(orig_image, cv2.COLOR_BGR2RGB)
cv2.error: OpenCV(4.5.0) /opt/opencv/modules/imgproc/src/color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cvtColor'

I tried like that but i got this error...
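
A note for readers: the `!_src.empty()` assertion usually means OpenCV was handed an empty image, which is what happens when the image path cannot be loaded. In the command above, traffic_%i.jpeg is the output-numbering pattern from the earlier detectnet command, not the name of an existing file. A minimal sketch for turning a wildcard into real file paths, one command per image, is shown below. The helper names are hypothetical, and it assumes run_ssd_example.py accepts a single image path, as in the usage line given earlier.

```python
import glob
import os


def expand_test_images(pattern):
    """Return the existing files matching a shell-style wildcard pattern."""
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


def build_commands(pattern, checkpoint, labels, net="mb1-ssd"):
    """Build one run_ssd_example.py command (as an argument list) per image."""
    return [
        ["python3", "run_ssd_example.py", net, checkpoint, labels, image]
        for image in expand_test_images(pattern)
    ]
```

Each returned list can be passed to `subprocess.run`. An empty result means the pattern matched no files, which points to a path problem rather than a model problem.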

e-mily commented 2 years ago

So I guess the correct command is:

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 run_ssd_example.py mb1-ssd models/20-imagesa/mb1-ssd-Epoch-9-Loss-7.462369181893089.pth models/20-imagesa/labels.txt /jetson-inference/data/imagess/traffic_8.jpeg

Inference time: 2.8669397830963135
Found 0 objects. The output image is run_ssd_example_output.jpg

What do I do? I followed every step...

I'll try with more epochs. Just curious, shouldn't it be able to detect something even with very low accuracy?

e-mily commented 2 years ago

root@aititx22-desktop:/jetson-inference/python/training/detection/ssd# python3 run_ssd_example.py mb1-ssd models/20-imagesa/mb1-ssd-Epoch-99-Loss-4.31419215780316.pth models/20-imagesa/labels.txt /jetson-inference/data/imagess/traffic_8.jpeg

Inference time: 4.292574882507324
Found 0 objects. The output image is run_ssd_example_output.jpg

Still zero objects found after training for 100 epochs...

What did I do wrong?

dusty-nv commented 2 years ago

How many images are in your dataset? Are the objects easily discernible? Are they small? It seems like the objects you are training it on may be difficult for it to recognize.

e-mily commented 2 years ago

How many images are in your dataset? Are the objects easily discernible? Are they small? It seems like the objects you are training it on may be difficult for it to recognize.

I'm training on 20 images with 3 annotations. The objects are not small. I'm training on them from different distances. I'm aware you need at least 100 images per annotation to train, but I don't have that much data per annotation.

Is there a way to increase the dataset through image augmentation?

I want to analyze the accuracy with increasing images per annotation and increasing epochs, but I can't get any accuracy out...

dusty-nv commented 2 years ago

I'm training on 20 images with 3 annotations. The objects are not small. I'm training on them from different distances. I'm aware you need at least 100 images per annotation to train, but I don't have that much data per annotation.

OK yes, you are going to need more images in your dataset. What are your 3 object classes? If they are all road signs that you want to tell apart just by their different text, that may be more challenging for the DNN and you may need even more images in your dataset.

Is there a way to increase the dataset through image augmentation?

The train_ssd.py script is already doing image augmentation.

e-mily commented 2 years ago

I see. I'll try again with more images.

Instead of a camera stream or test images, can I use a video to test the accuracy of my model with detectnet?

If so, what is the command for that?

dusty-nv commented 2 years ago

Hi @e-mily, detectnet/detectnet.py doesn't have built-in accuracy measurement, because it has no knowledge of the ground-truth data. It is meant for inferencing only. It's the PyTorch side that has knowledge of the dataset and ground truth.
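
To make the point about ground truth concrete: judging a detection requires comparing it against a labeled box, and the usual building block for that is intersection-over-union (IoU). The sketch below is a generic illustration of that measure, not code from jetson-inference or pytorch-ssd. It assumes boxes are given as (x_min, y_min, x_max, y_max).

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A predicted box is commonly counted as correct when its IoU with a ground-truth box of the same class exceeds a chosen threshold such as 0.5. Without the annotation files there is nothing to compare against, which is why an inference-only tool cannot report accuracy.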

e-mily commented 2 years ago

Thank you @dusty-nv. I have another issue. I created a new dataset to increase the number of images and labels. When I try to run train_ssd.py it gives a TypeError: unsupported format string passed to PosixPath.__format__ error.

I re-attempted with the old dataset and it works! But I want to use the new dataset.

When I compare the old and new datasets they look the same to me, so I don't really know what the real issue is. What do you think?

e-mily commented 2 years ago

https://drive.google.com/drive/folders/1--DIZr1JPnETLCfGm6gnYrfAuQXxAdRn?usp=sharing

This is the link to my dataset. It would be a great help if you could check it out.

I tried using --debug-steps=1 and I also commented out that part of voc_dataset.py, but I'm not sure how to commit the change in the container.

e-mily commented 2 years ago

And also i still can't seem to divide them into trainval.txt and test.txt

dusty-nv commented 2 years ago

When I try to run train_ssd.py it gives a TypeError: unsupported format string passed to PosixPath.__format__ error.

Can you provide the full error/exception output from the console, so I can see where in the code it is happening?

dusty-nv commented 2 years ago

I tried using --debug-steps=1 and I also commented out that part of voc_dataset.py, but I'm not sure how to commit the change in the container.

You would want to edit this inside the container using the nano editor, or just run it without container by installing from source. Or I guess you could mount the jetson-inference/pytorch-ssd source code into the container, that would work too.

e-mily commented 2 years ago

Thank you @dusty-nv, it turns out the problem was in my dataset. I want to ask, how do I train different models?

dusty-nv commented 2 years ago

The ssd-mobilenet-v1 is the only network architecture from pytorch-ssd that I have tested & verified is working through the whole pipeline, including the ONNX export from PyTorch and import into TensorRT and runtime pre/post-processing with jetson-inference

chromaowl commented 2 years ago

to @dusty-nv I am at the same spot as the post that opened this thread; I have the line 214 error, and I checked my directory and I do have read and write permission on the 4 files in it. There were so many other issues listed that I am not sure what solved the problem. Can you tell me what I should try next?

chromaowl commented 2 years ago

to @dusty-nv - redid the entire process with a simpler set of objects; just 3 styles of batteries with 3 of each in many positions. When I run the train_ssd.py I still get stuck at line 214. I am sure I am missing something simple. Thanks, Stephen

dusty-nv commented 2 years ago

@chromaowl can you provide the terminal log of the error you are getting?

Are you sure you're providing the correct path to your dataset when you launch train_ssd.py?

chromaowl commented 2 years ago

root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd# python3 train_ssd.py --dataset-type=voc --data=data/batteries --model-dir=models/batteries --batch-size=4 --epochs=2 --workers=1
2022-07-21 15:54:03 - Using CUDA...
2022-07-21 15:54:03 - Namespace(balance_data=False, base_net=None, base_net_lr=0.001, batch_size=4, checkpoint_folder='models/batteries', dataset_type='voc', datasets=['data/batteries'], debug_steps=10, extra_layers_lr=None, freeze_base_net=False, freeze_net=False, gamma=0.1, lr=0.01, mb2_width_mult=1.0, milestones='80,100', momentum=0.9, net='mb1-ssd', num_epochs=2, num_workers=1, pretrained_ssd='models/mobilenet-v1-ssd-mp-0_675.pth', resume=None, scheduler='cosine', t_max=100, use_cuda=True, validation_epochs=1, weight_decay=0.0005)
2022-07-21 15:54:03 - Prepare training datasets.
Traceback (most recent call last):
  File "train_ssd.py", line 214, in <module>
    target_transform=target_transform)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 47, in __init__
    for line in infile:
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd#

chromaowl commented 2 years ago

This is the path to my data:
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd# cd data/batteries
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd/data/batteries# ls -l
total 16
drwxr-xr-x 2 root root 4096 Jul 20 21:02 Annotations
drwxr-xr-x 3 root root 4096 Jul 20 20:09 ImageSets
drwxr-xr-x 2 root root 4096 Jul 20 21:02 JPEGImages
-rw-rw-r-- 1 1000 1000 17 Jul 20 21:22 labels.txt
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd/data/batteries# ^C
root@VCEDbreadboard:/jetson-inference/python/training/detection/ssd/data/batteries#

sachaai commented 1 year ago

@dusty-nv @chromaowl I am also getting the ascii error. Could you please tell me how you fixed the issue?
2023-10-02 14:59:34 - Prepare training datasets.
warning - image 20231002-115317 has no box/labels annotations, ignoring from dataset
Traceback (most recent call last):
  File "train_ssd.py", line 263, in <module>
    target_transform=target_transform)
  File "/jetson-inference/python/training/detection/ssd/vision/datasets/voc_dataset.py", line 58, in __init__
    for line in infile:
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
root@linux:/jetson-inference/python/training/detection/ssd# python3 --version
Python 3.6.9
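
A possible explanation, offered as a guess rather than a confirmed fix from this thread: byte 0xef at position 0 is the first byte of a UTF-8 byte-order mark (EF BB BF), which some editors add when saving text files. If one of the dataset's text files (for example labels.txt or a file under ImageSets/Main) starts with that mark, Python 3.6 reading it with the ASCII codec would fail in exactly this way. The sketch below strips a leading BOM in place; the helper names are made up for this example.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 byte-order mark in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    path = Path(path)
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        path.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False


def clean_dataset_text_files(dataset_dir):
    """Strip BOMs from labels.txt and ImageSets/Main/*.txt; return changed files."""
    dataset_dir = Path(dataset_dir)
    candidates = sorted((dataset_dir / "ImageSets" / "Main").glob("*.txt"))
    candidates.append(dataset_dir / "labels.txt")
    return [str(p) for p in candidates if p.is_file() and strip_utf8_bom(p)]
```

For example, `clean_dataset_text_files("data/batteries")` returns the list of files it changed. If the list is empty, the files may contain other non-ASCII characters instead, and those would need to be found and removed separately.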