Lornatang / YOLOv3-PyTorch

PyTorch implementation of YOLOv3. Good performance, easy to use, fast.
Apache License 2.0

Cannot Train Custom Dataset #1

Closed wy3406 closed 4 years ago

wy3406 commented 4 years ago

Hi, I have prepared the images and labels and placed them separately in ./data/custom/images and ./data/custom/labels. I see you have the args.rect parameter, and I don't know how to generate the related files.

Lornatang commented 4 years ago

@wy3406 Can you show me your tree structure in the 'data' directory? Note: the rect parameter trains with images of the same width and height, rather than at each object's actual aspect ratio, which greatly improves training speed.

wy3406 commented 4 years ago

@Lornatang

custom/
├── classes.names
├── images
│   ├── 1582254136.1922057.jpg
│   ├── 1582254137.4090116.jpg
│   ├── 1582254138.5632827.jpg
│   ├── 1582254139.6776469.jpg
│   ├── 1582254140.9018676.jpg
├── labels
│   ├── 1582254136.1922057.txt
│   ├── 1582254137.4090116.txt
│   ├── 1582254138.5632827.txt
│   ├── 1582254139.6776469.txt
├── train.txt
└── valid.txt

Lornatang commented 4 years ago

@wy3406 The simplest way to run it is as follows. If there is an error, please post your error message.

Please run:

python3 train.py --cfg cfg/yolov3-custom.cfg --data cfg/custom.data --weights ""
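
For reference, the file passed to --data in this family of YOLOv3 trainers is usually a small Darknet-style *.data file pointing at the class count, the train/valid image lists, and the names file. A minimal sketch (the paths below are assumptions based on your tree, adjust as needed):

classes=1
train=data/custom/train.txt
valid=data/custom/valid.txt
names=data/custom/classes.names

Each labels/*.txt file normally contains one line per object in the form 'class x_center y_center width height', space-separated and normalized to 0-1 (again an assumption based on the standard Darknet/YOLO label layout).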
wy3406 commented 4 years ago

I found the error. I had written the labels as 'label_idx,x_center,y_center,width height'. But I get another error. Here is the message I got:

Namespace(accumulate=4, arch='default', batch_size=16, cache_images=False, cfg='cfg/yolov3-custom.cfg', data='cfg/voc2007_hand.data', device='', epochs=100, evolve=False, image_size=[416], multi_scale=True, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='', workers=4)
Using CUDA

2020-03-13 07:32:44.682786: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 07:32:44.682883: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 07:32:44.682897: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using multi-scale 288 - 640
Pre training model weight not loaded
Caching labels (4644 found, 0 missing, 0 empty, 0 duplicate, for 4644 images): 100%|██████████| 4644/4644 [00:00<00:00, 9139.88it/s]
Caching labels (463 found, 0 missing, 0 empty, 0 duplicate, for 463 images): 100%|██████████| 463/463 [00:00<00:00, 9802.05it/s]
Model Summary: 222 layers, 6.18882e+07 parameters, 6.18882e+07 gradients
Using 4 dataloader workers.
Starting training for 100 epochs...

 Epoch    memory      GIoU       obj       cls     total   targets image_size

0%| | 0/291 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 435, in <module>
    train()
  File "train.py", line 276, in train
    output = model(images)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 387, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 408, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/yolov3/YOLOv3-PyTorch/models.py", line 295, in forward
    yolo_out.append(module(x, img_size, out))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/yolov3/YOLOv3-PyTorch/models.py", line 220, in forward
    p = p.view(bs, self.na, self.no, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction
RuntimeError: shape '[4, 3, 6, 20, 20]' is invalid for input of size 408000

Lornatang commented 4 years ago

@wy3406 you are receiving these errors because your custom cfg files are not correctly formatted.

In yolov3-custom.cfg, the filters and classes values before each [yolo] layer need to be modified. As far as I know, there are four places that need to be modified, all at the end of the configuration file.
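
For reference, the usual rule for those layers is filters = 3 * (num_classes + 5), i.e. 3 anchors per [yolo] layer times (4 box coordinates + 1 objectness score + num_classes class scores). With a single class that gives 3 * (1 + 5) = 18, and with COCO's 80 classes it gives 3 * (80 + 5) = 255.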

Lornatang commented 4 years ago

@wy3406 Pull the latest code. I have solved this problem.

wy3406 commented 4 years ago

First, I just confirmed that the height and width of all images are greater than 608.

wy3406 commented 4 years ago

@Lornatang Another error. Here is the message I got:

  File "/yolov3/YOLOv3-PyTorch/models.py", line 254, in __init__
    self.module_defs = parse_model_cfg(cfg)
  File "/yolov3/YOLOv3-PyTorch/utils/parse_config.py", line 62, in parse_model_cfg
    assert not any(u), "Unsupported fields %s in %s. See https://github.com/ultralytics/yolov3/issues/631" % (u, path)
AssertionError: Unsupported fields ['batch', 'subdivisions', 'width', 'height', 'channels', 'momentum', 'decay', 'angle', 'saturation', 'exposure', 'hue', 'learning_rate', 'burn_in', 'max_batches', 'policy', 'steps', 'scales'] in cfg/yolov3-custom.cfg. See https://github.com/ultralytics/yolov3/issues/631

Lornatang commented 4 years ago

@wy3406 Please send me your cfg file. Also, please run the following command again in the cfg directory:

bash create_model.sh <num-classes> 
wy3406 commented 4 years ago

@Lornatang The cfg was generated with 'bash create_model.sh 1'.

Lornatang commented 4 years ago

Make sure that the lines in your configuration file are the same as the following. This block appears three times, all at the end of the file.

[convolutional]
size = 1
stride = 1
pad = 1
filters = 18
activation = linear

[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes = 1
num = 9
jitter = .3
ignore_thresh = .7
truth_thresh = 1
random = 1
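
If it helps, here is a minimal sketch (not part of this repo; the filename check_cfg.py is just an example) of a script that scans a cfg file and checks that the convolutional layer before each [yolo] block has filters = 3 * (classes + 5):

import sys

def check_cfg(path):
    # Read the cfg, dropping blank lines and comments.
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip() and not line.startswith("#")]

    last_filters = None
    ok = True
    for i, line in enumerate(lines):
        if line.replace(" ", "").startswith("filters="):
            # Remember the most recent filters= value; the conv layer
            # directly precedes its [yolo] layer in the cfg.
            last_filters = int(line.split("=")[1])
        elif line == "[yolo]":
            # classes= appears a few lines into the [yolo] block.
            block = lines[i:i + 12]
            classes = next(int(l.split("=")[1]) for l in block if l.replace(" ", "").startswith("classes="))
            expected = 3 * (classes + 5)
            match = last_filters == expected
            ok = ok and match
            print("[yolo] classes=%d: filters=%s, expected=%d -> %s"
                  % (classes, last_filters, expected, "OK" if match else "MISMATCH"))
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_cfg(sys.argv[1]) else 1)

Run it as 'python3 check_cfg.py cfg/yolov3-custom.cfg'; every [yolo] line should report OK.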
wy3406 commented 4 years ago

Thank you, that solved the problem.

wy3406 commented 4 years ago

@Lornatang I get this error in the evaluation step: RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

Lornatang commented 4 years ago

@wy3406 What is your evaluation command?

wy3406 commented 4 years ago

@Lornatang

Namespace(accumulate=4, arch='default', batch_size=32, cache_images=False, cfg='cfg/yolov3-custom.cfg', data='cfg/voc2007_hand.data', device='', epochs=200, evolve=False, image_size=[416], multi_scale=True, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='./weights/yolov3.weights', workers=8)
Using CUDA

2020-03-13 09:16:49.009174: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 09:16:49.009255: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 09:16:49.009265: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using multi-scale 288 - 640
Caching labels (4644 found, 0 missing, 0 empty, 0 duplicate, for 4644 images): 100%|██████████| 4644/4644 [00:00<00:00, 10068.86it/s]
Caching labels (463 found, 0 missing, 0 empty, 0 duplicate, for 463 images): 100%|██████████| 463/463 [00:00<00:00, 10145.94it/s]
Model Summary: 222 layers, 6.17667e+07 parameters, 6.17667e+07 gradients
Using 8 dataloader workers.
Starting training for 200 epochs...

 Epoch    memory      GIoU       obj       cls     total   targets image_size
 0/199    20.13G      6.07      9.95         0        16        13       608: 100%|███████████████████████████████████████████████████████████████████████████████████| 146/146 [04:23<00:00,  1.81s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1:   0%|                                                                                               | 0/8 [00:04<?, ?it/s]

Traceback (most recent call last):
  File "train.py", line 425, in <module>
    train()
  File "train.py", line 321, in train
    dataloader=valid_dataloader)
  File "/yolov3/YOLOv3-PyTorch/YOLOv3-PyTorch/test.py", line 114, in evaluate
    inf_out, train_out = model(imgs)  # inference and training outputs
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 387, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 408, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/yolov3/YOLOv3-PyTorch/YOLOv3-PyTorch/models.py", line 308, in forward
    return torch.cat(io, 1), p
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

Lornatang commented 4 years ago

@wy3406

My suggestion: you should check the configuration of the final prediction layers in your yolov3 configuration file.

Lornatang commented 4 years ago

@wy3406 Pull the latest code. And I recommend --weights "" or --weights weights/model_best.pth; do not run with --weights weights/yolov3.weights.

Lornatang commented 4 years ago

@wy3406 Is the dataset you are using an Oxford gesture recognition dataset?

wy3406 commented 4 years ago

@Lornatang No, it is a custom dataset. The reason for "RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71" is a cfg file error: 6 means xywh + 2 categories (the custom dataset), and 85 means xywh + 81 categories (the COCO dataset).

Lornatang commented 4 years ago

> @Lornatang No, it is a custom dataset. The reason for "RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71" is a cfg file error: 6 means xywh + 2 categories (the custom dataset), and 85 means xywh + 81 categories (the COCO dataset).

Isn't filters = 3 * (num_classes + xywh + 1)? Why do you calculate 6? Shouldn't it be 3 * (2 + 4 + 1) = 21?

wy3406 commented 4 years ago

@Lornatang Yes, you are right. The filters setting in the cfg file was 255. Two categories means that one of them is the background.
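
For reference, the dimension mismatch above follows directly from the standard YOLOv3 head layout: each anchor predicts 4 box coordinates + 1 objectness score + num_classes class scores, so the per-anchor vector is 4 + 1 + 1 = 6 for a cfg with classes = 1 and 4 + 1 + 80 = 85 for the COCO-trained yolov3.weights, which is why the two output tensors could not be concatenated.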

Lornatang commented 4 years ago

@wy3406 You are welcome to give me more suggestions. Thanks