Cannot Train Custom Dataset

wy3406 commented 4 years ago

Hi, I have prepared the images and labels and placed them separately in ./data/custom/images and ./data/custom/label. I see you have the arg.rect parameter, but I don't know how to generate the corresponding one.

Lornatang commented 4 years ago

@wy3406 Can you show me your tree structure in the 'data' directory? Note: the rect parameter is used to train at the same width and height, rather than according to the actual width-to-height ratio of the object, which speeds up training considerably.

wy3406 commented 4 years ago

@Lornatang

custom/
├── classes.names
├── images
│   ├── 1582254136.1922057.jpg
│   ├── 1582254137.4090116.jpg
│   ├── 1582254138.5632827.jpg
│   ├── 1582254139.6776469.jpg
│   ├── 1582254140.9018676.jpg
├── labels
│   ├── 1582254136.1922057.txt
│   ├── 1582254137.4090116.txt
│   ├── 1582254138.5632827.txt
│   ├── 1582254139.6776469.txt
├── train.txt
└── valid.txt

Lornatang commented 4 years ago

@wy3406 The simplest way to run it is as follows. If there is an error, please post your error message.

Please run:

python3 train.py --cfg cfg/yolov3-custom.cfg --data cfg/custom.data --weights ""

wy3406 commented 4 years ago

I found the error: I had written the label as 'label_idx, x_center, y_center, width height'. But now I get another error. Here is the message I got:

Namespace(accumulate=4, arch='default', batch_size=16, cache_images=False, cfg='cfg/yolov3-custom.cfg', data='cfg/voc2007_hand.data', device='', epochs=100, evolve=False, image_size=[416], multi_scale=True, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='', workers=4)
Using CUDA

device:0 (name=TITAN RTX, total_memory=24190MB)
device:1 (name=TITAN RTX, total_memory=24190MB)
device:2 (name=TITAN RTX, total_memory=24190MB)
device:3 (name=TITAN RTX, total_memory=24190MB)

2020-03-13 07:32:44.682786: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 07:32:44.682883: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 07:32:44.682897: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using multi-scale 288 - 640
Pre training model weight not loaded
Caching labels (4644 found, 0 missing, 0 empty, 0 duplicate, for 4644 images): 100%|██████████| 4644/4644 [00:00<00:00, 9139.88it/s]
Caching labels (463 found, 0 missing, 0 empty, 0 duplicate, for 463 images): 100%|██████████| 463/463 [00:00<00:00, 9802.05it/s]
Model Summary: 222 layers, 6.18882e+07 parameters, 6.18882e+07 gradients
Using 4 dataloader workers.
Starting training for 100 epochs...

 Epoch    memory      GIoU       obj       cls     total   targets image_size

0%| | 0/291 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 435, in <module>
    train()
  File "train.py", line 276, in train
    output = model(images)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 387, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 408, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/yolov3/YOLOv3-PyTorch/models.py", line 295, in forward
    yolo_out.append(module(x, img_size, out))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/yolov3/YOLOv3-PyTorch/models.py", line 220, in forward
    p = p.view(bs, self.na, self.no, self.ny, self.nx).permute(0, 1, 3, 4, 2).contiguous()  # prediction
RuntimeError: shape '[4, 3, 6, 20, 20]' is invalid for input of size 408000

Lornatang commented 4 years ago

@wy3406 You are receiving these errors because your custom cfg file is not correctly formatted.

In yolov3-custom.cfg, the filters value of the convolutional layer before each [yolo] layer and the classes value need to be modified. As far as I know, there are four places that need to be modified, all at the end of the configuration file.
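The shape error in the traceback above can be checked by hand. The following is a minimal sketch, assuming the usual YOLOv3 layout in which each detection head predicts `na` anchors with `no` values each; the names `bs`, `na`, `no`, `ny`, `nx` follow the `p.view(...)` line in the traceback, and the helper `channels_from_error` is made up for illustration.

```python
def channels_from_error(numel, bs, ny, nx):
    """Recover the channel count of the tensor that failed to reshape."""
    per_channel = bs * ny * nx
    if numel % per_channel:
        raise ValueError("element count is not a multiple of bs * ny * nx")
    return numel // per_channel


# Values copied from:
#   shape '[4, 3, 6, 20, 20]' is invalid for input of size 408000
bs, na, no, ny, nx = 4, 3, 6, 20, 20

expected_channels = na * no                                  # what the [yolo] layer expects
actual_channels = channels_from_error(408000, bs, ny, nx)    # what the conv layer produced

print(expected_channels, actual_channels)
```

Under these assumptions the [yolo] layer expects 18 channels while the preceding convolution still emits 255, which is consistent with the explanation that the filters value was not updated for the custom class count.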

Lornatang commented 4 years ago

@wy3406 Pull the latest code. I have solved this problem.

wy3406 commented 4 years ago

First, I have confirmed that the height and width of all images are greater than 608.

wy3406 commented 4 years ago

@Lornatang Another error. Here is the message I got:

  File "/yolov3/YOLOv3-PyTorch/models.py", line 254, in __init__
    self.module_defs = parse_model_cfg(cfg)
  File "/yolov3/YOLOv3-PyTorch/utils/parse_config.py", line 62, in parse_model_cfg
    assert not any(u), "Unsupported fields %s in %s. See https://github.com/ultralytics/yolov3/issues/631" % (u, path)
AssertionError: Unsupported fields ['batch', 'subdivisions', 'width', 'height', 'channels', 'momentum', 'decay', 'angle', 'saturation', 'exposure', 'hue', 'learning_rate', 'burn_in', 'max_batches', 'policy', 'steps', 'scales'] in cfg/yolov3-custom.cfg. See https://github.com/ultralytics/yolov3/issues/631

Lornatang commented 4 years ago

@wy3406 Please send me your cfg file. Also, please run the following command again in the cfg directory:

bash create_model.sh <num-classes>

wy3406 commented 4 years ago

@Lornatang I generated the cfg with 'bash create_model.sh 1'.

Lornatang commented 4 years ago

Make sure that the lines in your configuration file match the following. This block appears three times, all toward the end of the file.

[convolutional]
size = 1
stride = 1
pad = 1
filters = 18
activation = linear

[yolo]
mask = 0,1,2
anchors = 10,13,  16,30,  33,23,  30,61,  62,45,  59,119,  116,90,  156,198,  373,326
classes = 1
num = 9
jitter = .3
ignore_thresh = .7
truth_thresh = 1
random = 1
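The check described above can also be automated. This is only a sketch, not part of the repository: it assumes the cfg uses the darknet-style `[section]` / `key = value` layout shown here, that each head predicts `classes + 5` values per anchor listed in `mask`, and that the layer directly before each [yolo] block is the [convolutional] block whose filters must match. The function names `parse_cfg` and `check_yolo_heads` are hypothetical.

```python
def parse_cfg(text):
    """Parse darknet-style cfg text into a list of (section_name, options) pairs."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            blocks.append((line[1:-1].strip(), {}))
        elif "=" in line and blocks:
            key, value = line.split("=", 1)
            blocks[-1][1][key.strip()] = value.strip()
    return blocks


def check_yolo_heads(blocks):
    """Return one message per [yolo] block whose preceding conv has the wrong filters."""
    problems = []
    for i, (name, opts) in enumerate(blocks):
        if name != "yolo":
            continue
        n_anchors = len(opts["mask"].split(","))
        expected = n_anchors * (int(opts["classes"]) + 5)
        prev_name, prev_opts = blocks[i - 1] if i > 0 else ("", {})
        actual = int(prev_opts.get("filters", -1)) if prev_name == "convolutional" else -1
        if actual != expected:
            problems.append(f"block {i}: filters={actual}, expected {expected}")
    return problems


# Shortened copy of the block above, used only to exercise the checker.
SAMPLE = """
[convolutional]
size = 1
filters = 18
activation = linear

[yolo]
mask = 0,1,2
classes = 1
num = 9
"""

print(check_yolo_heads(parse_cfg(SAMPLE)))
print(check_yolo_heads(parse_cfg(SAMPLE.replace("18", "255"))))
```

To run it against a real file, pass the file's text instead of `SAMPLE`, for example the contents of cfg/yolov3-custom.cfg read with `open(...).read()`.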

wy3406 commented 4 years ago

Thank you, solved the problem

wy3406 commented 4 years ago

@Lornatang I get an error in the evaluation step:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

Lornatang commented 4 years ago

@wy3406 What is your evaluation command?

wy3406 commented 4 years ago

@Lornatang

Namespace(accumulate=4, arch='default', batch_size=32, cache_images=False, cfg='cfg/yolov3-custom.cfg', data='cfg/voc2007_hand.data', device='', epochs=200, evolve=False, image_size=[416], multi_scale=True, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='./weights/yolov3.weights', workers=8)
Using CUDA

device:0 (name=TITAN RTX, total_memory=24190MB)
device:1 (name=TITAN RTX, total_memory=24190MB)

2020-03-13 09:16:49.009174: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 09:16:49.009255: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 09:16:49.009265: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using multi-scale 288 - 640
Caching labels (4644 found, 0 missing, 0 empty, 0 duplicate, for 4644 images): 100%|██████████| 4644/4644 [00:00<00:00, 10068.86it/s]
Caching labels (463 found, 0 missing, 0 empty, 0 duplicate, for 463 images): 100%|██████████| 463/463 [00:00<00:00, 10145.94it/s]
Model Summary: 222 layers, 6.17667e+07 parameters, 6.17667e+07 gradients
Using 8 dataloader workers.
Starting training for 200 epochs...

 Epoch    memory      GIoU       obj       cls     total   targets image_size
 0/199    20.13G      6.07      9.95         0        16        13       608: 100%|███████████████████████████████████████████████████████████████████████████████████| 146/146 [04:23<00:00,  1.81s/it]
           Class    Images   Targets         P         R   mAP@0.5        F1:   0%|                                                                                               | 0/8 [00:04<?, ?it/s]

Traceback (most recent call last):
  File "train.py", line 425, in <module>
    train()
  File "train.py", line 321, in train
    dataloader=valid_dataloader)
  File "/yolov3/YOLOv3-PyTorch/YOLOv3-PyTorch/test.py", line 114, in evaluate
    inf_out, train_out = model(imgs)  # inference and training outputs
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 387, in forward
    outputs = self.parallel_apply(self._module_copies[:len(inputs)], inputs, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 408, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 494, in __call__
    result = self.forward(*input, **kwargs)
  File "/yolov3/YOLOv3-PyTorch/YOLOv3-PyTorch/models.py", line 308, in forward
    return torch.cat(io, 1), p
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71

Lornatang commented 4 years ago

@wy3406

The dimensions of your input image data are not all the same. For example, suppose there are 100 groups of training data, of which 99 are 256 × 256 but one is 384 × 384; this will cause an error in the dimension check.

Another possible cause is a hidden batch size problem. The training dimensions are checked against the dimensions of each batch. For example, if you have 1000 groups of data (assume each group is a three-channel 256 × 256 px image) and the batch size is 4, then a tensor of shape (4, 3, 256, 256) is extracted for each training step, and exactly 250 iterations cover the data (250 × 4 = 1000). But if you have 999 groups of data and keep a batch size of 4, 999 is not divisible by 4. The first 249 batches have shape (4, 3, 256, 256), but the last batch has shape (3, 3, 256, 256). The check sees that (4, 3, 256, 256) != (3, 3, 256, 256), and because the dimensions do not match, it reports an error. This can be called a small bug.

My suggestion: check the configuration of the fully connected layer in your yolov3 configuration file.
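The batch arithmetic in this comment can be sketched in plain Python. The helper `batch_sizes` is illustrative only; its `drop_last` flag is modelled on the option of the same name on PyTorch's `DataLoader`, which discards an incomplete final batch.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """List the size of every batch produced in one pass over the data."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)   # final, smaller batch
    return sizes


even = batch_sizes(1000, 4)       # divides evenly
uneven = batch_sizes(999, 4)      # leaves a short final batch

print(len(even), even[-1])
print(len(uneven), uneven[-1])
print(len(batch_sizes(999, 4, drop_last=True)))
```

With 999 samples the last batch holds 3 items instead of 4, so any code that hard-codes the batch dimension would see a shape mismatch on that final step only.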

Lornatang commented 4 years ago

@wy3406 Pull the latest code. I recommend --weights "" or --weights weights/model_best.pth; do not run with --weights weights/yolov3.weights.

Lornatang commented 4 years ago

@wy3406 Is the dataset you are using an Oxford gesture recognition dataset?

wy3406 commented 4 years ago

@Lornatang No, it is a custom dataset. The reason for "RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71" is an error in the cfg file: 6 means xywh + 2 categories (the custom dataset), and 85 means xywh + 81 categories (the COCO dataset).

Lornatang commented 4 years ago

Shouldn't filters = 3 * (num_classes + 4 + 1), where 4 is xywh? Why did you calculate 6 instead of 3 * (2 + 4 + 1) = 21?

wy3406 commented 4 years ago

@Lornatang Yes, you are right. The filters setting in the cfg file is 255. Two categories means that one of them is the background.
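The numbers in this exchange can be reproduced with a small calculation. This is a sketch assuming the standard YOLOv3 head, where each anchor predicts 4 box values (xywh), 1 objectness score, and one score per class, and each head uses 3 anchors; the function names are illustrative.

```python
BOX_VALUES = 4          # x, y, w, h
OBJECTNESS = 1          # one confidence score per anchor
ANCHORS_PER_HEAD = 3    # length of the mask list in each [yolo] block


def outputs_per_anchor(num_classes):
    """Values predicted for a single anchor."""
    return BOX_VALUES + OBJECTNESS + num_classes


def filters_for(num_classes):
    """filters value for the convolution feeding a [yolo] block."""
    return ANCHORS_PER_HEAD * outputs_per_anchor(num_classes)


def classes_from_outputs(outputs):
    """Invert outputs_per_anchor to recover the class count."""
    return outputs - BOX_VALUES - OBJECTNESS


for n in (1, 2, 80):
    print(n, outputs_per_anchor(n), filters_for(n))

print(classes_from_outputs(6), classes_from_outputs(85))
```

Under this assumption, the 6 and 85 in the error message would correspond to a 1-class head and an 80-class head, and the extra value counted as a background category may be the objectness score. A filters value of 255 matches the 80-class case, while classes = 1 calls for 18.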

Lornatang commented 4 years ago

@wy3406 You are welcome to give me more suggestions. Thanks

Lornatang / YOLOv3-PyTorch

Cannot Train Custom Dataset #1