Closed: wy3406 closed this issue 4 years ago
@wy3406 Can you show me the tree structure of your 'data' directory? Note: the rect parameter is for training with images of the same width and height, rather than with the actual width-to-height ratio of each object, and it speeds up training considerably.
@Lornatang
custom/
├── classes.names
├── images
│   ├── 1582254136.1922057.jpg
│   ├── 1582254137.4090116.jpg
│   ├── 1582254138.5632827.jpg
│   ├── 1582254139.6776469.jpg
│   ├── 1582254140.9018676.jpg
├── labels
│   ├── 1582254136.1922057.txt
│   ├── 1582254137.4090116.txt
│   ├── 1582254138.5632827.txt
│   ├── 1582254139.6776469.txt
├── train.txt
└── valid.txt
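A layout like the one above relies on every image having a label file with the same base name. The following is a minimal sketch of that check, not code from the repository; `unpaired_images` is a hypothetical helper name. It works on plain lists of file names, and the sample names are copied from the listing as posted, which may simply be truncated.

```python
from pathlib import PurePosixPath


def unpaired_images(image_names, label_names):
    """Return image file names that have no label file with the same stem."""
    # .stem strips only the final suffix, so a name such as
    # "1582254136.1922057.jpg" keeps the dot inside its timestamp.
    label_stems = {PurePosixPath(name).stem for name in label_names}
    return sorted(
        name for name in image_names
        if PurePosixPath(name).stem not in label_stems
    )


if __name__ == "__main__":
    # File names copied from the directory listing in the thread.
    images = [
        "1582254136.1922057.jpg",
        "1582254137.4090116.jpg",
        "1582254138.5632827.jpg",
        "1582254139.6776469.jpg",
        "1582254140.9018676.jpg",
    ]
    labels = [
        "1582254136.1922057.txt",
        "1582254137.4090116.txt",
        "1582254138.5632827.txt",
        "1582254139.6776469.txt",
    ]
    print(unpaired_images(images, labels))
```

Splitting on the first dot instead of taking the stem would collapse all of these timestamped names to their integer part, which is why the sketch uses `PurePosixPath.stem`.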
@wy3406 The simplest way to run it is as follows. If there is an error, please post your error message.
Please run:
python3 train.py --cfg cfg/yolov3-custom.cfg --data cfg/custom.data --weights ""
I found the error: I had written each label as ‘label_idx,x_center,y_center,width height’. But now I get another error. Here is the message I got:
Namespace(accumulate=4, arch='default', batch_size=16, cache_images=False, cfg='cfg/yolov3-custom.cfg', data='cfg/voc2007_hand.data', device='', epochs=100, evolve=False, image_size=[416], multi_scale=True, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='', workers=4)
Using CUDA
device:0 (name=TITAN RTX, total_memory=24190MB)
device:1 (name=TITAN RTX, total_memory=24190MB)
device:2 (name=TITAN RTX, total_memory=24190MB)
device:3 (name=TITAN RTX, total_memory=24190MB)
2020-03-13 07:32:44.682786: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 07:32:44.682883: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 07:32:44.682897: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using multi-scale 288 - 640
Pre training model weight not loaded
Caching labels (4644 found, 0 missing, 0 empty, 0 duplicate, for 4644 images): 100%|██████████| 4644/4644 [00:00<00:00, 9139.88it/s]
Caching labels (463 found, 0 missing, 0 empty, 0 duplicate, for 463 images): 100%|██████████| 463/463 [00:00<00:00, 9802.05it/s]
Model Summary: 222 layers, 6.18882e+07 parameters, 6.18882e+07 gradients
Using 4 dataloader workers.
Starting training for 100 epochs...
Epoch memory GIoU obj cls total targets image_size
0%| | 0/291 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 435, in
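The label mistake mentioned at the start of this post can be caught before training. The sketch below assumes the darknet-style label convention, which the thread does not spell out: one object per line, five space-separated fields (class index, x_center, y_center, width, height), with the four coordinates normalized to the range 0 to 1. `check_label_line` is a hypothetical helper, not part of the repository.

```python
def check_label_line(line, num_classes):
    """Return a list of problems found in one label line (empty if none)."""
    problems = []
    if "," in line:
        problems.append("fields must be separated by spaces, not commas")
    # Treat commas as separators so the remaining checks can still run.
    fields = line.replace(",", " ").split()
    if len(fields) != 5:
        problems.append(f"expected 5 fields, found {len(fields)}")
        return problems
    try:
        class_idx = int(fields[0])
        coords = [float(value) for value in fields[1:]]
    except ValueError:
        problems.append("class index must be an integer and coordinates must be numbers")
        return problems
    if not 0 <= class_idx < num_classes:
        problems.append(f"class index {class_idx} outside 0..{num_classes - 1}")
    if any(not 0.0 <= value <= 1.0 for value in coords):
        problems.append("coordinates must be normalized to the range 0..1")
    return problems


if __name__ == "__main__":
    print(check_label_line("0 0.5 0.5 0.2 0.3", num_classes=1))
    print(check_label_line("0,0.5,0.5,0.2 0.3", num_classes=1))
```

Running it over every line of every file in the labels directory gives a quick report of malformed labels without starting a training run.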
@wy3406 You are receiving these errors because your custom cfg file is not correctly formatted.
In yolov3-custom.cfg, the filters value and the classes value before each [yolo] layer need to be modified. As far as I know, there are four places that need to be modified, all at the end of the configuration file.
@wy3406 Pull the latest code. I have solved this problem.
First, I just confirmed that the height and width of all images are greater than 608.
@Lornatang Another error. Here is the message I got:
File "/yolov3/YOLOv3-PyTorch/models.py", line 254, in __init__
self.module_defs = parse_model_cfg(cfg)
File "/yolov3/YOLOv3-PyTorch/utils/parse_config.py", line 62, in parse_model_cfg
assert not any(u), "Unsupported fields %s in %s. See https://github.com/ultralytics/yolov3/issues/631" % (u, path)
AssertionError: Unsupported fields ['batch', 'subdivisions', 'width', 'height', 'channels', 'momentum', 'decay', 'angle', 'saturation', 'exposure', 'hue', 'learning_rate', 'burn_in', 'max_batches', 'policy', 'steps', 'scales'] in cfg/yolov3-custom.cfg. See https://github.com/ultralytics/yolov3/issues/631
@wy3406 Please give me your cfg file. Also, please run the following command again in the cfg directory:
bash create_model.sh <num-classes>
@Lornatang I generated the cfg with ‘bash create_model.sh 1’.
Make sure that the lines in your configuration file are the same as the following. This pair of layers appears three times, all at the end.
[convolutional]
size = 1
stride = 1
pad = 1
filters = 18
activation = linear
[yolo]
mask = 0,1,2
anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326
classes = 1
num = 9
jitter = .3
ignore_thresh = .7
truth_thresh = 1
random = 1
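The relationship between `filters` and `classes` in the snippet above can be checked mechanically. This is a minimal sketch, assuming the usual YOLOv3 rule that the convolutional layer directly before each [yolo] layer needs filters = (number of mask entries) * (classes + 5). `parse_cfg` and `check_yolo_filters` are hypothetical helpers written for this illustration; they are not the repository's own parser.

```python
def parse_cfg(text):
    """Parse darknet-style cfg text into a list of (section, options) pairs."""
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1], {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def check_yolo_filters(text):
    """Return one message per [yolo] layer whose preceding filters value is wrong."""
    sections = parse_cfg(text)
    messages = []
    for i, (name, opts) in enumerate(sections):
        if name != "yolo" or i == 0:
            continue
        anchors_used = len(opts["mask"].split(","))
        expected = anchors_used * (int(opts["classes"]) + 5)
        prev_name, prev_opts = sections[i - 1]
        actual = int(prev_opts.get("filters", -1))
        if prev_name != "convolutional" or actual != expected:
            messages.append(f"[yolo] #{i}: filters={actual}, expected {expected}")
    return messages


# Values taken from the snippet in the thread (one class, three mask entries).
SAMPLE = """
[convolutional]
size = 1
stride = 1
pad = 1
filters = 18
activation = linear

[yolo]
mask = 0,1,2
classes = 1
num = 9
"""

if __name__ == "__main__":
    print(check_yolo_filters(SAMPLE))
    print(check_yolo_filters(SAMPLE.replace("filters = 18", "filters = 255")))
```

An empty list means every [yolo] layer is preceded by a convolutional layer with the expected filter count; a cfg that still carries the 80-class value of 255 is reported.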
Thank you, solved the problem
@Lornatang I get an error in the evaluation step: RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71
@wy3406 What is your evaluation command?
@Lornatang
Namespace(accumulate=4, arch='default', batch_size=32, cache_images=False, cfg='cfg/yolov3-custom.cfg', data='cfg/voc2007_hand.data', device='', epochs=200, evolve=False, image_size=[416], multi_scale=True, nosave=False, notest=False, rect=False, resume=False, single_cls=False, weights='./weights/yolov3.weights', workers=8)
Using CUDA
device:0 (name=TITAN RTX, total_memory=24190MB)
device:1 (name=TITAN RTX, total_memory=24190MB)
2020-03-13 09:16:49.009174: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 09:16:49.009255: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-03-13 09:16:49.009265: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Using multi-scale 288 - 640
Caching labels (4644 found, 0 missing, 0 empty, 0 duplicate, for 4644 images): 100%|██████████| 4644/4644 [00:00<00:00, 10068.86it/s]
Caching labels (463 found, 0 missing, 0 empty, 0 duplicate, for 463 images): 100%|██████████| 463/463 [00:00<00:00, 10145.94it/s]
Model Summary: 222 layers, 6.17667e+07 parameters, 6.17667e+07 gradients
Using 8 dataloader workers.
Starting training for 200 epochs...
Epoch memory GIoU obj cls total targets image_size
0/199 20.13G 6.07 9.95 0 16 13 608: 100%|███████████████████████████████████████████████████████████████████████████████████| 146/146 [04:23<00:00, 1.81s/it]
Class Images Targets P R mAP@0.5 F1: 0%| | 0/8 [00:04<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 425, in
@wy3406 The dimensions of your input images are not all the same. For example, if there are 100 training images, of which 99 are 256×256 but one is 384×384, the tensor shape check in Python will raise an error.
Another possible cause is a batch size that does not evenly divide the dataset. Check that the dimensions of each training batch are correct. For example, if you have 1000 samples (assuming each one is a three-channel 256×256 px image) and the batch size is 4, then each training step takes a tensor of shape (4,3,256,256), and exactly 250 iterations cover the data (250 × 4 = 1000). But if you have 999 samples and keep the batch size at 4, 999 is not divisible by 4. The first 249 batches have shape (4,3,256,256), but the last batch has shape (3,3,256,256). Python checks that (4,3,256,256) != (3,3,256,256), and if the dimensions don't match, it reports an error. This can be called a small bug.
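The batch arithmetic in this example can be reproduced with a few lines. This is only an illustrative sketch: `batch_sizes` is a hypothetical helper, and its `drop_last` flag is named after the PyTorch `DataLoader` option of the same name, which discards an incomplete final batch. Whether a short final batch actually triggers an error depends on the training code.

```python
def batch_sizes(num_samples, batch_size, drop_last=False):
    """Return the size of each batch when a dataset is split in order."""
    full, remainder = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    # The leftover samples form one smaller batch unless it is dropped.
    if remainder and not drop_last:
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    even = batch_sizes(1000, 4)
    uneven = batch_sizes(999, 4)
    print(len(even), even[-1])      # number of batches, size of the last one
    print(len(uneven), uneven[-1])
    print(len(batch_sizes(999, 4, drop_last=True)))
```

With 999 samples the last batch holds 3 images, so its leading dimension differs from the other 249 batches.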
My suggestion: check the configuration of the fully connected layer in your yolov3 configuration file.
@wy3406 Pull the latest code. I recommend --weights "" or --weights weights/model_best.pth. Do not run with --weights weights/yolov3.weights.
@wy3406 Is the dataset you are using an Oxford gesture recognition dataset?
@Lornatang No, it is a custom dataset. The "RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 6 and 85 in dimension 2 at /pytorch/aten/src/THC/generic/THCTensorMath.cu:71" is caused by an error in the cfg file: 6 means xywh + 2 categories, which is the custom dataset, and 85 means xywh + 81 categories, which is the COCO dataset.
Shouldn't filters be 3 * (num_classes + xywh + 1)? Why do you calculate 6 rather than `3 * (2 + 4 + 1) = 21`?
@Lornatang Yes, you are right. The filters setting in the cfg file is 255. By two categories I mean that one of them is the background.
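The numbers in this exchange can be checked with a short worked calculation. It assumes the usual YOLOv3 layout, which the thread itself does not spell out: each anchor predicts 4 box values (xywh), 1 objectness score, and C class scores, and each [yolo] layer uses 3 anchors.

```latex
\begin{align*}
\text{values per anchor} &= 4 + 1 + C \\
\text{filters} &= 3 \times (C + 5) \\
C = 80: \quad & 4 + 1 + 80 = 85, \qquad 3 \times 85 = 255 \\
C = 1: \quad & 4 + 1 + 1 = 6, \qquad 3 \times 6 = 18 \\
C = 2: \quad & 4 + 1 + 2 = 7, \qquad 3 \times 7 = 21
\end{align*}
```

Under that assumption, the 6 in the error message matches a one-class cfg such as the filters = 18, classes = 1 snippet earlier in the thread, and the 85 matches the 80-class COCO configuration with filters = 255.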
@wy3406 You are welcome to give me more suggestions. Thanks
Hi, I have prepared the images and labels and placed them separately in ./data/custom/images and ./data/custom/label. I see you have the arg.rect parameter, but I don't know how to generate the corresponding one.