hkingtswcbyy opened this issue 3 years ago
What is your training command?
```shell
python train.py --device 0 --batch-size 16 --data coco.yaml --cfg yolov4-csp.cfg --weights '' --name yolov4-csp
```
It is exactly the same as in the README.md of the yolov4-csp branch.
Could you upload results.txt here? I get 45% AP and 65% AP50 at the 170th epoch.
Content of hyp.yaml:
```yaml
lr0: 0.01
momentum: 0.937
weight_decay: 0.0005
giou: 0.05
cls: 0.5
cls_pw: 1.0
obj: 1.0
obj_pw: 1.0
iou_t: 0.2
anchor_t: 4.0
fl_gamma: 0.0
hsv_h: 0.015
hsv_s: 0.7
hsv_v: 0.4
degrees: 0.0
translate: 0.0
scale: 0.5
shear: 0.0
perspective: 0.0
flipud: 0.0
fliplr: 0.5
mixup: 0.0
```
Content of opt.yaml:
```yaml
weights: ''
cfg: ./models/yolov4-csp.cfg
data: ./data/coco.yaml
hyp: data/hyp.scratch.yaml
epochs: 300
batch_size: 16
img_size:
```
My COCO dataset and labels were downloaded using the script at https://github.com/AlexeyAB/darknet/blob/master/scripts/get_coco2017.sh.
I cannot reproduce your results, and it seems everything is the same between your training and mine. The only difference I found is that I deleted the empty labels from the training data:
117266 found, 0 missing, 0 empty, 0 duplicate, for 117266 images
but I do not think this would cause such a big difference in AP.
Can you share your label files with me? I want to verify this because I have no other ideas right now.
I checked the files; I used the same label files as AB (AlexeyAB). Try deleting the old .cache files and re-generating them.
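For example (a minimal sketch; the ../coco path is an assumption based on the default dataset layout, adjust it to your setup):

```shell
# remove stale label caches so the dataloader re-scans the label files on the next run
find ../coco -name '*.cache' -delete
```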
Thanks a lot for your reply! I will re-train the model with your advice, hoping that'll help.
@hkingtswcbyy any joy?
So what's the result now?
Facing exactly the same issue ... @hkingtswcbyy @LukeAI @mpcv, have you solved it?
I just found that if PyTorch > 1.7.0 is used, it prints 'WARNING: smart bias initialization failure.' I am not sure whether this is the reason for the lower mAP. The warning can be fixed by editing https://github.com/WongKinYiu/ScaledYOLOv4/blob/yolov4-csp/models/models.py#L158-L167:
```python
try:
    j = layers[yolo_index] if 'from' in mdef else -1
    bias_ = module_list[j][0].bias  # shape(255,)
    bias = bias_[:modules.no * modules.na].view(modules.na, -1)  # shape(3,85)
    # bias[:, 4] += -4.5  # obj
    bias.data[:, 4] += math.log(8 / (640 / stride[yolo_index]) ** 2)  # obj (8 objects per 640 image)
    bias.data[:, 5:] += math.log(0.6 / (modules.nc - 0.99))  # cls (sigmoid(p) = 1/nc)
    module_list[j][0].bias = torch.nn.Parameter(bias_, requires_grad=bias_.requires_grad)
except:
    print('WARNING: smart bias initialization failure.')
```
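A first step is to print the swallowed exception so the real cause becomes visible. Below is a minimal sketch of one possible edit, reusing the surrounding names from models.py (layers, yolo_index, mdef, module_list, modules, stride; math and torch are already imported there). Wrapping the in-place bias edits in torch.no_grad() sidesteps the stricter in-place autograd checks in newer PyTorch, though I have not verified this against every release:

```python
try:
    j = layers[yolo_index] if 'from' in mdef else -1
    bias_ = module_list[j][0].bias  # shape(255,)
    # newer PyTorch rejects in-place edits of (views of) leaf Parameters while
    # autograd is recording, so do the bias surgery under no_grad
    with torch.no_grad():
        bias = bias_[:modules.no * modules.na].view(modules.na, -1)  # shape(3,85)
        bias[:, 4] += math.log(8 / (640 / stride[yolo_index]) ** 2)  # obj (8 objects per 640 image)
        bias[:, 5:] += math.log(0.6 / (modules.nc - 0.99))  # cls (sigmoid(p) = 1/nc)
    module_list[j][0].bias = torch.nn.Parameter(bias_, requires_grad=bias_.requires_grad)
except Exception as e:  # report the cause instead of swallowing it
    print('WARNING: smart bias initialization failure: %s' % e)
```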
I am using pytorch=1.7.0, so I did not get that WARNING. Steps to reproduce:
```shell
git clone https://github.com/WongKinYiu/ScaledYOLOv4.git
python -m torch.distributed.launch --nproc_per_node 8 train.py --device 0,1,2,3,4,5,6,7 --batch-size 96 --data coco.yaml --cfg yolov4-csp.cfg --weights '' --name yolov4-csp --sync-bn
```
I did not change anything in the source code. After 300 epochs of training, mAP never reaches 0.4. However, the pre-trained weights give almost the same mAP as in README.md. Note: training runs inside a docker container with pytorch=1.7.0, cuda=10.2, driver version 450.102.04, on 8 NVIDIA GeForce RTX 2080 Ti GPUs.
I got the same issue. Here are my results: results.txt
Hey @WongKinYiu, I am using the docker image suggested in the README. I am training with 8 V100 GPUs (I set the batch size to 64 or 128, but saw no significant difference). I think I have similar results to the others. Here are the last rows of my results.txt:
```
# columns (per train.py): epoch  gpu_mem  GIoU  obj  cls  total  targets  img_size  P  R  mAP@0.5  mAP@0.5:0.95  val_GIoU  val_obj  val_cls
288/299 11.9G 0.03276 0.06324 0.01066 0.1067 70 640 0.3726 0.6887 0.5851 0.3786 0.03741 0.0587 0.01362
289/299 11.9G 0.03267 0.06322 0.01053 0.1064 34 640 0.3718 0.6887 0.5846 0.3782 0.03742 0.05877 0.01362
290/299 11.9G 0.03273 0.0637 0.01058 0.107 27 640 0.3726 0.6889 0.5853 0.3783 0.03743 0.0588 0.01363
291/299 11.9G 0.03273 0.06351 0.01062 0.1069 54 640 0.3734 0.6889 0.5856 0.3788 0.03744 0.05884 0.01363
292/299 11.9G 0.03262 0.06377 0.01055 0.1069 30 640 0.373 0.6875 0.5853 0.3786 0.03745 0.05889 0.01364
293/299 11.9G 0.03267 0.0639 0.01063 0.1072 35 640 0.3718 0.6887 0.5862 0.3786 0.03746 0.05891 0.01365
294/299 11.9G 0.03268 0.06371 0.01056 0.1069 23 640 0.3716 0.689 0.5865 0.3793 0.03747 0.05893 0.01366
295/299 11.9G 0.03267 0.06318 0.01049 0.1063 58 640 0.3721 0.6888 0.5857 0.3792 0.03748 0.05893 0.01366
296/299 11.9G 0.03256 0.06316 0.0105 0.1062 65 640 0.3726 0.6896 0.5858 0.3794 0.03747 0.05893 0.01366
297/299 11.9G 0.03265 0.06359 0.01048 0.1067 34 640 0.374 0.6889 0.5856 0.3791 0.03748 0.05894 0.01366
298/299 11.9G 0.0325 0.06327 0.01042 0.1062 41 640 0.373 0.6892 0.585 0.3784 0.03748 0.05894 0.01365
299/299 11.9G 0.03251 0.0634 0.01054 0.1064 74 640 0.373 0.6895 0.5846 0.3783 0.03749 0.05897 0.01366
```
As you can see, the mAP@0.5:0.95 stays at around 0.38. Any suggestions?
Thanks.
I got the same issue too. In my case, I trained with an image size of 512x512:
```
288/299 8.33G 0.03481 0.09244 0.01262 0.1399 61 512 0.3494 0.6629 0.5522 0.3469 0.04183 0.08358 0.01528
289/299 8.33G 0.03483 0.09277 0.01255 0.1401 78 512 0.3494 0.6629 0.552 0.3474 0.04183 0.08356 0.01529
290/299 8.33G 0.03471 0.09283 0.01257 0.1401 87 512 0.35 0.6642 0.5525 0.3474 0.04183 0.08356 0.0153
291/299 8.33G 0.03467 0.09292 0.01259 0.1402 34 512 0.3486 0.6637 0.5516 0.3476 0.04182 0.08358 0.01531
292/299 8.33G 0.03475 0.09284 0.01257 0.1402 72 512 0.3494 0.6647 0.5522 0.3478 0.04181 0.0836 0.01532
293/299 8.33G 0.03475 0.09234 0.01268 0.1398 117 512 0.3493 0.6651 0.5526 0.3481 0.0418 0.08365 0.01532
294/299 8.33G 0.03464 0.09224 0.0125 0.1394 49 512 0.3485 0.6639 0.5522 0.3482 0.0418 0.08367 0.01532
295/299 8.33G 0.03469 0.09242 0.01245 0.1396 52 512 0.3491 0.6634 0.5522 0.3482 0.0418 0.08366 0.01532
296/299 8.33G 0.03468 0.09214 0.01252 0.1393 35 512 0.3497 0.6636 0.5525 0.3485 0.04181 0.08369 0.01532
297/299 8.33G 0.03468 0.09207 0.01249 0.1392 45 512 0.3501 0.6625 0.5521 0.3481 0.04182 0.08373 0.01534
298/299 8.33G 0.03463 0.09239 0.01247 0.1395 71 512 0.3497 0.6617 0.5521 0.3484 0.04181 0.08378 0.01535
299/299 8.33G 0.03481 0.09261 0.01246 0.1399 54 512 0.3494 0.6622 0.5514 0.3481 0.04183 0.08378 0.01536
```
The problem goes away if you switch to the yolov4-large branch and use yolov4-csp.yaml as the configuration file. On that branch, I am able to get the following result:
```
289/299 9.88G 0.03466 0.05194 0.01261 0.09921 33 640 0.4347 0.7318 0.655 0.4546 0.033 0.05124 0.01163
290/299 9.88G 0.03457 0.05188 0.0125 0.09895 16 640 0.4335 0.731 0.6548 0.4547 0.03301 0.05125 0.01163
291/299 9.88G 0.0346 0.05206 0.01259 0.09925 33 640 0.435 0.7321 0.6551 0.4543 0.03301 0.05126 0.01163
292/299 9.88G 0.03466 0.05209 0.01265 0.09939 20 640 0.4363 0.732 0.6553 0.454 0.03301 0.05127 0.01164
293/299 9.88G 0.03463 0.05208 0.01262 0.09933 34 640 0.4356 0.7313 0.655 0.4542 0.03301 0.05128 0.01165
294/299 9.88G 0.03458 0.05225 0.01261 0.09944 28 640 0.4361 0.7297 0.6553 0.4548 0.03301 0.05128 0.01164
295/299 9.88G 0.03457 0.05149 0.01247 0.09853 48 640 0.4352 0.73 0.655 0.4547 0.03299 0.05129 0.01165
296/299 9.88G 0.03441 0.05155 0.01263 0.09858 58 640 0.4355 0.7305 0.6549 0.4541 0.03299 0.05131 0.01165
297/299 9.88G 0.03445 0.05223 0.01243 0.0991 21 640 0.4352 0.7303 0.6559 0.455 0.03298 0.0513 0.01166
298/299 9.88G 0.03443 0.05173 0.01252 0.09868 23 640 0.4338 0.7309 0.6558 0.4549 0.03297 0.05131 0.01166
299/299 9.88G 0.03446 0.05162 0.01256 0.09864 29 640 0.4343 0.7298 0.6558 0.455 0.03297 0.05132 0.01166
```
The mAP is about 0.456, which is pretty close to the released model in darknet format.
I succeeded in training yolov4-csp with the yolov4-large branch, but I don't know how to use the result with darknet.
How can I convert a .pt file to darknet weights? Because yolov4-large uses yolov4-csp.yaml instead of yolov4-csp.cfg, I don't know how to convert it.
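The only idea I have so far (untested): the cfg-based models.py on the yolov4-csp branch descends from the yolov3 codebase, so it should ship a darknet-format writer (I believe Darknet.save_weights(), but treat the exact name as an assumption). Here is a rough sketch that copies the yaml-trained tensors into the cfg-defined model by position; this only works if both graphs enumerate parameters and buffers in the same order, so verify detections afterwards:

```python
import torch
from models.models import Darknet  # cfg-based model class from the yolov4-csp branch

# build the cfg-defined graph and load the yolov4-large-branch checkpoint;
# unpickling ckpt['model'] needs that branch's model code to be importable too
cfg_model = Darknet('models/yolov4-csp.cfg')
ckpt = torch.load('best.pt', map_location='cpu')
yaml_state = ckpt['model'].float().state_dict()

# the two branches use different module names, so copy tensors by position;
# copy_ raises on any shape mismatch, which catches ordering problems early
dst = cfg_model.state_dict()
assert len(dst) == len(yaml_state), 'tensor counts differ, orders will not line up'
for d, s in zip(dst.values(), yaml_state.values()):
    d.copy_(s.view_as(d))

cfg_model.save_weights('yolov4-csp.weights')  # write darknet-format weights
```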
The source code has been updated: https://github.com/WongKinYiu/ScaledYOLOv4/tree/yolov4-csp#yolov4-csp
Has anyone been able to find the difference between the broken state and the working state?
I believe all the steps were followed correctly based on the description. I am able to place my custom images into the appropriate folder and run the training. However, when training reaches the end of the 3rd epoch, it fails with the error shown in the attached image. My dataset was annotated in Roboflow and downloaded in the Scaled-YOLOv4 format.
Settings I use:
Steps I have taken, to no avail:
The error is: 'trying to create tensor with negative dimension: -928589440'.
I trained yolov4-csp on the COCO2017 dataset. The training setup is exactly the same as the instructions in the README.md of the yolov4-csp branch. First I tried multi-GPU training; the final val mAP after 300 epochs is about 0.4. Now I am training with one GPU, and the val mAP is stuck at 0.384 after 170 epochs. Can you give me some advice on how to reproduce the results of the yolov4-csp.weights you provide? I evaluated yolov4-csp.weights and got an mAP of 0.478, which is almost the same as the mAP reported in the paper.
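For reference, the evaluation followed the testing command from the yolov4-csp README (flags quoted from memory; double-check them against the branch README):

```shell
python test.py --img 640 --conf 0.001 --batch 8 --device 0 --data coco.yaml --cfg models/yolov4-csp.cfg --weights weights/yolov4-csp.weights
```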