WongKinYiu / ScaledYOLOv4

Scaled-YOLOv4: Scaling Cross Stage Partial Network

My val mAP on COCO2017 is much lower than the mAP reported in the paper when training YOLOv4-CSP from scratch #89

Open hkingtswcbyy opened 3 years ago

hkingtswcbyy commented 3 years ago

I trained yolov4-csp on the COCO2017 dataset. The training setup is exactly the same as the instructions in the README.md of the yolov4-csp branch. First I tried multi-GPU training; the final val mAP after 300 epochs is about 0.4. Now I am training with one GPU, and the val mAP is stuck at 0.384 after 170 epochs. Can you give me some advice on how to reproduce the result of the yolov4-csp.weights you provide? I evaluated yolov4-csp.weights and got an mAP of 0.478, which is almost the same as the mAP reported in the paper.


WongKinYiu commented 3 years ago

what is your training command

hkingtswcbyy commented 3 years ago

what is your training command

python train.py --device 0 --batch-size 16 --data coco.yaml --cfg yolov4-csp.cfg --weights '' --name yolov4-csp

exactly the same as in the README.md of the yolov4-csp branch

WongKinYiu commented 3 years ago

could you upload results.txt here. i get 45% AP and 65% AP50 at 170th epoch

hkingtswcbyy commented 3 years ago

could you upload results.txt here. i get 45% AP and 65% AP50 at 170th epoch

results.txt

Content of hyp.yaml:

    lr0: 0.01
    momentum: 0.937
    weight_decay: 0.0005
    giou: 0.05
    cls: 0.5
    cls_pw: 1.0
    obj: 1.0
    obj_pw: 1.0
    iou_t: 0.2
    anchor_t: 4.0
    fl_gamma: 0.0
    hsv_h: 0.015
    hsv_s: 0.7
    hsv_v: 0.4
    degrees: 0.0
    translate: 0.0
    scale: 0.5
    shear: 0.0
    perspective: 0.0
    flipud: 0.0
    fliplr: 0.5
    mixup: 0.0

Content of opt.yaml:

    weights: ''
    cfg: ./models/yolov4-csp.cfg
    data: ./data/coco.yaml
    hyp: data/hyp.scratch.yaml
    epochs: 300
    batch_size: 16
    img_size:

My COCO dataset and labels were downloaded using the script at https://github.com/AlexeyAB/darknet/blob/master/scripts/get_coco2017.sh

Screenshot from 2020-12-12 11-31-35

WongKinYiu commented 3 years ago

I cannot reproduce your results, and it seems everything is the same between your training and mine. The only difference I can find is that I delete the empty labels from the training data (117266 found, 0 missing, 0 empty, 0 duplicate, for 117266 images), but I do not think this would cause such a big difference in AP.

hkingtswcbyy commented 3 years ago

I cannot reproduce your results, and it seems everything is the same between your training and mine. The only difference I can find is that I delete the empty labels from the training data (117266 found, 0 missing, 0 empty, 0 duplicate, for 117266 images), but I do not think this would cause such a big difference in AP.

Can you share your label files with me? I want to verify this, because I have no other ideas now.

WongKinYiu commented 3 years ago

I checked the files. I used the same label files as AlexeyAB's. Try deleting the old .cache files and re-generating them.
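
A minimal sketch of clearing those caches so the dataloader rebuilds them on the next run. The ./coco root is an assumption here; adjust the path to your own layout:

    from pathlib import Path

    # Delete stale *.cache files so the dataloader rebuilds them on the next run.
    # './coco' is an assumed dataset root; adjust it to match your setup.
    for cache_file in Path('./coco').rglob('*.cache'):
        print(f'removing {cache_file}')
        cache_file.unlink()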

hkingtswcbyy commented 3 years ago

Thanks a lot for your reply! I will re-train the model with your advice, hoping that will help.

LukeAI commented 3 years ago

@hkingtswcbyy any joy?

pengchengma commented 3 years ago

Thanks a lot for your reply! I will re-train the model with your advice, hoping that will help.

so what's the result now?

kanybekasanbekov commented 3 years ago

Facing exactly the same issue... @hkingtswcbyy @LukeAI @mpcv have you solved it?

WongKinYiu commented 3 years ago

I just found that if PyTorch > 1.7.0 is used, it prints 'WARNING: smart bias initialization failure.' I am not sure if this is the reason for the lower mAP. The warning can be fixed by editing https://github.com/WongKinYiu/ScaledYOLOv4/blob/yolov4-csp/models/models.py#L158-L167

            try:
                j = layers[yolo_index] if 'from' in mdef else -1
                bias_ = module_list[j][0].bias  # shape(255,)
                bias = bias_[:modules.no * modules.na].view(modules.na, -1)  # shape(3,85)
                #bias[:, 4] += -4.5  # obj
                bias.data[:, 4] += math.log(8 / (640 / stride[yolo_index]) ** 2)  # obj (8 objects per 640 image)
                bias.data[:, 5:] += math.log(0.6 / (modules.nc - 0.99))  # cls (sigmoid(p) = 1/nc)
                module_list[j][0].bias = torch.nn.Parameter(bias_, requires_grad=bias_.requires_grad)
            except:
                print('WARNING: smart bias initialization failure.')
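
One way to make the same bias adjustment while avoiding in-place edits on a view of the live parameter (a common source of such failures on newer PyTorch) is to modify a detached copy and write it back under torch.no_grad(). This is only a sketch that assumes the same local variables as the snippet above (module_list, layers, yolo_index, mdef, modules, stride); it is not the repository's official fix:

    import math
    import torch

    with torch.no_grad():
        j = layers[yolo_index] if 'from' in mdef else -1
        bias_ = module_list[j][0].bias                       # shape (255,)
        b = bias_.detach().clone()                            # edit a copy, not the live Parameter
        view = b[:modules.no * modules.na].view(modules.na, -1)       # shape (3, 85)
        view[:, 4] += math.log(8 / (640 / stride[yolo_index]) ** 2)   # obj (~8 objects per 640px image)
        view[:, 5:] += math.log(0.6 / (modules.nc - 0.99))            # cls prior
        bias_.copy_(b)                                        # copy adjusted values back into the parameter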

kanybekasanbekov commented 3 years ago

I am using pytorch=1.7.0, so I did not get that WARNING. Steps to reproduce:

  1. git clone https://github.com/WongKinYiu/ScaledYOLOv4.git
  2. Install MishCuda as written in README.md
  3. Download dataset from https://github.com/AlexeyAB/darknet/blob/master/scripts/get_coco2017.sh
  4. Start training => python -m torch.distributed.launch --nproc_per_node 8 train.py --device 0,1,2,3,4,5,6,7 --batch-size 96 --data coco.yaml --cfg yolov4-csp.cfg --weights '' --name yolov4-csp --sync-bn

I did not change anything in the source code. After 300 epochs of training, the mAP never reaches 0.4. However, the pre-trained weights give almost the same mAP as in the README.md. Note: training is run inside a Docker container, PyTorch 1.7.0, CUDA 10.2, driver version 450.102.04, 8 NVIDIA GeForce RTX 2080 Ti GPUs.

thanhnguyentung95 commented 3 years ago

I got the same issue. Here are my results. results.txt

superaha commented 3 years ago

Hey @WongKinYiu, I am using the Docker image suggested in the README. I am training with 8 V100 GPUs (I set the batch size to 64 or 128, but saw no significant difference). I think I have similar results to the others. Here are the last rows of my results.txt.

   288/299     11.9G   0.03276   0.06324   0.01066    0.1067        70       640    0.3726    0.6887    0.5851    0.3786   0.03741    0.0587   0.01362
   289/299     11.9G   0.03267   0.06322   0.01053    0.1064        34       640    0.3718    0.6887    0.5846    0.3782   0.03742   0.05877   0.01362
   290/299     11.9G   0.03273    0.0637   0.01058     0.107        27       640    0.3726    0.6889    0.5853    0.3783   0.03743    0.0588   0.01363
   291/299     11.9G   0.03273   0.06351   0.01062    0.1069        54       640    0.3734    0.6889    0.5856    0.3788   0.03744   0.05884   0.01363
   292/299     11.9G   0.03262   0.06377   0.01055    0.1069        30       640     0.373    0.6875    0.5853    0.3786   0.03745   0.05889   0.01364
   293/299     11.9G   0.03267    0.0639   0.01063    0.1072        35       640    0.3718    0.6887    0.5862    0.3786   0.03746   0.05891   0.01365
   294/299     11.9G   0.03268   0.06371   0.01056    0.1069        23       640    0.3716     0.689    0.5865    0.3793   0.03747   0.05893   0.01366
   295/299     11.9G   0.03267   0.06318   0.01049    0.1063        58       640    0.3721    0.6888    0.5857    0.3792   0.03748   0.05893   0.01366
   296/299     11.9G   0.03256   0.06316    0.0105    0.1062        65       640    0.3726    0.6896    0.5858    0.3794   0.03747   0.05893   0.01366
   297/299     11.9G   0.03265   0.06359   0.01048    0.1067        34       640     0.374    0.6889    0.5856    0.3791   0.03748   0.05894   0.01366
   298/299     11.9G    0.0325   0.06327   0.01042    0.1062        41       640     0.373    0.6892     0.585    0.3784   0.03748   0.05894   0.01365
   299/299     11.9G   0.03251    0.0634   0.01054    0.1064        74       640     0.373    0.6895    0.5846    0.3783   0.03749   0.05897   0.01366

As you can see, the mAP stays at around 0.38. Any suggestions?

Thanks.

chjej202 commented 3 years ago

I got the same issue too. In my case, I trained with image size 512x512.

   288/299     8.33G   0.03481   0.09244   0.01262    0.1399        61       512    0.3494    0.6629    0.5522    0.3469   0.04183   0.08358   0.01528
   289/299     8.33G   0.03483   0.09277   0.01255    0.1401        78       512    0.3494    0.6629     0.552    0.3474   0.04183   0.08356   0.01529
   290/299     8.33G   0.03471   0.09283   0.01257    0.1401        87       512      0.35    0.6642    0.5525    0.3474   0.04183   0.08356    0.0153
   291/299     8.33G   0.03467   0.09292   0.01259    0.1402        34       512    0.3486    0.6637    0.5516    0.3476   0.04182   0.08358   0.01531
   292/299     8.33G   0.03475   0.09284   0.01257    0.1402        72       512    0.3494    0.6647    0.5522    0.3478   0.04181    0.0836   0.01532
   293/299     8.33G   0.03475   0.09234   0.01268    0.1398       117       512    0.3493    0.6651    0.5526    0.3481    0.0418   0.08365   0.01532
   294/299     8.33G   0.03464   0.09224    0.0125    0.1394        49       512    0.3485    0.6639    0.5522    0.3482    0.0418   0.08367   0.01532
   295/299     8.33G   0.03469   0.09242   0.01245    0.1396        52       512    0.3491    0.6634    0.5522    0.3482    0.0418   0.08366   0.01532
   296/299     8.33G   0.03468   0.09214   0.01252    0.1393        35       512    0.3497    0.6636    0.5525    0.3485   0.04181   0.08369   0.01532
   297/299     8.33G   0.03468   0.09207   0.01249    0.1392        45       512    0.3501    0.6625    0.5521    0.3481   0.04182   0.08373   0.01534
   298/299     8.33G   0.03463   0.09239   0.01247    0.1395        71       512    0.3497    0.6617    0.5521    0.3484   0.04181   0.08378   0.01535
   299/299     8.33G   0.03481   0.09261   0.01246    0.1399        54       512    0.3494    0.6622    0.5514    0.3481   0.04183   0.08378   0.01536

superaha commented 3 years ago

If you switch to the yolov4-large branch and use yolov4-csp.yaml as the configuration file, the results are much better. In that branch, I am able to get the following result:

   289/299     9.88G   0.03466   0.05194   0.01261   0.09921        33       640    0.4347    0.7318     0.655    0.4546     0.033   0.05124   0.01163
   290/299     9.88G   0.03457   0.05188    0.0125   0.09895        16       640    0.4335     0.731    0.6548    0.4547   0.03301   0.05125   0.01163
   291/299     9.88G    0.0346   0.05206   0.01259   0.09925        33       640     0.435    0.7321    0.6551    0.4543   0.03301   0.05126   0.01163
   292/299     9.88G   0.03466   0.05209   0.01265   0.09939        20       640    0.4363     0.732    0.6553     0.454   0.03301   0.05127   0.01164
   293/299     9.88G   0.03463   0.05208   0.01262   0.09933        34       640    0.4356    0.7313     0.655    0.4542   0.03301   0.05128   0.01165
   294/299     9.88G   0.03458   0.05225   0.01261   0.09944        28       640    0.4361    0.7297    0.6553    0.4548   0.03301   0.05128   0.01164
   295/299     9.88G   0.03457   0.05149   0.01247   0.09853        48       640    0.4352      0.73     0.655    0.4547   0.03299   0.05129   0.01165
   296/299     9.88G   0.03441   0.05155   0.01263   0.09858        58       640    0.4355    0.7305    0.6549    0.4541   0.03299   0.05131   0.01165
   297/299     9.88G   0.03445   0.05223   0.01243    0.0991        21       640    0.4352    0.7303    0.6559     0.455   0.03298    0.0513   0.01166
   298/299     9.88G   0.03443   0.05173   0.01252   0.09868        23       640    0.4338    0.7309    0.6558    0.4549   0.03297   0.05131   0.01166
   299/299     9.88G   0.03446   0.05162   0.01256   0.09864        29       640    0.4343    0.7298    0.6558     0.455   0.03297   0.05132   0.01166

The mAP is about 0.456, which is pretty close to that of the released model in darknet format.
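
For anyone trying the same workaround, the single-GPU commands would look roughly like this, assuming train.py on the yolov4-large branch accepts the same flags used earlier in this thread (the only change is pointing --cfg at the .yaml model definition):

    git clone -b yolov4-large https://github.com/WongKinYiu/ScaledYOLOv4.git
    cd ScaledYOLOv4
    python train.py --device 0 --batch-size 16 --data coco.yaml --cfg yolov4-csp.yaml --weights '' --name yolov4-csp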

chjej202 commented 3 years ago

I succeeded in training yolov4-csp with the yolov4-large branch, but I don't know how to use it with darknet.

How can I convert a .pt file to darknet weights? Because yolov4-large uses yolov4-csp.yaml instead of yolov4-csp.cfg, I don't know how to convert it to darknet weights.

WongKinYiu commented 3 years ago

The source code has been updated: https://github.com/WongKinYiu/ScaledYOLOv4/tree/yolov4-csp#yolov4-csp

vishnubanna commented 2 years ago

Has anyone been able to find the difference between the broken state and the working state?

hammockrobotics commented 3 months ago

I believe all the steps were followed correctly based on the description. I am able to place my custom images into the appropriate folder and run the training. However, when training reaches the end of the 3rd epoch, it shows the error below (see the attached image). My dataset was annotated in Roboflow and downloaded in the Scaled-YOLOv4 format.

Settings I use:

Steps I have taken, to no avail:

  1. Replacing the dataset with a completely new one from Roboflow, thinking my dataset was corrupted.
  2. Renaming the dataset labels and images to simpler names such as 01.jpg/01.txt, 02.jpg/02.txt, and so on.

The error is: 'trying to create tensor with negative dimension: -928589440'.