How to reproduce and benchmark YOLOv7-tiny

NatanBagrov commented 2 years ago

Hello,

Thanks for releasing YOLOv7!

I'm failing to reproduce the results of the tiny model. The goal is this blue point: [twitter]

From what I understand, YOLOv7-tiny reaches 35.2 mAP at 416x416 and beats the 35 mAP that YOLOv6 reaches at 640x640 resolution. Is that correct?

Steps:

I launched a training run using the snippet from this comment: https://github.com/WongKinYiu/yolov7/issues/106#issuecomment-1182646041

python train.py --workers 8 --device 0 --batch-size 64 --data data/coco.yaml --img 512 512 --cfg cfg/training/yolov7-tiny.yaml --weights '' --name yolov7-tiny --hyp data/hyp.scratch.tiny.yaml

The model reached 34.31 mAP (on 512x512):

wandb: Waiting for W&B process to finish, PID 3923210... (success).
wandb:
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▄▄▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb:   metrics/mAP_0.5:0.95 ▁▃▄▅▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇██████████████
wandb:      metrics/precision ▁▄▄▅▆▆▆▆▇▇▆▆▆▆▆▆▆▇▇▆▇▆▇▇▇▇▇█▇▇▇▆▇█▇▇▇▇▇▇
wandb:         metrics/recall ▁▃▄▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇▇█▇███
wandb:         train/box_loss █▆▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
wandb:         train/cls_loss █▅▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
wandb:         train/obj_loss █▇▆▅▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁
wandb:           val/box_loss █▅▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/cls_loss █▅▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss █▃▄▅▄▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▄▄▄▄▅▅▆▆▇▇▇
wandb:                  x/lr0 ███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                  x/lr1 ███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                  x/lr2 ███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb:        metrics/mAP_0.5 0.53092
wandb:   metrics/mAP_0.5:0.95 0.34311
wandb:      metrics/precision 0.61165
wandb:         metrics/recall 0.51045
wandb:         train/box_loss 0.03304
wandb:         train/cls_loss 0.01381
wandb:         train/obj_loss 0.03396
wandb:           val/box_loss 0.05588
wandb:           val/cls_loss 0.03234
wandb:           val/obj_loss 0.05537
wandb:                  x/lr0 0.0001
wandb:                  x/lr1 0.0001
wandb:                  x/lr2 0.0001

Then, I launched:

python test.py --data data/coco.yaml --img 416 --batch 32 --conf 0.001 --iou 0.65 --device 0 --weights runs/train/yolov7-tiny_512/weights/best.pt --name yolov7-tiny_416_val

The COCO-API mAP was 33.7:


Namespace(augment=False, batch_size=32, conf_thres=0.001, data='data/coco.yaml', device='0', exist_ok=False, img_size=416, iou_thres=0.65, name='yolov7-tiny_416_val', no_trace=False, project='runs/test', save_conf=False, save_hybrid=False, save_json=True, save_txt=False, single_cls=False, task='val', verbose=False, weights=['runs/train/yolov7-tiny_512/weights/best.pt'])
YOLOR 🚀 2022-7-8 torch 1.10.0 CUDA:0 (NVIDIA RTX A5000, 24256.3125MB)

Fusing layers...
Model Summary: 208 layers, 6221370 parameters, 0 gradients, 13.7 GFLOPS
Convert model to Traced-model...
traced_script_module saved!
model is traced!

val: Scanning 'coco/val2017' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupted: 100%|█| 5000/5000 [00:01<00:00,
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|█| 157/157 [01:53<00:00, 1.
                 all        5000       36335       0.605       0.483       0.507       0.325
Speed: 0.7/2.5/3.2 ms inference/NMS/total per 416x416 image at batch-size 32

Evaluating pycocotools mAP... saving runs/test/yolov7-tiny_416_val2/best_predictions.json...
loading annotations into memory...
Done (t=0.40s)
creating index...
index created!
Loading and preparing results...
DONE (t=6.62s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type bbox
DONE (t=83.83s).
Accumulating evaluation results...
DONE (t=17.25s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.337
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.513
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.356
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.137
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.360
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.520
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.291
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.477
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.525
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.287
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.582
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.732
Results saved to runs/test/yolov7-tiny_416_val2
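
When comparing several runs like this, it can help to pull the headline numbers out of the pycocotools summary rather than reading them by eye. The following is only a sketch: the function name parse_coco_summary and the regular expression are my own and are not part of the yolov7 repo or pycocotools. It assumes the summary lines have the shape shown in the log above.

```python
import re

# Matches one summary entry of the form printed in the log above, e.g.
# "Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.337"
SUMMARY_RE = re.compile(
    r"Average (Precision|Recall)\s+\((AP|AR)\)\s+@\[\s*IoU=([\d.:]+)\s*\|"
    r"\s*area=\s*(\w+)\s*\|\s*maxDets=\s*(\d+)\s*\]\s*=\s*(-?[\d.]+)"
)


def parse_coco_summary(log_text):
    """Return {(metric, iou, area, max_dets): value} for every summary entry found."""
    results = {}
    for _, metric, iou, area, max_dets, value in SUMMARY_RE.findall(log_text):
        results[(metric, iou, area, int(max_dets))] = float(value)
    return results


# Three entries copied from the evaluation log above.
log = (
    "Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.337 "
    "Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.513 "
    "Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.291"
)
summary = parse_coco_summary(log)
print(summary[("AP", "0.50:0.95", "all", 100)])  # 0.337
```

The key ("AP", "0.50:0.95", "all", 100) is the value usually quoted as COCO mAP, which is the 33.7 reported above.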



Questions:
* What is missing?
* Shouldn't we use anchors matched to 416x416? It's not a big deal, but anyway...
* Is there a single place that documents how to reproduce the published results?

WongKinYiu commented 2 years ago

The cfg and weights of YOLOv7-tiny are in the darknet branch. You can use PyTorch_YOLOv4 or YOLOR to reproduce the performance.

Yes, we already set the anchors for 416x416, and the training resolution is 512x512.

Also, our actual batch-32 average inference time is 0.361 ms, but because mt-yolov6 only reports one decimal place of precision, we follow their protocol and report our batch-32 average inference time as 0.4 ms.
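
For readers checking the arithmetic, here is a small sketch of that one-decimal reporting. The variable names are illustrative only; the 0.361 ms figure is the one quoted above.

```python
measured_ms = 0.361  # batch-32 average inference time quoted above

# Format to one decimal place, the precision used in the comparison.
reported = f"{measured_ms:.1f}"
print(reported)  # 0.4

# Relative amount by which the rounded figure overstates the measured one.
overstated_pct = (float(reported) - measured_ms) / measured_ms * 100
print(f"{overstated_pct:.1f}%")
```

Rounding up to 0.4 ms therefore reports the model as roughly a tenth slower than measured.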

NatanBagrov commented 2 years ago

Thanks for the response.

So I should train yolov7-tiny using the code that you posted, but evaluate it in a different repo?

AlexeyAB commented 2 years ago

@NatanBagrov

Did you delete the previous train2017.cache and val2017.cache files, and redownload the labels?
Did you use the latest code from this repo?

NatanBagrov commented 2 years ago

@NatanBagrov

Did you delete the previous train2017.cache and val2017.cache files, and redownload the labels?

Did you use the latest code from this repo?

Thanks for the reply, @AlexeyAB !

I did use the latest commit, but I did not make any changes to the data. I will do that now and launch a training run. What mAP should I expect from tiny at 512x512? And is it correct that I should train at 512? The tiny model is a bit of a mess in terms of understanding how to train and evaluate it...

[Update] Trying to reproduce with the new labels and the latest commit causes this error mid-training:

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     0/299     2.29G   0.09101     2.545     1.047     3.684       207       512:   0%|     | 1/1849 [00:03<1:50:13,  3.58s/it]Reducer buckets have been rebuilt in this iteration.
     0/299     2.19G   0.06374   0.07147    0.2578    0.3931        44       512: 100%|████| 1849/1849 [07:35<00:00,  4.06it/s]
               Class      Images      Labels           P           R      mAP@.5  mAP@.5:.95: 100%|█| 157/157 [00:32<00:00,  4.
                 all        5000       36335       0.769      0.0178     0.00805     0.00236

     Epoch   gpu_mem       box       obj       cls     total    labels  img_size
     1/299     2.22G   0.05101   0.04385    0.0683    0.1632       367       512:  43%|██▏  | 801/1849 [03:13<04:23,  3.98it/s]
Traceback (most recent call last):
  File "train.py", line 609, in <module>
    train(hyp, opt, device, tb_writer)
  File "train.py", line 336, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "/home/ubuntu/PycharmProjects/yolov7/utils/datasets.py", line 110, in __iter__
    yield next(self.iterator)
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
cv2.error: Caught error in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/.conda/envs/natan-development/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/PycharmProjects/yolov7/utils/datasets.py", line 600, in __getitem__
    labels = pastein(img, labels, sample_labels, sample_images, sample_masks)
  File "/home/ubuntu/PycharmProjects/yolov7/utils/datasets.py", line 1199, in pastein
    r_mask = cv2.resize(sample_masks[sel_ind], (r_w, r_h))
cv2.error: OpenCV(4.5.4) /tmp/pip-req-build-3129w7z7/opencv/modules/core/src/matrix.cpp:466: error: (-215:Assertion failed) _step >= minstep in function 'Mat'

Reproduce:

Checkout 4f6e390
Delete previous train2017.cache and val2017.cache files, and redownload labels
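
The cache cleanup in the step above can be scripted. This is a sketch only: remove_label_caches is a hypothetical helper, not something in the repo, and where the .cache files live depends on the paths in data/coco.yaml.

```python
from pathlib import Path


def remove_label_caches(dataset_root, names=("train2017.cache", "val2017.cache")):
    """Delete the named cache files anywhere under dataset_root; return the paths removed."""
    removed = []
    for name in names:
        # list() so the directory walk is finished before anything is deleted.
        for path in list(Path(dataset_root).rglob(name)):
            path.unlink()
            removed.append(path)
    return removed
```

Calling remove_label_caches("coco") would clear both caches if the dataset sits in a coco/ directory next to the repo, which is an assumption about the local layout. The labels still have to be redownloaded separately.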

Run:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train.py --workers 8 --device 4,5,6,7 --sync-bn --batch-size 64 --data data/coco.yaml --img 512 512 --cfg cfg/training/yolov7-tiny.yaml  --weights '' --name yolov7-tiny_512_new_data --hyp data/hyp.scratch.tiny.yaml

Workaround:

Wrap the body of the if (r_w > 10) and (r_h > 10) block in pastein (utils/datasets.py) in a try-except statement. The error occurs at most once per epoch for me, so I assume skipping those samples has no significant effect.
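
A minimal sketch of the pattern, kept free of OpenCV so it runs anywhere. paste_or_skip and failing_paste are hypothetical names used for illustration, not code from utils/datasets.py. In the real pastein body, the exception to catch would be cv2.error, as shown in the traceback above.

```python
def paste_or_skip(paste_fn, *args, errors=(Exception,)):
    """Run one paste attempt; return True on success, False if it raised one of `errors`."""
    try:
        paste_fn(*args)
        return True
    except errors as exc:
        # Skip only this sample so the rest of the batch is still augmented.
        print(f"paste skipped: {exc}")
        return False


# Stand-in for the failing resize; in the real code the error type would be cv2.error.
def failing_paste(width, height):
    raise ValueError(f"cannot resize mask to {width}x{height}")


print(paste_or_skip(failing_paste, 12, 11, errors=(ValueError,)))  # False
```

Catching only the specific error type, rather than every exception, keeps unrelated bugs in the data pipeline visible.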

NatanBagrov commented 2 years ago

An update, @AlexeyAB: training with the latest commit (4f6e390) and the updated labels results in 37.4 mAP at 512.

wandb: Waiting for W&B process to finish, PID 1890509... (success).
wandb:
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▄▅▅▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb:   metrics/mAP_0.5:0.95 ▁▃▄▅▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb:      metrics/precision ▁▄▅▅▆▆▆▆▇▇▆▆▆▇▇▆▇▆▇▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇██▇▇▇▇
wandb:         metrics/recall ▁▃▄▅▅▆▆▆▆▆▇▆▇▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇████
wandb:         train/box_loss █▆▄▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
wandb:         train/cls_loss █▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
wandb:         train/obj_loss █▇▆▅▅▅▅▅▅▅▅▄▄▄▄▄▄▄▄▄▄▄▄▄▃▃▃▃▃▃▃▃▂▂▂▂▁▁▁▁
wandb:           val/box_loss █▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/cls_loss █▅▄▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss ▆▃▅▆▄▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▃▃▃▃▄▄▄▅▅▆▆▇▇██
wandb:                  x/lr0 ███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                  x/lr1 ███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:                  x/lr2 ███████▇▇▇▇▇▇▆▆▆▆▅▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁
wandb:
wandb: Run summary:
wandb:        metrics/mAP_0.5 0.53412
wandb:   metrics/mAP_0.5:0.95 0.34786
wandb:      metrics/precision 0.61114
wandb:         metrics/recall 0.51673
wandb:         train/box_loss 0.03133
wandb:         train/cls_loss 0.01513
wandb:         train/obj_loss 0.03491
wandb:           val/box_loss 0.05499
wandb:           val/cls_loss 0.03209
wandb:           val/obj_loss 0.05549
wandb:                  x/lr0 0.0001
wandb:                  x/lr1 0.0001
wandb:                  x/lr2 0.0001

floveqq commented 2 years ago

@NatanBagrov, did you train with the line below and get 37.4 when evaluating at 640, or did you evaluate at 512 too?

python train.py --workers 8 --device 0 --batch-size 64 --data data/coco.yaml --img 512 512 --cfg cfg/training/yolov7-tiny.yaml --weights '' --name yolov7-tiny --hyp data/hyp.scratch.tiny.yaml

Did you get 38.7, as in the paper, when training with this main branch rather than the darknet branch?

WongKinYiu / yolov7

How to reproduce and benchmark YOLOv7-tiny #233