AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

EfficientDet: Scalable and Efficient Object Detection - 51.0% mAP@0.5...0.95 COCO #4346

Open AlexeyAB opened 4 years ago

AlexeyAB commented 4 years ago

EfficientDet: Scalable and Efficient Object Detection

First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion;


LukeAI commented 4 years ago

Looks really promising! The GPU latencies given are very low, but it uses EfficientNet as the backbone - how could that be?

isra60 commented 4 years ago

So could this be implemented in this darknet repository? I'm a little confused.

tianfengyijiu commented 4 years ago

Which model gives the best mAP@50 so far? Can I use EfficientDet-D0 to D6? I used yolov3-voc.cfg to train on my own dataset and got mAP@50 = 80 on my own test set. I just added three lines: flip=1, letter_box=1, mixup=1. Thanks a lot! @AlexeyAB

WongKinYiu commented 4 years ago

@AlexeyAB code released.

https://github.com/google/automl/tree/master/efficientdet

glenn-jocher commented 4 years ago

@AlexeyAB @WongKinYiu guys I might have an interesting clue about increasing mAP.

EfficientDet has 5 outputs (P3-P7) compared to 3 (P3-P5) for yolov3, but these extra 2 are for larger objects, not smaller ones. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but that was for smaller objects.

On the same topic, I recently added test-time augmentation to my repo https://github.com/ultralytics/yolov3/issues/931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image. The highest mAP increase came from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist).
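A minimal sketch of that kind of TTA merge (hypothetical model interface and helper name, not the actual ultralytics code):

import torch
import torch.nn.functional as F

def tta_inference(model, img):
    # TTA sketch: original image, left-right flip, and a 0.70-scale copy; predictions are
    # mapped back to original image coordinates and concatenated before NMS.
    # Assumes `model` returns (batch, n, 4+) detections as (x, y, w, h, ...) in pixels.
    with torch.no_grad():
        preds = [model(img)]

        flipped = torch.flip(img, dims=[3])                 # flip along the width axis
        p = model(flipped)
        p[..., 0] = img.shape[3] - p[..., 0]                # un-flip the x centre coordinate
        preds.append(p)

        scaled = F.interpolate(img, scale_factor=0.70, mode='bilinear', align_corners=False)
        p = model(scaled)
        p[..., :4] = p[..., :4] / 0.70                      # scale boxes back up
        preds.append(p)

        return torch.cat(preds, 1)                          # merged detections, ready for NMS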

So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think?

WongKinYiu commented 4 years ago

@glenn-jocher Hello,

Yes, from my previous analysis, I think we need at least one more scale (P6). I already started training a P3-P6 model several days ago. But there is also an issue I ignored at first: the input size may need to be a multiple of 64 instead of 32.

I saw your modification yesterday, and I have already integrated YOLOv3-SPP into mmdetection. Two different tricks are used in ultralytics and mmdetection. In ultralytics: train from scratch + prebias. In mmdetection: start from a pretrained model, but only the BatchNorm layers are updated. I would like to examine the performance of these two tricks. Also, I will apply the tricks from ATSS to YOLOv3-SPP if it does not need much code modification.

By the way, CSPDarkNet53 has also been integrated with TTFNet (anchor-free object detection), MSRCNN (instance segmentation), and JDE (simultaneous detection and tracking).

OK, I am training YOLOv3-SPP with almost the same settings as CSPResNeXt50-PANet-SPP (optimal) using ultralytics. I think it can be the baseline for your new model that integrates P3-P7.

AlexeyAB commented 4 years ago

@glenn-jocher @WongKinYiu

Yes, it seems that https://arxiv.org/abs/1909.00700v3 and https://github.com/ZJULearning/ttfnet (Training-Time-Friendly Network for Real-Time Object Detection) are a good direction.

Still, I think we must move on:


On the same topic, I recently added test-time augmentation to my repo ultralytics/yolov3#931, which increased mAP from 42.4 to 44.7. I tested many different options, and settled on 2 winners: a left-right flip, and a 0.70 scale image.

Yes, the same result as for CenterNet: they achieve 40.3% AP / 14 FPS, then with Flip they achieve 42.2% AP / 7.8 FPS, and with Multi-scale they achieve 45.1% AP / 1.4 FPS. https://github.com/xingyizhou/CenterNet#object-detection-on-coco-validation But this is effectively no longer a one-stage detector and makes the model not real-time. Meanwhile, I am thinking about flip-invariance/rotation-invariance/scale-invariance weights with a significant increase in accuracy and a small drop in FPS: https://github.com/AlexeyAB/darknet/issues/4495#issuecomment-578538967

The highest mAP increase was coming from the larger objects. I think the 0.70 scale made these large objects smaller so they could fit in the P5 layer (whereas maybe before they would have needed to be in the P6 layer which doesn't exist).

Maybe yes.

Efficientdet has 5 outputs (P3-P7) compared to 3 (P3-P5) for yolov3, but these extra 2 are for larger objects, not smaller objects. In the past I've added a 4th layer to yolov3, with the same or slightly worse results, but this was for smaller objects.

Yes, it is because they increased the input network resolution from 512x512 for D0 (where P3-P5 cover big objects) to 1536x1536 for D7 (where P3-P5 cover small objects), so we should add P6-P7 for big objects. The receptive field NxN of P5 doesn't depend on the network resolution and stays the same in pixels (look below for the yolo cfg files), so NxN is big relative to a 512x512 input, while the same NxN is small relative to a 1536x1536 input.
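Rough numbers to illustrate the point (the 600 px receptive field is just an assumed value for P5, not measured from a cfg):

rf_p5 = 600                          # assumed P5 receptive field in pixels, for illustration only
for resolution in (512, 1536):
    grid = resolution // 32          # P5 stride is 32
    print(f'{resolution}x{resolution}: P5 grid {grid}x{grid}, '
          f'receptive field ~{rf_p5 / resolution:.0%} of image width')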

So maybe we should:


So my proposal is (since the darknet53-bifpn cfg did not help), to simply add P6 and P7 outputs to yolov3-spp.cfg and test this out (with the same anchors redistributed among the layers I suppose). What do you think?

Yes, try 4 ways:

  1. yolov3-spp + P6
  2. yolov3-spp + P6 + network resolution 896x896
  3. yolov3-bifpn + P6-P7
  4. yolov3-bifpn + P6-P7 + network resolution 896x896

I added receptive field calculation - usage: add show_receptive_field=1 under [net] in the cfg-file:



While the input network size is just 608x608.

glenn-jocher commented 4 years ago

@WongKinYiu ah great! Lots of integrations going on. I have not looked at ATSS yet, I will check it out. TTFNet looks refreshingly simple.

Yes, that's a good chart you have there. How do you calculate the receptive field exactly? I saw EfficientDet updated their anchor ratios to (1.0, 1.0), (1.4, 0.7), (0.7, 1.4). I'm not sure exactly how these work. Do you think they create anchors based on multiplying grid cells or the receptive field?

Yes, a P6 layer requires 64-multiple size images, and a P7 layer would require 128-multiple size images, but it's not a huge problem.
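A tiny helper makes the rule explicit (the stride of the deepest output level sets the required multiple):

def required_multiple(max_level):
    return 2 ** max_level                # Pk has stride 2**k

def round_up(size, max_level):
    m = required_multiple(max_level)
    return ((size + m - 1) // m) * m     # round the image size up to the nearest multiple

print(required_multiple(5), required_multiple(6), required_multiple(7))   # 32 64 128
print(round_up(608, 6))   # 608 is a multiple of 32 but not of 64 -> rounds up to 640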

WongKinYiu commented 4 years ago

There is a 3x3 convolutional layer just before the prediction layer, so I simply multiply the grid size by 3.
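So under that heuristic, the per-level receptive field in pixels is just three times the stride, roughly:

# rough receptive-field estimate per output level: 3 grid cells = 3 * stride pixels
for level in range(3, 8):        # P3 .. P7
    stride = 2 ** level
    print(f'P{level}: stride {stride:3d} -> ~{3 * stride} px')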

WongKinYiu commented 4 years ago

@AlexeyAB

From the JDE paper, they provide results for different embedding losses, but unfortunately they only released code for cross entropy.

Also, there are some issues to be solved. For example, it only supports single-class tracking, and different anchors at the same scale share the same embedded feature.

The code is mainly based on ultralytics, so I think it can be a starting point for developing a triplet-loss based tracker. https://github.com/Zhongdao/Towards-Realtime-MOT

glenn-jocher commented 4 years ago

@WongKinYiu so you simply take a 3x3 grid as the receptive field. Ok.

Do you think it might be beneficial to have the anchors fixed in units of grid space instead of image space? Maybe this is what EfficientDet is doing with their (1,1), (1.4, 0.7), (0.7, 1.4) anchor multiples (I don't know what they do with these multiples).

Right now the anchors are fixed/defined in image space (pixels) rather than grid space, so the same anchor would take up a varying number of grid points depending on the output layer (if it were applied to different layers).

What do you think of the idea of defining the anchors as (1,1), (1.4, 0.7), (0.7, 1.4) local grid points, and then maybe testing out, say, a 2x and 3x multiple of that?
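Something like this sketch of the idea (not existing code): the same (w, h) multiples are reused at every output layer and converted to pixels by that layer's stride:

ratios = [(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]   # anchor shapes in grid cells
scales = [1, 2, 3]                              # the 2x / 3x multiples mentioned above

for level in range(3, 7):                       # P3 .. P6
    stride = 2 ** level
    anchors = [(w * s * stride, h * s * stride) for s in scales for w, h in ratios]
    print(f'P{level} (stride {stride}):', [(round(w), round(h)) for w, h in anchors])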

In other news, I implemented my yolov3-spp-p6 and I'm training it now. I trimmed some of the convolutions to keep the size manageable; it's 81M params now and training about 25% slower than normal. Early mAP was lower, but it seems to be crossing yolov3-spp and going higher at around 50 epochs. I'll keep my fingers crossed.

WongKinYiu commented 4 years ago

@glenn-jocher

From my previous analysis, I think {0.7, 1.4} is due to IoU >= 0.5: sqrt(0.5) ≈ 0.7, sqrt(2) ≈ 1.4, and sqrt(0.5)*sqrt(2) = 1, so (0.7, 1.4), (1.4, 0.7) and (1, 1) all have almost the same area.

glenn-jocher commented 4 years ago

@WongKinYiu ah yes, that makes sense! Also, the (1.4, 0.7) vs (1, 1) IoU is about 0.55, close to 0.5 (quick check below). From your earlier plots, though, it looks like the current anchors correspond much better to about 3x3 grid points than 1x1 grid points.
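def centred_iou(a, b):
    # IoU of two boxes sharing the same centre, a = (w, h), b = (w, h)
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

print(1.4 * 0.7)                             # ~0.98: same area as the (1, 1) anchor
print(centred_iou((1.4, 0.7), (1.0, 1.0)))   # ~0.547: just above the 0.5 IoU threshold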

At 512x512, the P3 grid is 64x64, P4 is 32x32, P5 is 16x16, and P6 is 8x8. If we had a P7 that would be 4x4, and 3 gridpoints at that scale would take up almost the entire image (which sounds about right). At the smaller scale though the P3 stride is 8, and we currently have anchors about that size (smallest is 10x13 pixels).

I'm worried my existing anchors are causing tension in the P6 model, as the GIoU loss is higher than normal. I simply spread out the 12 anchors I was using for yolov4.cfg (which has P3-P5, 4 at each level) to yolov4-p6 (which has P3-P6, 3 anchors at each level).

glenn-jocher commented 4 years ago

@AlexeyAB @WongKinYiu ok, my P6 experiment was tracking worse than yolov3-spp after about 150 epochs so I cancelled it. I'm not sure why exactly.

If I look at the yolov3-spp receptive field at P5 (stride 32), the largest anchor is (373, 326), or roughly 10 grid cells, which would be about 3X the receptive field according to @WongKinYiu's estimate.

P6 has stride 64, so only 1.5X receptive field for the largest anchor, yet overall mAP is worse. I did trim some convolution operations to keep the parameter count reasonable, so this could be the cause. Back to the drawing board I guess. @WongKinYiu how did your P6 experiment go?

WongKinYiu commented 4 years ago

@glenn-jocher

Currently at 140k iterations; it needs several weeks to finish training.

For yolov3-spp, the receptive field becomes very large because the SPP module is added (13x13 max-pool -> (32*13)x(32*13) = 416x416 receptive field).

AlexeyAB commented 4 years ago

@WongKinYiu @glenn-jocher

(13x13 max-pool -> (32*13)x(32*13) = 416x416 receptive field)

Also, you should take into account that conv3x3 stride=1 increases the receptive field too, not only conv3x3 stride=2.

You can see the receptive field in Darknet by using:

[net]
show_receptive_field=1
WongKinYiu commented 4 years ago

@glenn-jocher

Could you provide your cfg file and training command? I will modify it and train on ultralytics. (I get an error in test.py if I add a P6 yolo layer to the cfg.)

by the way, do you train/val on coco2014, or on coco2017?

glenn-jocher commented 4 years ago

@WongKinYiu yes here is the p6 cfg with 12 anchors, and a modified version of yolov3-spp called yolov4 that has the same 12 anchors, which trains to slightly above yolov3-spp (+0.1mAP).

I had to add a lot of convolutions to p6, so it has 81M params. I doubled the width of the stem convolutions (which use few params), but reduced the width of the largest head convolutions (i.e. 1024 -> 640 channels). Overall the result was slightly negative though, so you may want to adjust the cfg.

python3 train.py --data coco2014.data --img-size 416 608 --epochs 300 --batch 16 --accum 4 --weights '' --device 0 --cfg yolov4-81M-p6.cfg --name p6 --multi

yolov4-81M-p6.cfg.txt

AlexeyAB commented 4 years ago

@glenn-jocher Try to train and test this model with network resolution 832x832 (with random shapes). Also, why didn't you use an SPP block?

glenn-jocher commented 4 years ago

@AlexeyAB yes maybe I should put the SPP block back in on the P6 layer, and return the dn53 stem convolutions to their original sizes.

When I changed dn53 I saw that there were 8, 8 and 4 blocks in the last 3 downsamples. For p6 I changed this to 8, 8, 8 and 8 (no spp). Maybe I should update to 8, 8, 8, 4+spp, which would more closely mimic yolov3-spp.

WongKinYiu commented 4 years ago

@glenn-jocher

I started training yolov3-spp and yolov3-spp-p6. The loss of yolov3-spp-p6 is very large in the 1st epoch compared to yolov3-spp.

glenn-jocher commented 4 years ago

@WongKinYiu yes, the loss is larger, in part because the total loss is the sum of the layer losses, i.e.: total_obj_loss = obj_layer1_loss.mean() + obj_layer2_loss.mean() + obj_layer3_loss.mean()

whereas p6 will have an additional + obj_layer4_loss.mean(). But it may also simply be larger because the model is poorly designed.
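A minimal runnable sketch of that accounting (dummy per-anchor losses, not the actual repo code):

import torch

def total_obj_loss(per_layer_losses):
    # total objectness loss = sum of per-output-layer means,
    # so a 4th (P6) output layer adds a 4th term to the sum
    return sum(layer.mean() for layer in per_layer_losses)

p3_p5 = [torch.rand(100) for _ in range(3)]    # 3 output layers (P3-P5)
p3_p6 = p3_p5 + [torch.rand(100)]              # same losses plus an extra P6 layer
print(total_obj_loss(p3_p5), total_obj_loss(p3_p6))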

WongKinYiu commented 4 years ago

@glenn-jocher @AlexeyAB

A good PyTorch implementation of EfficientDet, 26x faster than the official TensorFlow implementation. https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch

AlexeyAB commented 4 years ago

@WongKinYiu Yes, @zylo117 achieved +3 (+10%) FPS and -1.2 (-4%) AP compared to the results stated in the paper. But it is 25x faster than the public TF code: https://github.com/zylo117/Yet-Another-EfficientDet-Pytorch/issues/77

glenn-jocher commented 4 years ago

@AlexeyAB yes it looks like a very good pytorch addition!

I'm pretty suspicious about the repo's ability to train from scratch currently (he mentions the PyTorch weights are transferred from TF and finetuned slightly), but the author at least seems to know EfficientDet in great depth.

It seems efficientdet training is very slow in general with large memory requirements.

varghesealex90 commented 4 years ago

I really don't think EfficientDet works well for real-time operation. On an Nvidia P100 GPU, the inference speed (with pre- and post-processing) is 10 FPS. In contrast, YOLO runs at 17 ms (58 FPS).

zylo117 commented 4 years ago

But I got 30+ FPS at batch size 1 when evaluating D0. Since the evaluation includes pre- and post-processing, I think you need to optimize your implementation. 10 FPS is way too slow.

My environment: torch 1.4, torchvision 0.5, Python 3.7 (not Anaconda), Ubuntu 19.10, i5-8400, RTX 2080 Ti.

AlexeyAB commented 4 years ago

@zylo117 How many FPS do you get for yolov3-spp? Note that yolov3-spp has higher accuracy than EfficientDet D0.

zylo117 commented 4 years ago

@zylo117 How many FPS do you get for yolov3-spp? While yolov3-spp has higher accuracy than EfficientDet D0.

I haven't tried yolov3-spp yet, but I do know yolov3-spp is faster and more accurate than D0. However, 8 FPS seems strange to me.

I think the real advantage of EfficientDet is that it consumes less memory and has fewer ops, so more models can be deployed on the same device, or it can run with a larger batch size for offline tasks; as I mentioned in my repo's readme, it gets 163 FPS at batch size 32.

glenn-jocher commented 4 years ago

@zylo117 does EfficientDet use many grouped convolutions, or depthwise convolutions? Can you link to the basic bottleneck module in your repo? I think grouped convolutions may be a cause of the slower speed. I know they can cause much slower training in PyTorch, and actually slow down inference as well, even if the model has fewer parameters and fewer FLOPS.

PyTorch (or Nvidia) lacks a native cuDNN backend kernel for grouped convolutions I think, so PyTorch falls back on its default method, which is not CUDA-optimized as I understand it. I made a small notebook to test these timing effects:

[screenshot: grouped-convolution timing results]

EDIT1: to be clear, this would be a general issue with modern 'efficient' object detectors like EfficientDet, and backbones like ResNeXt, not specific to @zylo117's PyTorch implementation. Indeed there may be no real solution other than to wait for Nvidia and PyTorch to implement a CUDA kernel that optimizes grouped convolution operations on GPU.

EDIT2: the timing effects shown simulate training, where a forward and backward pass are run on the convolution, but I believe inference shows similar but less severe slowdowns.

AlexeyAB commented 4 years ago

@zylo117 Yes, EfficientDet is more suitable for batch-inference and inference on TPU-edge.

zylo117 commented 4 years ago

@zylo117 does efficientdet use many grouped convolutions, or depthwise convolutions? [...]

Can you share the notebook? I'd like to run it in my environment. The link you provided is not accessible.

zylo117 commented 4 years ago

@zylo117 does efficientdet use many grouped convolutions, or depthwise convolutions? [...]

Never mind, I tried to implement it myself. This is the code.

import time

import torch
from torch import nn

k = 3
x = torch.randn((1, 128, 512, 512)).cuda()
print('%10s%10s%10s %-20s' % ('groups', 'time(ms)', 'params', 'shape m'))
for g in [1, 2, 4, 8, 16, 32, 64, 128]:
    m = nn.Conv2d(128, 256, k, stride=1, groups=g, padding=k // 2, bias=False).cuda()
    t1 = time.time()
    for _ in range(1000):  # time 1000 forward passes per group setting
        m(x)
    t2 = time.time()
    t = t2 - t1  # total seconds over 1000 iterations == ms per iteration
    p = list(m.parameters())[0]
    print('%10g%10.1f%10g %-20s' % (g, t, p.numel(), list(p.shape)))

And this is what I got,

      groups  time(ms)    params shape m             
         1       4.5    294912 [256, 128, 3, 3]    
         2       2.6    147456 [256, 64, 3, 3]     
         4       2.5     73728 [256, 32, 3, 3]     
         8       1.9     36864 [256, 16, 3, 3]     
        16       2.9     18432 [256, 8, 3, 3]      
        32       3.1      9216 [256, 4, 3, 3]      
        64       0.0      4608 [256, 2, 3, 3]      
       128       0.0      2304 [256, 1, 3, 3]  

The result is not so bad when dealing with group convs that have larger group counts.

I guess either torch has optimized group conv by version 1.4, or the RTX 2080 Ti can handle large-group group convs and provides a speedup.

env: i5-8400, Ubuntu 19.10 x64, RTX 2080 Ti, official Python 3.7, torch 1.4, torchvision 0.5

glenn-jocher commented 4 years ago

@zylo117 ah sorry, I've made the notebook public now: https://colab.research.google.com/drive/1tBkFOSLl3V1DguDgtlm6bNErPVDBCD7z?authuser=1#scrollTo=cjpQb9AsbfGR

Yes, your code looks good, except that PyTorch timing is tricky: CUDA calls are asynchronous, so you need to run synchronize() right before time.time() to get the true time:

def tsync():
    torch.cuda.synchronize() if torch.cuda.is_available() else None
    return time.time()

I timed training and inference operations. Inference is also slower for grouped conv to a lesser degree:

[screenshot: training and inference timing results]
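For reference, a minimal sketch of this kind of synchronized timing loop (not the notebook code; smaller tensors, assumes a CUDA device):

import time
import torch
from torch import nn

def tsync():
    torch.cuda.synchronize() if torch.cuda.is_available() else None
    return time.time()

x = torch.randn(1, 128, 64, 64).cuda()
m = nn.Conv2d(128, 256, 3, stride=1, groups=16, padding=1, bias=False).cuda()

t0 = tsync()
for _ in range(100):
    m(x)                          # inference: forward pass only
t_inf = tsync() - t0

t0 = tsync()
for _ in range(100):
    m(x).sum().backward()         # training: forward + backward pass
t_train = tsync() - t0

print(f'inference {t_inf * 10:.2f} ms/iter, training {t_train * 10:.2f} ms/iter')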
zylo117 commented 4 years ago

@glenn-jocher after changing my code to use your tsync() and run() and x, this is what I got. Group convs are almost as fast as normal convs. Also, with many groups, group convs are even faster.

training:

    groups  time(ms)    params shape m             
         1       8.9    294912 [256, 128, 3, 3]    
         2       8.5    147456 [256, 64, 3, 3]     
         4       8.7     73728 [256, 32, 3, 3]     
         8       9.3     36864 [256, 16, 3, 3]     
        16      10.2     18432 [256, 8, 3, 3]      
        32       9.5      9216 [256, 4, 3, 3]      
        64       7.8      4608 [256, 2, 3, 3]      
       128       7.5      2304 [256, 1, 3, 3]      

inference:

    groups  time(ms)    params shape m             
         1       0.7    294912 [256, 128, 3, 3]    
         2       0.5    147456 [256, 64, 3, 3]     
         4       0.5     73728 [256, 32, 3, 3]     
         8       0.5     36864 [256, 16, 3, 3]     
        16       0.6     18432 [256, 8, 3, 3]      
        32       0.4      9216 [256, 4, 3, 3]      
        64       0.2      4608 [256, 2, 3, 3]      
       128       0.2      2304 [256, 1, 3, 3]      
glenn-jocher commented 4 years ago

@zylo117 yes, exactly. My main discovery was not that the operations take longer, it is that size and FLOPS savings do not translate to faster speed. For example, given your results, a model composed of groups=16 convolutions would be 16X smaller and use 16X less FLOPS than a model made up entirely of comparable groups=1 convolutions, but it would not be any faster.

If it is only 8X smaller than a normal model, then it would be twice as slow...

WongKinYiu commented 4 years ago

@glenn-jocher

The total loss of the new 12-anchor model is very low, but the AP is very poor. Currently 116 epochs and 6.89 total loss. [cfg] [weights]

glenn-jocher commented 4 years ago

@WongKinYiu ah sorry bud, did not realize you were training it currently. Could you create a results.png to show? It's plot_results() in utils/utils.py.

I gave up on P6, it seems SPP is doing a good job of increasing the receptive field on its own.

BTW I saw you guys published YOLOv4! Congratulations. I’ve been cooking up a few changes of my own over here, it looks like I’ll need a new name now 😃

WongKinYiu commented 4 years ago

@glenn-jocher

Oh, I use --notest to speed up training, so there are no results to show.

OK, if I get any good results from the P6 model, I will share them for your reference. Since PyTorch 1.5 includes a significant update to the C++ front-end, I would like to develop some new functions with PyTorch.

Thank you, now I have time to read the details of your code and start designing a good head for an object detector. I borrowed two 2080 Ti cards for developing a new object detector head based on ultralytics.

glenn-jocher commented 4 years ago

@WongKinYiu ah, --notest! But then how do you know what the AP is?

It's actually a bad time to start working on the ultralytics/yolov3 repo. I've been working on a new repo which folds in all of my lessons learned over the last year from people trying to train their custom datasets. The new repo is simpler and cleaner, a step closer to AutoML-style training, and produces better results on new architectures I've explored. I've redefined the model architectures based on simple yaml files as well, which makes it easy to test new models. I'm aiming to release it in early May, but I'll send you and Alexey invitations tomorrow. It's called ultralytics/yolov4, ironically.

glenn-jocher commented 4 years ago

BTW the new PyTorch code I wrote is super efficient space-wise. The model yaml files that define the layers/anchors etc. are only about 50 lines, and the actual model.py file that contains the model classes is only about 200 lines long, including the yaml parser, detection module, forward method, etc. It's really minimalist and easy to understand.

WongKinYiu commented 4 years ago

@glenn-jocher

I run the test with last.pt on another GPU.

Thanks! I am glad to be invited to your new repository. Would you provide a Dockerfile for quick installation? I have some experiments on tracking and instance segmentation based on combining ultralytics/yolov3 and mmdetection. If possible, I would like to merge segmentation and tracking into the new ultralytics repository.

AlexeyAB commented 4 years ago

@glenn-jocher Hi,

I gave up on P6, it seems SPP is doing a good job of increasing receptive field on it’s own.

Did you add an SPP block after P6? How much did you increase the resolution, depth (number of layers), and width (filters)? Did you use alpha=1.2, beta=1.1, gamma=1.15 as stated in the EfficientNet/EfficientDet articles?
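For reference, those coefficients are the EfficientNet compound-scaling rule (depth = alpha^phi, width = beta^phi, resolution = gamma^phi); a quick sketch of the multipliers per phi:

alpha, beta, gamma = 1.2, 1.1, 1.15        # EfficientNet compound-scaling coefficients

for phi in range(8):                       # B0 .. B7
    print(f'phi={phi}: depth x{alpha ** phi:.2f}, '
          f'width x{beta ** phi:.2f}, resolution x{gamma ** phi:.2f}')

(EfficientDet scales its BiFPN and head with its own heuristics on top of this backbone rule.)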

BTW I saw you guys published YOLOv4! Congratulations. I’ve been cooking up a few changes of my own over here, it looks like I’ll need a new name now smiley

Thanks! We have placed an acknowledgement and a link to your repository in the article. Currently YOLOv4 is the top-1 real-time detector on video cards from RTX2050 to TitanV/V100, for both AP and AP50.

I’ve redefined model architectures based on simple yaml files as well, which makes it easy to test new models. I’m aiming to release it in early may, but I’ll send you and Alexey invitations tomorrow. It’s called ultralytics/yolov4 ironically.

Yes, it makes sense to port YOLOv4 to your PyTorch implementation and use it for faster research and future development. Maybe we should implement a detector with conv-LSTM, with additional tracking and re-identification of any object.

glenn-jocher commented 4 years ago

@WongKinYiu actually yes, maybe I should just post you guys a docker image for now, as I actually haven't made any commits; I've just been developing locally and testing in GCP with docker images myself. Tracking is a very interesting feature that would add a lot of value, and I've had several people inquire about it, but I would not underestimate the difficulty of implementing it well. I used to do Kalman filter design in the past, as well as implement KLT trackers. The KLT tracker naturally uses a type of 2D correlation between a recent template and a small future search area. A feature vector from yolov3 would be much more information-rich than that, and would not suffer the same drift problems over long time spans. There is a very big opportunity in this space.

@AlexeyAB I'm convinced now that focal loss only applies to detectors that combine the classification and objectness losses into one, like SSD and EfficientDet. These have a huge imbalance between foreground and background classes, unlike YOLOv3, which has a medium level of imbalance. At first I tried to combine obj and cls into one also, as it's simpler to build, but I found it's also a bit slower to run, because every inference has to compare thresholds across all classes for all anchors, rather than compare one threshold per anchor.

I saw the acknowledgements section in the paper, thanks! I think I'll explore P6 a bit later on, but for now I'm simply trying to get my new repo out. It's designed to be easier to use and harder to mess up.

AlexeyAB commented 4 years ago

@glenn-jocher

What do you think about XNOR-networks? Especially about SVR for XNOR training? https://www.researchgate.net/publication/323375650_A_Lightweight_YOLOv2_A_Binarized_CNN_with_A_Parallel_Support_Vector_Regression_for_an_FPGA

glenn-jocher commented 4 years ago

@AlexeyAB wow I didn't know about the acquisition. Was Ali a professor or advisor of Redmon's when Redmon was at university working on YOLO?

It's funny because Apple and Google are the yin and the yang of AI. Apple, naturally as a hardware company, is intensely focused on AI at the edge. I'm super excited for the 5nm A14 chip in the 2020 iPhones coming out later in the year, especially to see what TOPS they push the neural engine to. Google is focused on the opposite, drawing everyone's dollars and euros to the cloud, where they can sell their GCP services and TPU hours.

As for XNOR, I don't actually have any experience with it, but I've seen very impressive quantization with CoreML, where the models I export from PyTorch to CoreML (through ONNX) can be quantized to FP8 without any noticeable loss in precision. Whether XNOR can ultimately push that to single-bit quantization I'm not so sure; there would obviously have to be precision tradeoffs. I suppose the real question is whether you could export a much larger, much higher-performing model, i.e. 500M parameters, into a tiny XNOR model that performs as well as today's FP32 models at ~60M params like YOLOv3.

AlexeyAB commented 4 years ago

@glenn-jocher @WongKinYiu

I have a little regret that I did not go to work for them (with stock options), although there was such an opportunity )


I implemented XNOR inference for YOLO about 1.5 years ago, but I transfer data between layers in float, so the speed gain is lost. And accuracy is bad, since we should use another approach for training.

Nvidia GPUs support XNOR GEMM for CC >= 7.5 by using wmma::bmma_sync(c2_frag, a_frag, b_frag, c2_frag); // XOR-GEMM https://github.com/AlexeyAB/darknet/blame/2fc7fbbc0ea001170b12d39b840b9f4d34905dd4/src/im2col_kernels.cu#L1224-L1419

glenn-jocher commented 4 years ago

@AlexeyAB yes, I saw that before. Quantization seems to have different effects depending on the platform. In CoreML, model speed is completely unaffected going from FP32 to FP16 to FP8; the only difference is that the app bundle decreases in size if the model is prepackaged with it. So unless I'm doing something wrong there, they see zero speedup.

In PyTorch I haven't tried quantization yet, but apparently the blog claims significant speedup. https://pytorch.org/blog/introduction-to-quantization-on-pytorch/
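For what it's worth, the simplest entry point from that blog is dynamic quantization (CPU inference, mainly Linear/LSTM layers); a minimal example:

import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 10))

# weights stored as int8, activations quantized on the fly at inference time (CPU only)
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized(torch.randn(1, 1024)).shape)

Convolutional models need static quantization or quantization-aware training instead, which is presumably where the reported speedups come from.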

glenn-jocher commented 4 years ago

@AlexeyAB maybe the PyTorch guys are getting their speedup from assuming larger batch sizes? This could make sense, since iDetection only runs one image at a time (it's real-time on the iPhone), which would explain the lack of speedup.

WongKinYiu commented 4 years ago

There are two networks I am interested in:

  1. XNOR-Net: it can be applied to in-memory computing
  2. AdderNet: a CNN without multiplication