AlexeyAB opened this issue 4 years ago
@glenn-jocher @WongKinYiu Since batching always increases latency, it is not always applicable.
Quantization seems to have different effects depending on the platform. In CoreML, model speed is completely unaffected going from FP32 to FP16 to FP8; the only difference is that the app bundle decreases in size if the model is prepackaged with it. So unless I'm doing something wrong there, we see zero speedup.
Could it be that quantization to FP8 occurred automatically in all three cases?
Did you try such BIN/XNOR models?
resnet18-xnor-binary-onnx-0001 - 61.71% Top1 (vs. 70.7% Top1 for FP32, a 9.0% drop) https://github.com/opencv/open_model_zoo/blob/master/models/intel/resnet18-xnor-binary-onnx-0001/description/resnet18-xnor-binary-onnx-0001.md
resnet50-binary-0001 - 70.69% Top1 (vs. 75.8% Top1 for FP32, a 5.1% drop) https://github.com/opencv/open_model_zoo/blob/master/models/intel/resnet50-binary-0001/description/resnet50-binary-0001.md
xnor-net: it can be applied to in-memory computing
What does it mean? In-memory for ASIC/ULA/FPGA? Or in-memory for CPU-cache?
@WongKinYiu ah no, it's not possible that FP8 was used for all 3, because it takes a long time to quantize from FP32 down to FP8, so I don't think the iPhone was doing this operation on the fly, or even on first run. Also the app sizes mirrored the model sizes well.
I haven't tried any of the models, since my main use case is realtime detection on iOS, and it seems we are just about there already without xnor. What is the difference between xnor and binary?
@AlexeyAB @glenn-jocher
I have trained xnor-net and abc-net, but the accuracy is not promising enough for my task. The results of a mixture of floating-point and binary networks are OK, but hard to accelerate in a general framework.
In-memory computing (IMC) stores data in RAM rather than in databases hosted on disks. This eliminates the I/O and ACID transaction requirements of OLTP applications and exponentially speeds data access because RAM-stored data is available instantaneously, while data stored on disks is limited by network and disk speeds. IMC can cache massive amounts of data, enabling extremely fast response times, and store session data, which can help achieve optimum performance.
Only binary operations can easily be applied to in-memory computing, for example shift, or, and, xnor.
And for conventional computing, the computational unit usually uses NAND gates, so xnor-net or nand-net are a better choice than a binary network which uses and/or/not.
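For illustration, here is a toy sketch (my own, not from either repo) of why this maps to such simple logic: with weights and activations constrained to ±1 and packed into bit masks, a dot product reduces to an XNOR followed by a popcount.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors of +/-1 values, each packed into an
    integer bit mask (bit=1 encodes +1, bit=0 encodes -1)."""
    # XNOR marks the positions where the two vectors agree, popcount counts them.
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n  # matches minus mismatches


# Example: [+1, -1, +1] . [+1, +1, +1] = +1
print(binary_dot(0b101, 0b111, 3))  # -> 1
```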
@WongKinYiu ah, then the --cache command on ultralytics/yolov3 is the same as IMC? --cache stores all of the training data in RAM, which speeds up training significantly, especially for smaller datasets. This is a challenge for COCO though, where you need about 150GB of RAM to store all images at 640 resolution.
@glenn-jocher
--cache is for reducing data loader time, not for computing. IMC does the calculation of the network in memory.
So for digital hardware, in-memory computing is a trend; for analog, maybe spiking networks.
@WongKinYiu ah ok. I just had an idea BTW to maybe save memory when caching. Currently --cache loads an image, resizes it to the training --img-size (i.e. 640), and leaves it in RAM for all the dataloader workers to access when they want.
For datasets with large images, i.e. 1920x1080, caching at 640 will save RAM, but for datasets like COCO, resizing all of the images to 640 would actually use RAM unnecessarily, as many images are smaller than 640. So perhaps I could move the resize operation out of the caching function. This would reduce RAM requirements, but then the image would need to be resized every epoch (4 times per epoch when using the mosaic loader), so there would be a slight hit to speed, though small compared to loading from the hard disk.
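A rough sketch of that idea, using hypothetical helper names (cache, img_files) rather than the actual dataloader code: cache each image once at its native resolution and only resize it when the dataloader fetches it.

```python
import cv2

def load_cached_image(cache, img_files, index, img_size=640):
    # Keep the raw image in the RAM cache at native resolution;
    # resize to the training size only when the dataloader asks for it.
    img = cache.get(index)
    if img is None:
        img = cv2.imread(img_files[index])   # BGR, original resolution
        cache[index] = img                   # cached once, unresized
    h0, w0 = img.shape[:2]
    r = img_size / max(h0, w0)               # scale so the longer side == img_size
    if r != 1:                               # skip images already at the target size
        img = cv2.resize(img, (int(w0 * r), int(h0 * r)))
    return img, (h0, w0)
```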
@AlexeyAB just saw your message about rejecting the job offer from xnor.ai. haha, that's a big shame. It might not be too late though, maybe Ali could vouch for you if you applied to them through Apple now. Did you do an in person interview with them before?
@glenn-jocher Via Zoom. A lot of more significant things happened in my life, so no, not a big shame ) But overall, the company is very successful.
@WongKinYiu I ported the yolov4.cfg from this repo into mine, and made sure it's trainable. Not all of the attributes of each class are used, especially in the yolo layers, but the correct feature fusion and mish activations are used, and training seems to proceed with no errors now. I may need to implement a more memory-friendly version of Mish though: https://github.com/ultralytics/yolov3/issues/1098
Do you have any pytorch mish implementations you'd recommend?
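One common way to cut Mish memory in PyTorch is a custom autograd Function that saves only the input and recomputes the intermediate terms in the backward pass, rather than letting autograd store every intermediate tensor. A minimal sketch (not necessarily what either repo ended up using):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MishFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)                 # keep only the input tensor
        return x * torch.tanh(F.softplus(x))

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        tsp = torch.tanh(F.softplus(x))          # recomputed here, not stored
        grad = tsp + x * (1 - tsp * tsp) * torch.sigmoid(x)  # d/dx [x * tanh(softplus(x))]
        return grad_output * grad

class MemoryEfficientMish(nn.Module):
    def forward(self, x):
        return MishFunction.apply(x)
```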
@WongKinYiu you mentioned you had some free GPUs. Could you use one to train yolov4.cfg on ultralytics? Then we can get an apples-to-apples comparison with the latest yolov3-spp results. I can also add it to the training plots. You can use a smaller --batch as necessary; the repo handles the loss scaling accordingly now, so you can change --batch without worrying about --accum, which is deprecated. This does multi-scale 320 to 640, and tests at 640.
python train.py --img 320 640 --batch 8 --weights '' --cfg yolov4.cfg --data coco2014.data
@glenn-jocher
OK, I will stop P6 experiments and train it first.
But I have trained a CSPDarknet53s model, which is the same as the one in https://github.com/WongKinYiu/CrossStagePartialNetworks/tree/pytorch, using the January and April repos. The early-April repo seems to fit YOLOv3-SPP very well. Using the same model, I get +2% AP50 on YOLOv3-SPP but -3% AP50 on CSPDarknet53s-PANet-SPP when training with the January and April repos.
And when I use docker to install the late-April repo, it shows that some layers are not registered. I will try to update to the latest repo today; if I run into any problems, I will give you feedback.
@WongKinYiu actually wait, you are right, there are unimplemented layers still. The basic cfg file works, and I've removed the error, but many of the yolo layer attributes are not used. OK just hold on for a bit then.
@glenn-jocher Thank you very much!
@AlexeyAB @WongKinYiu have you guys tried training yolov4.cfg without mish? Mish seems to use massive amounts of GPU RAM, in PyTorch at least. I did a benchmark and found almost 3X the GPU RAM requirements for yolov4.cfg vs yolov3-spp.cfg.
I'm worried this may put a lot of people off and hinder wider adoption. I'm also worried about exportability through onnx to tflite, coreml etc with a custom activation function like this.
@glenn-jocher hello,
Yes, we did. It drops about 0.6% AP. Maybe you can use this cfg: https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620276493
@WongKinYiu ah ok, good to know. Let me play around with it a bit, maybe I'll have some ideas.
@glenn-jocher Mish (mish in the backbone) improves AP for CSPDarknet-PANet but degrades AP for CSPResNeXt-PANet
@AlexeyAB @WongKinYiu I think I finally discovered why none of these PANet models were working well on ultralytics. I started training a relu version of yolov4, with poor initial results, and I realized I have a hard-coded stride array in place in my repo: 32, 16, 8, the yolov3-spp output order. PANet is reversed, so I think I've been scaling my P3 and P5 anchors incorrectly in all of these latest models. I will reverse the order and train again.
My new repo includes strides in the model yaml files, so they are parameterized along with the model. I might try and take it a step further and solve for them automatically during a preliminary forward pass. Anyway, my mistake!
Should be fixed now, more or less, in https://github.com/ultralytics/yolov3/commit/9cc4951d4fb8df0cf1c9fed5e60c01c150e78a0c
stride = [32, 16, 8] # P5, P4, P3 strides
if 'panet' in cfg or 'yolov4' in cfg: # stride order reversed
stride = list(reversed(stride))
@glenn-jocher
OK, thank you. So maybe the reason my P6 model gets low loss but also low AP is that one stride is missing.
@WongKinYiu ahh, yes, this could affect your P6 also. I think when I ran my P6 I manually swapped the strides, but of course the master branch does not have this, so if you clone and try to run P6, you will also be affected. I'm sorry! I don't have a 100% foolproof solution, my commit above is a bit of a band-aid unfortunately.
My new repo has the stride info included as part of the model yaml. Here is the entire yolov3-spp.yaml for example. The modules are either custom, like Conv(), or pytorch modules, like nn.Conv2d().
I'm thinking a more automl-like solution would do 1 forward pass on a fixed image shape, i.e. 1x3x640x640, and automatically compute strides at each Detect() module. This would completely remove the human from the loop, causing fewer problems. But for now I have this.
# parameters
nc: 80 # number of classes
strides: [8, 16, 32] # P3, P4, P5 strides
# anchors
anchors:
- [10,13, 16,30, 33,23] # P3
- [30,61, 62,45, 59,119] # P4
- [116,90, 156,198, 373,326] # P5
# darknet53 backbone
backbone:
# [from, number, module, args]
[[-1, 1, Conv, [32, 3, 1]], # 0
[-1, 1, Conv, [64, 3, 2]], # 1-P1/2
[-1, 1, Bottleneck, [64]],
[-1, 1, Conv, [128, 3, 2]], # 3-P2/4
[-1, 2, Bottleneck, [128]],
[-1, 1, Conv, [256, 3, 2]], # 5-P3/8
[-1, 8, Bottleneck, [256]],
[-1, 1, Conv, [512, 3, 2]], # 7-P4/16
[-1, 8, Bottleneck, [512]],
[-1, 1, Conv, [1024, 3, 2]], # 9-P5/32
[-1, 4, Bottleneck, [1024]], # 10
]
# yolov3-spp head
# na = len(anchors[0]) // 2  (anchors per output layer)
head:
[[-1, 1, Bottleneck, [1024, False]],
[-1, 1, Conv, [512, 1, 1]],
[-1, 1, SPP, [512, [5, 9, 13]]],
[-1, 1, Conv, [1024, 3, 1]],
[-1, 1, Conv, [512, 1, 1]],
[-1, 1, Conv, [1024, 3, 1]],
[-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]], # 17 (P5-large)
[-3, 1, Conv, [256, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 8], 1, Concat, [1]], # cat backbone P4
[-1, 1, Bottleneck, [512, False]],
[-1, 1, Bottleneck, [512, False]],
[-1, 1, Conv, [256, 1, 1]],
[-1, 1, Conv, [512, 3, 1]],
[-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]], # 25 (P4-medium)
[-3, 1, Conv, [128, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 6], 1, Concat, [1]], # cat backbone P3
[-1, 1, Bottleneck, [256, False]],
[-1, 2, Bottleneck, [256, False]],
[-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]], # 31 (P3-small)
[[-1, 25, 17], 1, Detect, [nc, strides, anchors]], # Detect(P3, P4, P5)
]
@WongKinYiu @AlexeyAB ok, I've updated my new repo with auto-strides now, computed during a single forward pass during model init, so there is no more human error possible. The only likely remaining error source is that the anchor order may be accidentally reversed. The class counts in the detection layers are automatically compared to the classes in the data when building the model also, so specifying an incorrect class count in the yaml will not break the training either (which happens all the time to users now).
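Roughly, the stride computation can look like this. This is a simplified sketch rather than the actual model code; it assumes the model returns the raw P3/P4/P5 feature maps for a dummy input.

```python
import torch

def infer_strides(model, img_size=640):
    # Forward a dummy image once at init and derive each output stride from the
    # ratio of input height to feature-map height. Assumes model(x) returns the
    # list of detection feature maps, which is a simplification.
    model.eval()
    with torch.no_grad():
        outputs = model(torch.zeros(1, 3, img_size, img_size))
    return [int(img_size / o.shape[-2]) for o in outputs]  # e.g. [8, 16, 32]
```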
I will also add anchor-order error checking to the error-check list that runs before training starts. This is really exciting, I'm slowly removing every possible route that users could use to 'break' the training, and reducing the hyperparameters they can modify to an absolute minimum. I think this will really help a lot more people train custom datasets successfully. I'm going to add a kmeans step also that runs before training starts, so you can specify your own anchors if you want, or leave them empty for the training algorithm to create its own.
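The kmeans step could be as simple as this sketch (scipy kmeans on label widths/heights; the actual pre-training check may differ):

```python
import numpy as np
from scipy.cluster.vq import kmeans

def autoanchor(wh, n=9):
    """wh: (N, 2) array of label widths/heights in pixels at training resolution.
    Returns n anchors sorted by area (small -> large, matching P3 -> P5 order)."""
    s = wh.std(0)                      # whiten for k-means stability, then undo
    k, _ = kmeans(wh / s, n, iter=30)
    k *= s
    return k[np.argsort(k.prod(1))]
```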
I should have the repo out soon, I'm still cleaning it up and working out the kinks.
@glenn-jocher
Thank you, can I start training https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620270472 using latest repo?
@WongKinYiu yes I think you can start training, but be aware that many of the attributes in the yolo class won't have the same effect as they do here (though these also have no effect when training yolov3-spp.cfg to 43mAP)
After fixing the stride issue, I started training a yolov4-relu.cfg, using the same anchors as yolov3-spp, to isolate the training changes caused by the new architecture (i.e. all else being equal). Unfortunately the new training (orange) is coming in below the yolov3-spp metrics (blue), at least at this early stage. Training will take about two weeks, so I'll leave it running. The only difference between this and what you would run @WongKinYiu is Mish and the updated anchors.
@glenn-jocher Hello,
Do you train https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620270472 using a V100 GPU? I cannot train yolov4 even with batch=4 due to OOM.
@WongKinYiu yes I know! The gpu memory usage in pytorch for yolov4 is off the charts. See https://github.com/ultralytics/yolov3/issues/1098#issuecomment-620194657
The above plot is training yolov4-relu.cfg, which just swaps the mish activations for relu. I can't really train normal yolov4 either; it's just not practical. In addition to the 3X GPU memory usage, the training speed is about 2X slower, due primarily to the smaller batches.
@WongKinYiu but to answer your question, I'm training on a T4. It's slower but more economical, and has 15 GB of memory. One epoch with yolov4.cfg takes about 150 minutes. One epoch of yolov4-relu.cfg takes about 70 min, only slightly slower than yolov3-spp.cfg, and about the same memory.
@glenn-jocher
Thanks, currently I have changed the backbone to CSPDarknet53s and train with batch=4.
One additional question: can P3-P6 and P3-P7 models be trained normally using the latest repo, or do I need to modify parts of the code?
@WongKinYiu @AlexeyAB I was reviewing the EfficientDet paper again and was surprised by a training value I'd missed before. Their weight decay is 4e-5, batch-size 128.
In ultralytics/yolov3 (and darknet and ASFF also), weight decay is about 10x larger, and I believe it is also applied more frequently since we use batch-size 64 (weight decay is applied once per optimizer update, every 64 images). This seems like quite a large discrepancy. I verified the value in the official repo as well: https://github.com/google/automl/blob/17637b428d46b002f4586b9541f6b7bbf2fab4bf/efficientdet/hparams_config.py#L213
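A rough back-of-the-envelope comparison, ignoring learning-rate schedules and optimizer differences, just to show how far apart the two settings are when decay is applied once per optimizer update:

```python
# Per-image decay strength ~ weight_decay / images_per_optimizer_update
ultralytics_per_image = 5e-4 / 64     # decay=0.0005, nominal batch 64  -> ~7.8e-6
efficientdet_per_image = 4e-5 / 128   # decay=0.00004, batch 128        -> ~3.1e-7
print(ultralytics_per_image / efficientdet_per_image)  # ~25x stronger per image
```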
@glenn-jocher Hi,
Do you want to try using decay=0.00004 instead of the current value? https://github.com/AlexeyAB/darknet/blob/6cbb75d10b43a95f11326a2475d64500b11fa64e/cfg/yolov4.cfg#L11
Did you check whether correct anchors/masks for P3-P6 give better accuracy?
@AlexeyAB no, I have not tried training P6 yet. I'm busy trying to validate my new repo's training against my existing yolov3 repo. I made too many changes too fast and just discovered I wasn't applying weight decay properly, which caused a surge in mAP early on, in epochs 0-100, but then a peak at 100 and overtraining afterward, while the original ultralytics/yolov3 with weight decay trained nice and steady to a higher peak at around epoch 270 out of 300. But in the process of double-checking the weight decay I noticed EfficientDet uses a much lower value. I don't have free GPUs now, but when I get the new repo launched I will train a new model at a lower weight decay to compare.
P6 looks like it would actually be easier to add on a PANet like yolov4 than FPN like yolov3 in any case.
The current yolov4-relu training looks like this. This is a special yolov4 training just to compare architecture change effects in the absence of all the additional changes. So far I'm not seeing an improvement in the yolov4 architecture (orange) vs yolov3-spp (blue). We will have to wait a long time for this result, training is poking along at about 20 epochs per day on a T4.
@AlexeyAB @glenn-jocher
I am training a P6 model on a single 2080ti. ~20 epochs per day.
yolov4-mish: ~10 epochs per day due to the smaller batch size.
@glenn-jocher
So far I'm not seeing an improvement in the yolov4 architecture (orange) vs yolov3-spp (blue).
Maybe some of the advantage of the yolov4 architecture (CSP + PAN instead of FPN) can only be achieved by using a pre-trained weights file that was trained with BoF+BoS+Mish on ImageNet? Or maybe a large model should be trained longer.
I don't have free GPUs now, but when I get the new repo launched I will train a new model at a lower weight decay to compare.
Do you use decay=0.0005 now?
And where do you get free GPUs?
@AlexeyAB well, all of my GPUs at the moment are from a GCP credit that Ultralytics received when we participated in an accelerator last year called Decelera, in Mayakoba, Mexico. I'm not sure if it's $20k or possibly $100k, but to make the most efficient use of the credits, i.e. the most epochs/$, I'm training on T4s at about $400 each per month. Unfortunately they are quite slow, about 2-3X slower than a 2080 Ti, but they do come with 15 GB RAM, which is nice.
Day-to-day tests I run on Colab, since I don't actually have any usable local GPUs. I do all my work on a MacBook Pro, which does not support CUDA eGPUs due to some ridiculous fight between Apple and NVIDIA.
I'm waiting for the 3080 Ti to come out later this year, and then I think I might finally buy a box for myself, probably a 4-GPU box from Lambda Labs for about $8k.
Yes the pretraining might be the missing link. I do all of my training from scratch actually, after I saw better results this way in a side by side comparison last year. Unfortunately I usually see earlier overfitting on coco when using pretrained weights.
@AlexeyAB yes I'm using the same weight decay as here, 5E-4
@glenn-jocher @AlexeyAB
I just finished training CSPDarknet53s-PANet-SPP with optimized anchors for 512x512 using ultralytics.
Using python3 test.py --cfg cd53s.cfg --weights last.pt --img 512 --iou-thr 0.7 to test, I get 43.2% AP_0.50:0.95.
Using python3 test.py --cfg cd53s.cfg --weights last.pt --img 608 --iou-thr 0.7 to test, I get 44.4% AP_0.50:0.95.
@WongKinYiu @glenn-jocher Nice!
Does --iou-thr 0.7 use IOU_thresh=0.7 or 0.5...0.95 for the AP calculation? Do you use --augment? And do you get this result on valid or on the test-dev eval server?
@AlexeyAB
- It is the 5k set; I would like to evaluate the test-dev set tomorrow. (Does --iou-thr 0.7 use IOU_thresh=0.7 or 0.5...0.95 for AP calculation?)
- 0.7 is the IoU threshold of NMS. (What was fixed to get good results?)
- Just fixing the stride order of the yolo layers. (So you don't use --augment?)
- No, I do not use --augment.
By the way, if I use best.pt, it gets 43.4%/44.5% with input resolution 512x512/608x608.
@WongKinYiu
So you get 44.4% AP50...95 on the 5k-valid dataset, which gives +1.3% AP compared to YOLOv3-SPP's 43.1% AP50...95. https://github.com/ultralytics/yolov3#map
What mini-batch size did you use?
YOLOv4 416x416 gives 47.1% AP50...95 on the 5k-valid set.
@WongKinYiu ah great! 44.5 is a great result, and yes 0.7 --iou is best for mAP@0.5:0.95.
Can you post your results.png here? Also can you link to this cfg?
Yes, in general you should probably always use best.pt after training is complete. last.pt can be used to --resume for example, but in most/all? cases best.pt should provide the best results.
@AlexeyAB yes this should be an apples to apples comparison, current mAP for yolov3-spp on coco2014 is 43.1 vs 44.5, so +1.4.
How do you get 47.1mAP?? That sounds extremely good, what's the catch?
@glenn-jocher This is for the 5k-val set, not for test-dev.
YOLOv4 416x416: 47.1% AP50...95 on the 5k-val set, 41.2% AP50...95 on test-dev.
@WongKinYiu thanks, I compared the cfg with https://github.com/ultralytics/yolov3/blob/master/cfg/yolov4-relu.cfg, and the only difference is that 4 convolutions and 2 routes have been commented out. I'm surprised that you got such a good result then, because when I trained yolov4-relu.cfg up to 100 epochs it was still trailing yolov3-spp (see https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-622516664), so I cancelled the training. Can you post your results.txt so I can plot it side by side with yolov3-spp and my yolov4-relu training?
@WongKinYiu also, it's interesting that your best.pt is 500 MB; this means that the optimizer is packaged with the weights. I committed a change about a month ago to strip the optimizers from best.pt and last.pt once training is complete, so this means you have a slightly dated version of the repo. Do you know which git hash your repo has? It would be good for me to compare, to make sure I haven't made any large changes since then, since your version seems to be working well.
@glenn-jocher
I use the 1 May repo.
I always get an error at this line https://github.com/ultralytics/yolov3/blob/master/test.py#L211 when I use python train.py ..., so I cannot evaluate performance during training. But I can run python test.py ... without any error.
And the strip function is at https://github.com/ultralytics/yolov3/blob/master/train.py#L365, which comes after the evaluation step. I think that is why the "strip the optimizers from best.pt and last.pt once training was complete" step is not processed.
By the way, the P6 model gets only 40% AP.
@WongKinYiu oh, I understand. Training completes 300 epochs, and then tries to use pycocotools for final mAP, then crashes, so strip function is not run.
Great, that is very recent, there are essentially zero changes that should affect training and testing in that time compared to the current repo!
Could you post your results.txt file here for cd53s?
@WongKinYiu btw, this is a numpy-pycocotools bug. If you install numpy == 1.17 it resolves the issue.
@glenn-jocher
Could you post your results.txt file here for cd53s?
Due to the numpy-pycocotools bug, the strip and rename steps were not processed, and my results.txt has already been overwritten by the new training.
@WongKinYiu hmm ok. Maybe I can try and get it directly from the model, since the training info was never stripped from it it may still be there. I'll try and do that and plot it against https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-622516664
I was able to retrieve the training results from the model file. I plotted them against yolov3-spp (43.1 mAP) and yolov4-relu (training cancelled after 100 epochs). Results are overall very similar, though overtraining seems to be a bit less, and objectness in particular looks a bit different. What was your training command for this training?
EfficientDet: Scalable and Efficient Object Detection