AlexeyAB opened this issue 4 years ago
@glenn-jocher @WongKinYiu Since batching always increases latency, it is not always applicable.
Quantization seems to have different effects depending on the platform. In CoreML, model speed is completely unaffected going from FP32 to FP16 to FP8; the only difference is that the app bundle decreases in size if the model is prepackaged with it. So unless I'm doing something wrong there, we see zero speedup.
Could it be that quantization to FP8 occurred automatically in all three cases?
Did you try such BIN/XNOR models?
resnet18-xnor-binary-onnx-0001 - 61.71% Top1 (vs. 70.7% Top1 for FP32, a 9.0% drop) https://github.com/opencv/open_model_zoo/blob/master/models/intel/resnet18-xnor-binary-onnx-0001/description/resnet18-xnor-binary-onnx-0001.md
resnet50-binary-0001 - 70.69% Top1 (vs. 75.8% Top1 for FP32, a 5.1% drop) https://github.com/opencv/open_model_zoo/blob/master/models/intel/resnet50-binary-0001/description/resnet50-binary-0001.md
xnor-net: it can be applied to in-memory computing
What does it mean? In-memory for ASIC/ULA/FPGA? Or in-memory for CPU-cache?
@WongKinYiu ah no, it's not possible that FP8 was used for all 3, because it takes a long time to quantize from FP32 down to FP8, so I don't think the iPhone was doing this operation on the fly, or even on first run. Also the app sizes mirrored the model sizes well.
I haven't tried any of the models, since my main use case is realtime detection on iOS, and it seems we are just about there already without xnor. What is the difference between xnor and binary?
@AlexeyAB @glenn-jocher
I have trained xnor-net and abc-net, but the accuracy is not promising enough for my task. The results of a mixture of floating-point and binary networks are OK, but hard to accelerate in a general framework.
In-memory computing (IMC) stores data in RAM rather than in databases hosted on disks. This eliminates the I/O and ACID transaction requirements of OLTP applications and exponentially speeds data access because RAM-stored data is available instantaneously, while data stored on disks is limited by network and disk speeds. IMC can cache massive amounts of data, enabling extremely fast response times, and store session data, which can help achieve optimum performance.
Only binary operations can easily be applied to in-memory computing, for example shift, or, and, xnor.
And for conventional computing, the computational unit usually uses NAND gates, so xnor-net or nand-net are a better choice than a binary network which uses and/or/not.
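For illustration, here is a toy sketch (my own, not from either repo) of why this maps to such simple logic: with weights and activations constrained to ±1 and packed into bit masks, a dot product reduces to an XNOR followed by a popcount.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n vectors of +/-1 values, each packed into an
    integer bit mask (bit=1 encodes +1, bit=0 encodes -1)."""
    # XNOR marks the positions where the two vectors agree, popcount counts them.
    matches = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return 2 * matches - n  # matches minus mismatches


# Example: [+1, -1, +1] . [+1, +1, +1] = +1
print(binary_dot(0b101, 0b111, 3))  # -> 1
```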
@WongKinYiu ah, then the --cache command on ultralytics/yolov3 is the same as IMC? --cache stores all of the training data in RAM, which speeds up training significantly, especially for smaller datasets. This is a challenge for COCO though, where you need about 150GB of RAM to store all images at 640 resolution.
@glenn-jocher
--cache is for reducing data loader time, not for computing. IMC does the calculation of the network in memory.
So for digital hardware, in-memory computing is a trend; for analog, maybe spiking networks.
@WongKinYiu ah ok. I just had an idea BTW to maybe save memory when caching. Currently --cache loads an image, resizes it to the training --img-size (i.e. 640), and leaves it in RAM for all the dataloader workers to access when they want.
For datasets with large images, i.e. 1920x1080, caching at 640 will save RAM, but for datasets like COCO, resizing all of the images to 640 would actually use RAM unnecessarily, as many images are smaller than 640. So perhaps I could move the resize operation out of the caching function. This would reduce RAM requirements, but then the image would need to be resized every epoch (4 times per epoch when using the mosaic loader), so there would be a slight hit to speed, though small compared to loading from the hard disk.
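A rough sketch of that idea, using hypothetical helper names (cache, img_files) rather than the actual dataloader code: cache each image once at its native resolution and only resize it when the dataloader fetches it.

```python
import cv2

def load_cached_image(cache, img_files, index, img_size=640):
    # Keep the raw image in the RAM cache at native resolution;
    # resize to the training size only when the dataloader asks for it.
    img = cache.get(index)
    if img is None:
        img = cv2.imread(img_files[index])   # BGR, original resolution
        cache[index] = img                   # cached once, unresized
    h0, w0 = img.shape[:2]
    r = img_size / max(h0, w0)               # scale so the longer side == img_size
    if r != 1:                               # skip images already at the target size
        img = cv2.resize(img, (int(w0 * r), int(h0 * r)))
    return img, (h0, w0)
```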
@AlexeyAB just saw your message about rejecting the job offer from xnor.ai. haha, that's a big shame. It might not be too late though, maybe Ali could vouch for you if you applied to them through Apple now. Did you do an in person interview with them before?
@glenn-jocher Via Zoom. A lot of more significant things happened in my life, so no, not a big shame ) But overall, the company is very successful.
@WongKinYiu I ported the yolov4.cfg from this repo into mine, and made sure it's trainable. Not all of the attributes of each class are used, especially in the yolo layers, but the correct feature fusion and mish activations are used, and training seems to proceed with no errors now. I may need to implement a more memory-friendly version of Mish though: https://github.com/ultralytics/yolov3/issues/1098
Do you have any pytorch mish implementations you'd recommend?
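One common way to cut Mish memory in PyTorch is a custom autograd Function that saves only the input and recomputes the intermediate terms in the backward pass, rather than letting autograd store every intermediate tensor. A minimal sketch (not necessarily what either repo ended up using):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MishFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)                 # keep only the input tensor
        return x * torch.tanh(F.softplus(x))

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        tsp = torch.tanh(F.softplus(x))          # recomputed here, not stored
        grad = tsp + x * (1 - tsp * tsp) * torch.sigmoid(x)  # d/dx [x * tanh(softplus(x))]
        return grad_output * grad

class MemoryEfficientMish(nn.Module):
    def forward(self, x):
        return MishFunction.apply(x)
```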
@WongKinYiu you mentioned you had some free GPUs. Could you use one to train yolov4.cfg on ultralytics? Then we can get an apples-to-apples comparison with the latest yolov3-spp results. I can also add it to the training plots. You can use a smaller --batch as necessary; the repo handles the loss scaling accordingly now, so you can change --batch without worrying about --accum, which is deprecated. This does multi-scale 320 to 640, and tests at 640.
python train.py --img 320 640 --batch 8 --weights '' --cfg yolov4.cfg --data coco2014.data
@glenn-jocher
OK, I will stop P6 experiments and train it first.
But I have trained a CSPDarknet53s model, which is the same as the one in https://github.com/WongKinYiu/CrossStagePartialNetworks/tree/pytorch, using the January and April repos. The early-April repo seems to fit YOLOv3-SPP very well. Using the same model, I get +2% AP50 on YOLOv3-SPP but -3% AP50 on CSPDarknet53s-PANet-SPP when training with the January and April repos.
And when I use docker to install the late-April repo, it shows that some layers are not registered. I will try to update to the latest repo today; if I run into any problems, I will give you feedback.
@WongKinYiu actually wait, you are right, there are unimplemented layers still. The basic cfg file works, and I've removed the error, but many of the yolo layer attributes are not used. OK just hold on for a bit then.
@glenn-jocher Thank you very much!
@AlexeyAB @WongKinYiu have you guys tried training yolov4.cfg without mish? Mish seems to use massive amounts of GPU RAM, in PyTorch at least. I did a benchmark and found almost 3X the GPU RAM requirements for yolov4.cfg vs yolov3-spp.cfg.
I'm worried this may put a lot of people off and hinder wider adoption. I'm also worried about exportability through onnx to tflite, coreml etc with a custom activation function like this.
@glenn-jocher hello,
Yes, we did. It drops about 0.6% AP. Maybe you can use this cfg: https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620276493
@WongKinYiu ah ok, good to know. Let me play around with it a bit, maybe I'll have some ideas.
@glenn-jocher Mish (mish in the backbone) improves AP for CSPDarknet-PANet but degrades AP for CSPResNeXt-PANet
@AlexeyAB @WongKinYiu I think I finally discovered why none of these PANet models were working well on ultralytics. I started training a relu version of yolov4, with poor initial results, and I realized I have a hard-coded stride array in place in my repo: 32, 16, 8, the yolov3-spp output order. PANet is reversed, so I think I've been scaling my P3 and P5 anchors incorrectly in all of these latest models. I will reverse the order and train again.
My new repo includes strides in the model yaml files, so they are parameterized along with the model. I might try and take it a step further and solve for them automatically during a preliminary forward pass. Anyway, my mistake!
Should be fixed now, more or less, in https://github.com/ultralytics/yolov3/commit/9cc4951d4fb8df0cf1c9fed5e60c01c150e78a0c
stride = [32, 16, 8] # P5, P4, P3 strides
if 'panet' in cfg or 'yolov4' in cfg: # stride order reversed
stride = list(reversed(stride))
@glenn-jocher
OK, thank you. So maybe the reason my P6 model gets low loss but also low AP is that one stride is missing.
@WongKinYiu ahh, yes, this could affect your P6 also. I think when I ran my P6 I manually swapped the strides, but of course the master branch does not have this, so if you clone and try to run P6, you will also be affected. I'm sorry! I don't have a 100% foolproof solution, my commit above is a bit of a band-aid unfortunately.
My new repo has the stride info included as part of the model yaml. Here is the entire yolov3-spp.yaml for example. The modules are either custom, like Conv(), or pytorch modules, like nn.Conv2d().
I'm thinking a more automl-like solution would do 1 forward pass on a fixed image shape, i.e. 1x3x640x640, and automatically compute strides at each Detect() module. This would completely remove the human from the loop, causing fewer problems. But for now I have this.
# parameters
nc: 80 # number of classes
strides: [8, 16, 32] # P3, P4, P5 strides
# anchors
anchors:
- [10,13, 16,30, 33,23] # P3
- [30,61, 62,45, 59,119] # P4
- [116,90, 156,198, 373,326] # P5
# darknet53 backbone
backbone:
# [from, number, module, args]
[[-1, 1, Conv, [32, 3, 1]], # 0
[-1, 1, Conv, [64, 3, 2]], # 1-P1/2
[-1, 1, Bottleneck, [64]],
[-1, 1, Conv, [128, 3, 2]], # 3-P2/4
[-1, 2, Bottleneck, [128]],
[-1, 1, Conv, [256, 3, 2]], # 5-P3/8
[-1, 8, Bottleneck, [256]],
[-1, 1, Conv, [512, 3, 2]], # 7-P4/16
[-1, 8, Bottleneck, [512]],
[-1, 1, Conv, [1024, 3, 2]], # 9-P5/32
[-1, 4, Bottleneck, [1024]], # 10
]
# yolov3-spp head
# na = len(anchors[0]) // 2  (anchors per output layer)
head:
[[-1, 1, Bottleneck, [1024, False]],
[-1, 1, Conv, [512, 1, 1]],
[-1, 1, SPP, [512, [5, 9, 13]]],
[-1, 1, Conv, [1024, 3, 1]],
[-1, 1, Conv, [512, 1, 1]],
[-1, 1, Conv, [1024, 3, 1]],
[-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]], # 17 (P5-large)
[-3, 1, Conv, [256, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 8], 1, Concat, [1]], # cat backbone P4
[-1, 1, Bottleneck, [512, False]],
[-1, 1, Bottleneck, [512, False]],
[-1, 1, Conv, [256, 1, 1]],
[-1, 1, Conv, [512, 3, 1]],
[-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]], # 25 (P4-medium)
[-3, 1, Conv, [128, 1, 1]],
[-1, 1, nn.Upsample, [None, 2, 'nearest']],
[[-1, 6], 1, Concat, [1]], # cat backbone P3
[-1, 1, Bottleneck, [256, False]],
[-1, 2, Bottleneck, [256, False]],
[-1, 1, nn.Conv2d, [na * (nc + 5), 1, 1]], # 31 (P3-small)
[[-1, 25, 17], 1, Detect, [nc, strides, anchors]], # Detect(P3, P4, P5)
]
@WongKinYiu @AlexeyAB ok, I've updated my new repo with auto-strides now, computed during a single forward pass during model init, so there is no more human error possible. The only likely remaining error source is that the anchor order may be accidentally reversed. The class counts in the detection layers are automatically compared to the classes in the data when building the model also, so specifying an incorrect class count in the yaml will not break the training either (which happens all the time to users now).
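Roughly, the stride computation can look like this. This is a simplified sketch rather than the actual model code; it assumes the model returns the raw P3/P4/P5 feature maps for a dummy input.

```python
import torch

def infer_strides(model, img_size=640):
    # Forward a dummy image once at init and derive each output stride from the
    # ratio of input height to feature-map height. Assumes model(x) returns the
    # list of detection feature maps, which is a simplification.
    model.eval()
    with torch.no_grad():
        outputs = model(torch.zeros(1, 3, img_size, img_size))
    return [int(img_size / o.shape[-2]) for o in outputs]  # e.g. [8, 16, 32]
```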
I will also add anchor-order error checking to the error-check list that runs before training starts. This is really exciting, I'm slowly removing every possible route that users could use to 'break' the training, and reducing the hyperparameters they can modify to an absolute minimum. I think this will really help a lot more people train custom datasets successfully. I'm going to add a kmeans step also that runs before training starts, so you can specify your own anchors if you want, or leave them empty for the training algorithm to create its own.
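The kmeans step could be as simple as this sketch (scipy kmeans on label widths/heights; the actual pre-training check may differ):

```python
import numpy as np
from scipy.cluster.vq import kmeans

def autoanchor(wh, n=9):
    """wh: (N, 2) array of label widths/heights in pixels at training resolution.
    Returns n anchors sorted by area (small -> large, matching P3 -> P5 order)."""
    s = wh.std(0)                      # whiten for k-means stability, then undo
    k, _ = kmeans(wh / s, n, iter=30)
    k *= s
    return k[np.argsort(k.prod(1))]
```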
I should have the repo out soon, I'm still cleaning it up and working out the kinks.
@glenn-jocher
Thank you, can I start training https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620270472 using latest repo?
@WongKinYiu yes I think you can start training, but be aware that many of the attributes in the yolo class won't have the same effect as they do here (though these also have no effect when training yolov3-spp.cfg to 43mAP)
After fixing the stride issue, I started training a yolov4-relu.cfg, using the same anchors as yolov3-spp, to isolate the training changes caused by the new architecture (i.e. all else being equal). Unfortunately the new training (orange) is coming in below the yolov3-spp metrics (blue), at least at this early stage. Training will take about two weeks, so I'll leave it running. The only difference between this and what you would run @WongKinYiu is Mish and the updated anchors.
@glenn-jocher Hello,
Do you train https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-620270472 using a V100 GPU? I cannot train yolov4 even with batch=4 due to OOM.
@WongKinYiu yes I know! The gpu memory usage in pytorch for yolov4 is off the charts. See https://github.com/ultralytics/yolov3/issues/1098#issuecomment-620194657
The above plot is training yolov4-relu.cfg, which just swaps the mish activations for relu. I can't really train normal yolov4 either; it's just not practical. In addition to the 3X GPU memory usage, the training speed is about 2X slower, due primarily to the smaller batches.
@WongKinYiu but to answer your question, I'm training on a T4. It's slower but more economical, and has 15 GB of memory. One epoch with yolov4.cfg takes about 150 minutes. One epoch of yolov4-relu.cfg takes about 70 min, only slightly slower than yolov3-spp.cfg, and about the same memory.
@glenn-jocher
Thanks, currently I have changed the backbone to CSPDarknet53s and train with batch=4.
One additional question: can P3-P6 and P3-P7 models be trained normally using the latest repo, or do I need to modify parts of the code?
@WongKinYiu @AlexeyAB I was reviewing the EfficientDet paper again and was surprised by a training value I'd missed before. Their weight decay is 4e-5, batch-size 128.
In ultralytics/yolov3 (and darknet and ASFF also), weight decay is about 10x larger, and I believe it is also applied more frequently since we use batch-size 64 (weight decay is applied once per optimizer update, every 64 images). This seems like quite a large discrepancy. I verified the value in the official repo as well: https://github.com/google/automl/blob/17637b428d46b002f4586b9541f6b7bbf2fab4bf/efficientdet/hparams_config.py#L213
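A rough back-of-the-envelope comparison, ignoring learning-rate schedules and optimizer differences, just to show how far apart the two settings are when decay is applied once per optimizer update:

```python
# Per-image decay strength ~ weight_decay / images_per_optimizer_update
ultralytics_per_image = 5e-4 / 64     # decay=0.0005, nominal batch 64  -> ~7.8e-6
efficientdet_per_image = 4e-5 / 128   # decay=0.00004, batch 128        -> ~3.1e-7
print(ultralytics_per_image / efficientdet_per_image)  # ~25x stronger per image
```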
@glenn-jocher Hi,
Do you want to try using decay=0.00004 instead of the current value? https://github.com/AlexeyAB/darknet/blob/6cbb75d10b43a95f11326a2475d64500b11fa64e/cfg/yolov4.cfg#L11
Did you check whether correct anchors/masks for P3-P6 give better accuracy?
@AlexeyAB no, I have not tried training P6 yet. I'm busy trying to validate my new repo's training against my existing yolov3 repo. I made too many changes too fast and just discovered I wasn't applying weight decay properly, which caused a surge in mAP early on, in epochs 0-100, but then a peak at 100 and overtraining afterward, while the original ultralytics/yolov3 with weight decay trained nice and steady to a higher peak at around epoch 270 out of 300. But in the process of double-checking the weight decay I noticed EfficientDet uses a much lower value. I don't have free GPUs now, but when I get the new repo launched I will train a new model at a lower weight decay to compare.
P6 looks like it would actually be easier to add on a PANet like yolov4 than FPN like yolov3 in any case.
The current yolov4-relu training looks like this. This is a special yolov4 training just to compare architecture change effects in the absence of all the additional changes. So far I'm not seeing an improvement in the yolov4 architecture (orange) vs yolov3-spp (blue). We will have to wait a long time for this result, training is poking along at about 20 epochs per day on a T4.
@AlexeyAB @glenn-jocher
I am training a P6 model on a single 2080ti. ~20 epochs per day.
yolov4-mish: ~10 epochs per day due to the smaller batch size.
@glenn-jocher
So far I'm not seeing an improvement in the yolov4 architecture (orange) vs yolov3-spp (blue).
Maybe some of the advantage of the yolov4 architecture (CSP + PAN instead of FPN) can only be achieved by using a pre-trained weights file that was trained with BoF+BoS+Mish on ImageNet? Or maybe a large model should be trained longer.
I don't have free GPUs now, but when I get the new repo launched I will train a new model at a lower weight decay to compare.
Do you use decay=0.0005 now?
And where do you get free GPUs?
@AlexeyAB well, all of my GPUs at the moment are from a GCP credit that Ultralytics received when we participated in an accelerator last year called Decelera, in Mayakoba, Mexico. I'm not sure if it's $20k or possibly $100k, but to make the most efficient use of the credits, i.e. the most epochs/$, I'm training on T4s at about $400 each per month. Unfortunately they are quite slow, about 2-3X slower than a 2080 Ti, but they do come with 15 GB RAM, which is nice.
Day-to-day tests I run on Colab, since I don't actually have any usable local GPUs. I do all my work on a MacBook Pro, which does not support CUDA eGPUs due to some ridiculous fight between Apple and NVIDIA.
I'm waiting for the 3080 Ti to come out later this year, and then I think I might finally buy a box for myself, probably a 4-GPU box from Lambda Labs for about $8k.
Yes the pretraining might be the missing link. I do all of my training from scratch actually, after I saw better results this way in a side by side comparison last year. Unfortunately I usually see earlier overfitting on coco when using pretrained weights.
@AlexeyAB yes I'm using the same weight decay as here, 5E-4
@glenn-jocher @AlexeyAB
I just finished training CSPDarknet53s-PANet-SPP with optimized anchors for 512x512 using ultralytics.
Using python3 test.py --cfg cd53s.cfg --weights last.pt --img 512 --iou-thr 0.7 to test, I get 43.2% AP_0.50:0.95.
Using python3 test.py --cfg cd53s.cfg --weights last.pt --img 608 --iou-thr 0.7 to test, I get 44.4% AP_0.50:0.95.
@WongKinYiu @glenn-jocher Nice!
Does --iou-thr 0.7 use IOU_thresh=0.7 or 0.5...0.95 for the AP calculation? Do you use --augment? And do you get this result on valid or on the test-dev eval server?
@AlexeyAB
- It is the 5k set; I would like to evaluate the test-dev set tomorrow. (Does --iou-thr 0.7 use IOU_thresh=0.7 or 0.5...0.95 for AP calculation?)
- 0.7 is the IoU threshold of NMS. (What was fixed to get good results?)
- Just fixing the stride order of the yolo layers. (So you don't use --augment?)
- No, I do not use --augment.
By the way, if I use best.pt, it gets 43.4%/44.5% with input resolution 512x512/608x608.
@WongKinYiu
So you get 44.4% AP50...95 on the 5k-valid dataset, which gives +1.3% AP compared to YOLOv3-SPP's 43.1% AP50...95. https://github.com/ultralytics/yolov3#map
What mini-batch size did you use?
YOLOv4 416x416 gives 47.1% AP50...95 on the 5k-valid set.
@WongKinYiu ah great! 44.5 is a great result, and yes 0.7 --iou is best for mAP@0.5:0.95.
Can you post your results.png here? Also can you link to this cfg?
Yes, in general you should probably always use best.pt after training is complete. last.pt can be used to --resume for example, but in most/all? cases best.pt should provide the best results.
@AlexeyAB yes this should be an apples to apples comparison, current mAP for yolov3-spp on coco2014 is 43.1 vs 44.5, so +1.4.
How do you get 47.1mAP?? That sounds extremely good, what's the catch?
@glenn-jocher This is for the 5k-val set, not for test-dev.
YOLOv4 416x416: 47.1% AP50...95 on the 5k-val set, 41.2% AP50...95 on test-dev.
@WongKinYiu thanks, I compared the cfg with https://github.com/ultralytics/yolov3/blob/master/cfg/yolov4-relu.cfg, and the only difference is that 4 convolutions and 2 routes have been commented out. I'm surprised that you got such a good result then, because when I trained yolov4-relu.cfg up to 100 epochs it was still trailing yolov3-spp (see https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-622516664), so I cancelled the training. Can you post your results.txt so I can plot it side by side with yolov3-spp and my yolov4-relu training?
@WongKinYiu also, it's interesting that your best.pt is 500 MB; this means that the optimizer is packaged with the weights. I committed a change about a month ago to strip the optimizers from best.pt and last.pt once training is complete, so this means you have a slightly dated version of the repo. Do you know which git hash your repo has? It would be good for me to compare, to make sure I haven't made any large changes since then, since your version seems to be working well.
@glenn-jocher
I use the 1 May repo.
I always get an error at this line https://github.com/ultralytics/yolov3/blob/master/test.py#L211 when I use python train.py ..., so I cannot evaluate performance during training. But I can run python test.py ... without any error.
And the strip function is at https://github.com/ultralytics/yolov3/blob/master/train.py#L365, which comes after the evaluation step. I think that is why the "strip the optimizers from best.pt and last.pt once training was complete" step is not processed.
By the way, the P6 model gets only 40% AP.
@WongKinYiu oh, I understand. Training completes 300 epochs, and then tries to use pycocotools for final mAP, then crashes, so strip function is not run.
Great, that is very recent, there are essentially zero changes that should affect training and testing in that time compared to the current repo!
Could you post your results.txt file here for cd53s?
@WongKinYiu btw, this is a numpy-pycocotools bug. If you install numpy == 1.17 it resolves the issue.
@glenn-jocher
Could you post your results.txt file here for cd53s?
Due to the numpy-pycocotools bug, the strip and rename steps were not processed, and my results.txt has already been overwritten by the new training.
@WongKinYiu hmm ok. Maybe I can try and get it directly from the model, since the training info was never stripped from it it may still be there. I'll try and do that and plot it against https://github.com/AlexeyAB/darknet/issues/4346#issuecomment-622516664
I was able to retrieve the training results from the model file. I plotted them against yolov3-spp (43.1 mAP) and yolov4-relu (training cancelled after 100 epochs). Results are overall very similar, though overtraining seems to be a bit less, and objectness in particular looks a bit different. What was your training command for this training?
EfficientDet: Scalable and Efficient Object Detection