Tianxiaomo / pytorch-YOLOv4

PyTorch, ONNX and TensorRT implementation of YOLOv4
Apache License 2.0

Why does it run so slowly? #2

Closed 123wk45678 closed 4 years ago

123wk45678 commented 4 years ago

I ran camera.py on CPU. Why is it so slow?

Tianxiaomo commented 4 years ago

Yes, it is. I don't know why.

yangzhegithub commented 4 years ago

@Tianxiaomo GPU is slow too...

vraivon commented 4 years ago

I also implemented a version, and for one 1024*768 image the average inference time on a 1080 Ti (~64ms) is 20ms slower than v3 (~44ms).

vraivon commented 4 years ago

According to https://github.com/AlexeyAB/darknet/issues/5308#issuecomment-619316320, v4 achieves better accuracy but slower inference than v3 at the same input size.

ersheng-ai commented 4 years ago

The default is CPU mode.

timelesszxl commented 4 years ago

@Tianxiaomo In tool/utils.py, the post-processing of the detection boxes at lines 423 to 447 is very slow because there are too many for loops. This code needs to be refactored 😁
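
For context, the usual fix for this kind of hotspot is to replace the per-box Python loops with batched tensor operations. A minimal sketch of the idea, with illustrative names (`boxes`, `confs`, `conf_thresh` are not the repo's actual variables):

```python
import torch

def filter_boxes_vectorized(boxes, confs, conf_thresh=0.4):
    """Confidence filtering without Python loops.

    boxes: [N, 4] tensor of box coordinates
    confs: [N, num_classes] tensor of per-class confidences
    """
    # One batched op replaces the inner per-class loop
    max_conf, max_id = torch.max(confs, dim=1)
    # A boolean mask replaces the if-inside-a-for-loop
    keep = max_conf > conf_thresh
    return boxes[keep], max_conf[keep], max_id[keep]
```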

Tianxiaomo commented 4 years ago

@timelesszxl Clearly, the loop does not take much of the time:

image to tensor : 0.005000
tensor to cuda  : 0.000000
predict         : 5.035578
nms             : 0.000000
for             : 0.065024
total           : 5.040578

image to tensor : 0.005001
tensor to cuda  : 0.000000
predict         : 5.294691
nms             : 0.000000
for             : 0.060999
total           : 5.299691
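
(A side note on measuring numbers like these: CUDA kernels launch asynchronously, so per-stage wall-clock timings are only meaningful if the device is synchronized around each stage. A minimal sketch of how such a breakdown can be collected; the helper name is ours, not the repo's:)

```python
import time
import torch

def timed(label, fn, *args):
    """Run fn(*args) and print the wall-clock time for that stage."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush kernels queued by earlier stages
    t0 = time.time()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this stage's kernels to finish
    print(f'{label} : {time.time() - t0:.6f}')
    return out
```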

timelesszxl commented 4 years ago

@Tianxiaomo On GPU, excluding the first-run loading time, your predict time for a single image is 20ms while the for loop takes 60ms. Looked at that way, it really is a lot: the forward pass takes only 20ms, yet post-processing reaches 60ms.

timelesszxl commented 4 years ago

The image I tested is 2560*1440, GPU RTX 2080 Ti:

image to tensor : 0.001167
tensor to cuda  : 0.000850
predict         : 0.085939
model           : 0.0200
for             : 0.06665
nms             : 0.000687
total           : 0.088644

Our for-loop times are consistent. But this part really is slow for yolov4. Let me first see whether I can optimize it; if I can, I'll release the code.
@Tianxiaomo The author can take a look at this too. Handshake!

timelesszxl commented 4 years ago

One addition: I split predict into model + for.

choubin commented 4 years ago

Why does my GPU (a 1080 Ti) take as long as 108ms?

ersheng-ai commented 4 years ago

> Why does my GPU (a 1080 Ti) take as long as 108ms?

The final bbox drawing step may be the bottleneck. There are two ways to draw the bboxes: in model (drawn with tensors) and out of model (drawn with numpy). You can look at the get_region_boxes_in_model call in yolo_layer.py, and also get_region_boxes_out_model.
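
To illustrate the tradeoff between the two paths (a rough sketch under our own naming, not the repo's actual functions: the tensor path stays on the GPU, while the numpy path forces a device-to-host copy before any box math):

```python
import numpy as np
import torch

def decode_boxes_in_model(tx, ty, tw, th, anchor_w, anchor_h):
    """Tensor-only decoding: runs on whatever device the inputs live on."""
    bx = torch.sigmoid(tx)          # no transfer; stays on the GPU
    by = torch.sigmoid(ty)
    bw = torch.exp(tw) * anchor_w
    bh = torch.exp(th) * anchor_h
    return torch.stack((bx, by, bw, bh), dim=-1)

def decode_boxes_out_model(tx, ty, tw, th, anchor_w, anchor_h):
    """Numpy decoding: .cpu() synchronizes and copies before any math."""
    tx, ty, tw, th = (t.detach().cpu().numpy() for t in (tx, ty, tw, th))
    bx = 1.0 / (1.0 + np.exp(-tx))
    by = 1.0 / (1.0 + np.exp(-ty))
    bw = np.exp(tw) * anchor_w
    bh = np.exp(th) * anchor_h
    return np.stack((bx, by, bw, bh), axis=-1)
```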

GlassyWing commented 4 years ago

The author's post-processing uses too many loops, even three levels of nesting. In my own implementation the inference speed (prediction + post-processing) is only 30-40ms (1070 Ti), but the author's is 250-300ms.

ersheng-ai commented 4 years ago

> The author's post-processing uses too many loops, even three levels of nesting. In my own implementation the inference speed (prediction + post-processing) is only 30-40ms (1070 Ti), but the author's is 250-300ms.

I have tried to move almost all of the post-processing procedures (except NMS) back into the model. There is a new method named yolo_forward which is ONNX compatible and serves as an alternative to the original get_region_boxes_in_model and get_region_boxes_out_model.

You can try to pull the latest code.
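
The key constraint for ONNX export is that decoding must be expressed as traceable tensor ops with no data-dependent Python loops. A rough sketch of that style (illustrative only, not the actual yolo_forward; the tensor layout and width/height decoding are simplified):

```python
import torch

def decode_onnx_friendly(pred, num_classes, stride):
    """Decode a YOLO head using only traceable tensor ops.

    pred: [B, A*(5+num_classes), H, W] raw head output (assumed layout).
    sigmoid/arange/broadcasting all map to standard ONNX operators,
    so torch.onnx.export can trace this without Python-loop fallbacks.
    """
    B, _, H, W = pred.shape
    A = pred.shape[1] // (5 + num_classes)
    p = pred.view(B, A, 5 + num_classes, H, W)

    # Grid offsets built with arange + broadcasting instead of nested loops
    gx = torch.arange(W, device=pred.device).view(1, 1, 1, W)
    gy = torch.arange(H, device=pred.device).view(1, 1, H, 1)

    bx = (torch.sigmoid(p[:, :, 0]) + gx) * stride   # box centers in pixels
    by = (torch.sigmoid(p[:, :, 1]) + gy) * stride
    obj = torch.sigmoid(p[:, :, 4])                  # objectness
    cls = torch.sigmoid(p[:, :, 5:])                 # per-class confidence
    return bx, by, obj, cls   # w/h decoding via anchors omitted for brevity
```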

GlassyWing commented 4 years ago

> The author's post-processing uses too many loops, even three levels of nesting. In my own implementation the inference speed (prediction + post-processing) is only 30-40ms (1070 Ti), but the author's is 250-300ms.
>
> I have tried to move almost all of the post-processing procedures (except NMS) back into the model. There is a new method named yolo_forward which is ONNX compatible and serves as an alternative to the original get_region_boxes_in_model and get_region_boxes_out_model.
>
> You can try to pull the latest code.

Em... the test results above were measured with the latest code.

KelvinHuang666 commented 4 years ago

@GlassyWing Can you show me your code for prediction? Thanks!

KelvinHuang666 commented 4 years ago

@GlassyWing Predicting one image (640*320) takes 160ms on a Titan. It's too slow.

ersheng-ai commented 4 years ago

> @GlassyWing Predicting one image (640*320) takes 160ms on a Titan. It's too slow.

I have tried on a Tesla T4; the Python post-processing code is the bottleneck (over 0.15s). Improvements will be made later.

-----------------------------------
          Preprocess : 0.002534
     Model Inference : 0.034289
-----------------------------------
-----------------------------------
     get_region_boxes : 0.113353
                  nms : 0.037506
   post process total : 0.150865
-----------------------------------

GlassyWing commented 4 years ago

> @GlassyWing Can you show me your code for prediction? Thanks!

OK, the source code can be found at https://github.com/GlassyWing/yolo3_deepsort; it also supports yolo4. If you want to see the prediction process, look at https://github.com/GlassyWing/yolo3_deepsort/blob/master/yolo3/models/models.py#L149, lines 149 to 192.

ersheng-ai commented 4 years ago

The running time of get_region_boxes() is virtually eliminated after the latest push:

-----------------------------------
          Preprocess : 0.002126
     Model Inference : 0.036722
-----------------------------------
-----------------------------------
     get_region_boxes : 0.000471
                  nms : 0.026056
   post process total : 0.026533
-----------------------------------
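
With decoding vectorized, NMS is now the dominant post-processing cost in the log above. One common way to keep that part fast as well is a single batched kernel instead of a per-box Python loop; a minimal sketch using torchvision's built-in op (our suggestion, not necessarily what the repo does):

```python
import torch
from torchvision.ops import nms

def fast_nms(boxes, scores, iou_thresh=0.45):
    """Suppress overlapping boxes with one C++/CUDA kernel call.

    boxes: [N, 4] tensor in (x1, y1, x2, y2) format
    scores: [N] tensor of confidences
    """
    keep = nms(boxes, scores, iou_thresh)  # indices of surviving boxes
    return boxes[keep], scores[keep]
```
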
ersheng-ai commented 4 years ago

This is the latest inference time per iteration with a 416*416 input and batch_size=1 on a Tesla T4:

-----------------------------------
           Preprocess : 0.001206
      Model Inference : 0.034135
-----------------------------------
-----------------------------------
       max and argmax : 0.003229
                  nms : 0.000674
Post processing total : 0.003903
-----------------------------------
Predicted in 0.043026 seconds.

I will close this issue.