Tianxiaomo / pytorch-YOLOv4

PyTorch, ONNX and TensorRT implementation of YOLOv4
Apache License 2.0

Why does it run so slowly? #2

Closed 123wk45678 closed 4 years ago

123wk45678 commented 4 years ago

I ran camera.py on CPU. Why is it so slow?

Tianxiaomo commented 4 years ago

Yes, it is. I don't know why.

yangzhegithub commented 4 years ago

@Tianxiaomo GPU is slow too...

vraivon commented 4 years ago

I also implemented a version, and for one 1024*768 image the average inference time on a 1080 Ti (~64ms) is 20ms slower than v3 (~44ms).

vraivon commented 4 years ago

According to https://github.com/AlexeyAB/darknet/issues/5308#issuecomment-619316320, v4 achieves better accuracy but slower inference than v3 at the same input size.

ersheng-ai commented 4 years ago

The default is CPU mode.

timelesszxl commented 4 years ago

@Tianxiaomo In tool/utils.py, the post-processing of the detection boxes at lines 423 to 447 is very slow because there are too many for loops. This code needs to be refactored 😁
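
For context, the usual fix for this kind of hotspot is to replace the per-box Python loops with batched tensor operations. A minimal sketch of the idea, with illustrative names (`boxes`, `confs`, `conf_thresh` are not the repo's actual variables):

```python
import torch

def filter_boxes_vectorized(boxes, confs, conf_thresh=0.4):
    """Confidence filtering without Python loops.

    boxes: [N, 4] tensor of box coordinates
    confs: [N, num_classes] tensor of per-class confidences
    """
    # One batched op replaces the inner per-class loop
    max_conf, max_id = torch.max(confs, dim=1)
    # A boolean mask replaces the if-inside-a-for-loop
    keep = max_conf > conf_thresh
    return boxes[keep], max_conf[keep], max_id[keep]
```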

Tianxiaomo commented 4 years ago

@timelesszxl Clearly, the loop does not take much of the time:

image to tensor : 0.005000
tensor to cuda  : 0.000000
predict         : 5.035578
nms             : 0.000000
for             : 0.065024
total           : 5.040578

image to tensor : 0.005001
tensor to cuda  : 0.000000
predict         : 5.294691
nms             : 0.000000
for             : 0.060999
total           : 5.299691
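
(A side note on measuring numbers like these: CUDA kernels launch asynchronously, so per-stage wall-clock timings are only meaningful if the device is synchronized around each stage. A minimal sketch of how such a breakdown can be collected; the helper name is ours, not the repo's:)

```python
import time
import torch

def timed(label, fn, *args):
    """Run fn(*args) and print the wall-clock time for that stage."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # flush kernels queued by earlier stages
    t0 = time.time()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this stage's kernels to finish
    print(f'{label} : {time.time() - t0:.6f}')
    return out
```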

timelesszxl commented 4 years ago

@Tianxiaomo On GPU, excluding the first-run loading time, your predict time for a single image is 20ms while the for loop takes 60ms. Looked at that way, it really is a lot: the forward pass takes only 20ms, yet post-processing reaches 60ms.

timelesszxl commented 4 years ago

The image I tested is 2560*1440, GPU RTX 2080 Ti:

image to tensor : 0.001167
tensor to cuda  : 0.000850
predict         : 0.085939
model           : 0.0200
for             : 0.06665
nms             : 0.000687
total           : 0.088644

Our for-loop times are consistent. But this part really is slow for yolov4. Let me first see whether I can optimize it; if I can, I'll release the code.
@Tianxiaomo The author can take a look at this too. Handshake!

timelesszxl commented 4 years ago

One addition: I split predict into model + for.

choubin commented 4 years ago

Why does my GPU (a 1080 Ti) take as long as 108ms?

ersheng-ai commented 4 years ago

> Why does my GPU (a 1080 Ti) take as long as 108ms?

The final bbox drawing step may be the bottleneck. There are two ways to draw the bboxes: in model (drawn with tensors) and out of model (drawn with numpy). You can look at the get_region_boxes_in_model call in yolo_layer.py, and also get_region_boxes_out_model.
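
To illustrate the tradeoff between the two paths (a rough sketch under our own naming, not the repo's actual functions: the tensor path stays on the GPU, while the numpy path forces a device-to-host copy before any box math):

```python
import numpy as np
import torch

def decode_boxes_in_model(tx, ty, tw, th, anchor_w, anchor_h):
    """Tensor-only decoding: runs on whatever device the inputs live on."""
    bx = torch.sigmoid(tx)          # no transfer; stays on the GPU
    by = torch.sigmoid(ty)
    bw = torch.exp(tw) * anchor_w
    bh = torch.exp(th) * anchor_h
    return torch.stack((bx, by, bw, bh), dim=-1)

def decode_boxes_out_model(tx, ty, tw, th, anchor_w, anchor_h):
    """Numpy decoding: .cpu() synchronizes and copies before any math."""
    tx, ty, tw, th = (t.detach().cpu().numpy() for t in (tx, ty, tw, th))
    bx = 1.0 / (1.0 + np.exp(-tx))
    by = 1.0 / (1.0 + np.exp(-ty))
    bw = np.exp(tw) * anchor_w
    bh = np.exp(th) * anchor_h
    return np.stack((bx, by, bw, bh), axis=-1)
```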

GlassyWing commented 4 years ago

The author's post-processing uses too many loops, even three levels of nesting. In my own implementation the inference speed (prediction + post-processing) is only 30-40ms (1070 Ti), but the author's is 250-300ms.

ersheng-ai commented 4 years ago

> The author's post-processing uses too many loops, even three levels of nesting. In my own implementation the inference speed (prediction + post-processing) is only 30-40ms (1070 Ti), but the author's is 250-300ms.

I have tried to move almost all of the post-processing procedures (except NMS) back into the model. There is a new method named yolo_forward which is ONNX compatible and serves as an alternative to the original get_region_boxes_in_model and get_region_boxes_out_model.

You can try to pull the latest code.
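
The key constraint for ONNX export is that decoding must be expressed as traceable tensor ops with no data-dependent Python loops. A rough sketch of that style (illustrative only, not the actual yolo_forward; the tensor layout and width/height decoding are simplified):

```python
import torch

def decode_onnx_friendly(pred, num_classes, stride):
    """Decode a YOLO head using only traceable tensor ops.

    pred: [B, A*(5+num_classes), H, W] raw head output (assumed layout).
    sigmoid/arange/broadcasting all map to standard ONNX operators,
    so torch.onnx.export can trace this without Python-loop fallbacks.
    """
    B, _, H, W = pred.shape
    A = pred.shape[1] // (5 + num_classes)
    p = pred.view(B, A, 5 + num_classes, H, W)

    # Grid offsets built with arange + broadcasting instead of nested loops
    gx = torch.arange(W, device=pred.device).view(1, 1, 1, W)
    gy = torch.arange(H, device=pred.device).view(1, 1, H, 1)

    bx = (torch.sigmoid(p[:, :, 0]) + gx) * stride   # box centers in pixels
    by = (torch.sigmoid(p[:, :, 1]) + gy) * stride
    obj = torch.sigmoid(p[:, :, 4])                  # objectness
    cls = torch.sigmoid(p[:, :, 5:])                 # per-class confidence
    return bx, by, obj, cls   # w/h decoding via anchors omitted for brevity
```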

GlassyWing commented 4 years ago

> The author's post-processing uses too many loops, even three levels of nesting. In my own implementation the inference speed (prediction + post-processing) is only 30-40ms (1070 Ti), but the author's is 250-300ms.
>
> I have tried to move almost all of the post-processing procedures (except NMS) back into the model. There is a new method named yolo_forward which is ONNX compatible and serves as an alternative to the original get_region_boxes_in_model and get_region_boxes_out_model.
>
> You can try to pull the latest code.

Em... the test results above were measured with the latest code.

KelvinHuang666 commented 4 years ago

@GlassyWing Can you show me your code for prediction? Thanks!

KelvinHuang666 commented 4 years ago

@GlassyWing Predicting one image (640*320) takes 160ms on a Titan. It's too slow.

ersheng-ai commented 4 years ago

> @GlassyWing Predicting one image (640*320) takes 160ms on a Titan. It's too slow.

I have tried on a Tesla T4; the Python post-processing code is the bottleneck (over 0.15s). Improvements will be made later.

-----------------------------------
          Preprocess : 0.002534
     Model Inference : 0.034289
-----------------------------------
-----------------------------------
     get_region_boxes : 0.113353
                  nms : 0.037506
   post process total : 0.150865
-----------------------------------

GlassyWing commented 4 years ago

> @GlassyWing Can you show me your code for prediction? Thanks!

OK, the source code can be found at https://github.com/GlassyWing/yolo3_deepsort; it also supports yolo4. If you want to see the prediction process, look at https://github.com/GlassyWing/yolo3_deepsort/blob/master/yolo3/models/models.py#L149, lines 149 to 192.

ersheng-ai commented 4 years ago

The running time of get_region_boxes() is virtually eliminated after the latest push:

-----------------------------------
          Preprocess : 0.002126
     Model Inference : 0.036722
-----------------------------------
-----------------------------------
     get_region_boxes : 0.000471
                  nms : 0.026056
   post process total : 0.026533
-----------------------------------
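
With decoding vectorized, NMS is now the dominant post-processing cost in the log above. One common way to keep that part fast as well is a single batched kernel instead of a per-box Python loop; a minimal sketch using torchvision's built-in op (our suggestion, not necessarily what the repo does):

```python
import torch
from torchvision.ops import nms

def fast_nms(boxes, scores, iou_thresh=0.45):
    """Suppress overlapping boxes with one C++/CUDA kernel call.

    boxes: [N, 4] tensor in (x1, y1, x2, y2) format
    scores: [N] tensor of confidences
    """
    keep = nms(boxes, scores, iou_thresh)  # indices of surviving boxes
    return boxes[keep], scores[keep]
```
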
ersheng-ai commented 4 years ago

This is the latest inference time per iteration with a 416*416 input and batch_size=1 on a Tesla T4:

-----------------------------------
           Preprocess : 0.001206
      Model Inference : 0.034135
-----------------------------------
-----------------------------------
       max and argmax : 0.003229
                  nms : 0.000674
Post processing total : 0.003903
-----------------------------------
Predicted in 0.043026 seconds.

I will close this issue.