dmlc / gluon-cv

Gluon CV Toolkit
http://gluon-cv.mxnet.io
Apache License 2.0

GPU inference too slow #955

Closed TomMao23 closed 3 years ago

TomMao23 commented 4 years ago

GluonCV's CPU inference speed is relatively fast, but GPU inference seems too slow. mobilenet1.0_yolo3 tested on a GTX 1080 Ti (CUDA 9.0) only reaches 16 FPS at 416*416. The code follows the tutorials.

chinakook commented 4 years ago

You can use TVM to speed it up. After some op fusion you will get a 2x speedup, but you need to learn how to compile your model with TVM to the cudnn (not plain cuda) backend.
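
For illustration, a rough sketch of what that compilation could look like, assuming a recent TVM (the API details vary by version, and the model name, input shape, and opt_level below are assumptions, not taken from this thread):

```python
import tvm
from tvm import relay
import gluoncv as gcv

# Load a pretrained model; name and shape here are illustrative only.
net = gcv.model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
net.hybridize()
shape_dict = {'data': (1, 3, 416, 416)}

# Convert the Gluon model to Relay IR.
mod, params = relay.frontend.from_mxnet(net, shape_dict)

# "-libs=cudnn" asks TVM to offload supported ops (e.g. conv2d) to
# cuDNN instead of generating its own CUDA kernels.
target = 'cuda -libs=cudnn'
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)
```

The GluonCV deployment tutorial linked two comments below follows the same pattern with a plain cuda target.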

TomMao23 commented 4 years ago

Thanks. I have tried TVM acceleration before, following this tutorial. But there the target is cuda; is there a similar tutorial for the cudnn backend?

TomMao23 commented 4 years ago

https://docs.tvm.ai/tutorials/frontend/deploy_ssd_gluoncv.html#sphx-glr-tutorials-frontend-deploy-ssd-gluoncv-py

chinakook commented 4 years ago

Refer to https://discuss.tvm.ai/

Jerryzcn commented 4 years ago

which version of mxnet are you using?

TomMao23 commented 4 years ago

> which version of mxnet are you using?

test on both mxnet-cu90mkl==1.5.0 and 1.4.1
gtx1080ti cuda9.0.176 cudnn 7.0.5

```python
import time
import os
import cv2
import gluoncv as gcv
import mxnet as mx
from gluoncv.data.transforms import image as timage

def pre_process(frame):
    # BGR (OpenCV) -> RGB, then into an MXNet NDArray
    frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
    # Resize to the network input size
    img = timage.imresize(frame, 320, 320, interp=9)
    #orig_img = img.asnumpy().astype('uint8')
    # HWC uint8 -> CHW float32 in [0, 1], then ImageNet normalization
    img = mx.nd.image.to_tensor(img)
    img = mx.nd.image.normalize(img, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))
    tensor = img.expand_dims(0)  # add the batch dimension
    return tensor

def main():
    classes = ["truck", "car", "van", "bus"]
    net = gcv.model_zoo.get_model('yolo3_mobilenet1.0_custom', classes=classes, pretrained_base=False)
    net.load_parameters('mobilenet1.0_yolo3_highway_0078_0.7039.params', ctx=mx.gpu(0))
    net.hybridize()

    src = "./china_highway_dataset/images/"
    imgs = sorted(os.listdir(src))

    for i, filename in enumerate(imgs):
        print(filename)
        # Load frame
        frame = cv2.imread(os.path.join(src, filename))
        # Image pre-processing
        tensor = pre_process(frame)
        # Run frame through network
        class_IDs, scores, bounding_boxes = net(tensor.as_in_context(mx.gpu(0)))
        class_IDs.wait_to_read()  # block until the GPU result is ready
        # Display the result
        #img = gcv.utils.viz.cv_plot_bbox(orig_img, bounding_boxes[0], scores[0], class_IDs[0], class_names=net.classes)
        #gcv.utils.viz.cv_plot_image(img)
        #cv2.waitKey(1)
        if i == 1:
            # Start the timer after the first two frames to exclude warmup
            time1 = time.time()
    # FPS over the remaining len(imgs) - 2 frames
    print((len(imgs) - 2) / (time.time() - time1))

if __name__ == "__main__":
    main()
```
zhreshold commented 4 years ago

You should use a large batch to speed up YOLO3 on GPU. With a single image, the major bottleneck is not the computation but the data copy.
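
For illustration, a minimal sketch of what batched inference could look like, reusing pre_process() and net from the snippet above (the image paths are placeholders):

```python
import cv2
import mxnet as mx

# Placeholder paths; pre_process() and net come from the earlier snippet.
paths = ['frame0.jpg', 'frame1.jpg', 'frame2.jpg', 'frame3.jpg']
frames = [cv2.imread(p) for p in paths]

# Each pre_process() call returns a 1x3xHxW tensor; stack them along the
# batch axis so the GPU runs a single forward pass for all frames.
batch = mx.nd.concat(*[pre_process(f) for f in frames], dim=0)
class_IDs, scores, bounding_boxes = net(batch.as_in_context(mx.gpu(0)))
class_IDs.wait_to_read()
```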

TomMao23 commented 4 years ago

> You should use a large batch to speed up YOLO3 on GPU. With a single image, the major bottleneck is not the computation but the data copy.

A speed-up that requires a batch size greater than 1 is unusable and meaningless in most applications. Copying data from CPU to GPU is indeed an important bottleneck, but at the current 16 FPS, model inference still looks like the main bottleneck. I will try other methods to seek speed-up while keeping the batch size at 1.
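
One way to check this claim is to time the host-to-device copy and the forward pass separately; a rough sketch, again reusing net from the earlier snippet (the input is a dummy tensor, not a real frame):

```python
import time
import mxnet as mx

tensor = mx.nd.ones((1, 3, 320, 320))  # stand-in for a preprocessed frame
mx.nd.waitall()

t0 = time.time()
gpu_tensor = tensor.as_in_context(mx.gpu(0))
gpu_tensor.wait_to_read()      # force the host-to-device copy to finish
copy_s = time.time() - t0

t0 = time.time()
class_IDs, scores, bounding_boxes = net(gpu_tensor)
class_IDs.wait_to_read()       # block until the forward pass completes
infer_s = time.time() - t0

print('copy: %.4fs  forward: %.4fs' % (copy_s, infer_s))
```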

TomMao23 commented 4 years ago

```python
import os
import time
import mxnet as mx

# Enable cuDNN convolution autotuning before the first forward pass
os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '1'

# Load the exported symbol/params checkpoint
sym, arg_params, aux_params = mx.model.load_checkpoint('yolo3_mobilenet1.0_coco', 0)
batch_shape = (1, 3, 320, 320)

# Execute with MXNet
executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null', force_rebind=True)
executor.copy_params_from(arg_params, aux_params)

# Warmup
print('Warming up MXNet')
for i in range(0, 1):
    y_gen = executor.forward(is_train=False, data=mx.nd.ones(batch_shape))
    y_gen[0].wait_to_read()

# Timing
print('Starting MXNet timed run')
start = time.time()
for i in range(0, 100):
    y_gen = executor.forward(is_train=False, data=mx.nd.ones(batch_shape))
    y_gen[0].wait_to_read()
end = time.time()
print(end - start)
```

I recently experimented with another piece of code, shown above. It shows yolo3_mobilenet1.0 reaching about 140 FPS on my machine (GTX 1080 Ti, 140 forward passes in 1.03 s), which is consistent with the speed the model should have. So the raw MXNet forward pass is indeed that fast, and it may be preprocessing or some other step that slows things down. This phenomenon is also common with other models and the official tutorial inference demos. At present I have not figured out which step is slow, but my earlier demo was written the same way as the GluonCV website tutorial (and also set 'MXNET_CUDNN_AUTOTUNE_DEFAULT' = '1').
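
One way to narrow this down is to time each stage of the original loop separately; a rough sketch, assuming the same imgs, src, net, and pre_process() as in the first snippet:

```python
import os
import time
import cv2
import mxnet as mx

t_read = t_pre = t_fwd = 0.0
for filename in imgs:
    t0 = time.time()
    frame = cv2.imread(os.path.join(src, filename))
    t_read += time.time() - t0

    t0 = time.time()
    tensor = pre_process(frame)
    t_pre += time.time() - t0

    t0 = time.time()
    class_IDs, scores, bounding_boxes = net(tensor.as_in_context(mx.gpu(0)))
    class_IDs.wait_to_read()  # include the GPU wait in the forward time
    t_fwd += time.time() - t0

print('read: %.2fs  preprocess: %.2fs  forward: %.2fs' % (t_read, t_pre, t_fwd))
```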

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.