Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
Apache License 2.0

YOLOX-nano is slower than YOLOX-tiny on PyTorch CPU #101

Closed james77777778 closed 3 years ago

james77777778 commented 3 years ago

Thank you for sharing this project!

I encountered a strange problem: YOLOX-nano's inference speed is abnormal when I run it with PyTorch on CPU. I tried to benchmark different models on an arm64 computer using the following script:

import time

import torch
import cv2
import numpy as np
from yolox.data.data_augment import preproc as preprocess
from yolox.utils import fuse_model, postprocess

from exps.default.nano import Exp as nano_exp
from exps.default.yolox_tiny import Exp as tiny_exp

MODEL_PATH = "checkpoints/yolox_nano.pth"  # "checkpoints/yolox_tiny.pth"
EXP = nano_exp  # tiny_exp
URL = 0
TOTAL_SECONDS = 60 * 60 * 24 * 365
CONF_THRESHOLD = 0.5
NMS_THRESHOLD = 0.65
NUM_CLASSES = 80
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)
input_shape = (416, 416)

if __name__ == '__main__':
    # open webcam
    cam = cv2.VideoCapture(URL)

    # model
    model = EXP().get_model()
    num_classes = NUM_CLASSES
    ckpt = torch.load(MODEL_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    model = fuse_model(model)
    model.eval()

    # records
    cv2_read_times, inference_times, total_times = [], [], []
    now_seconds = 0

    print("----------START----------")
    start_time = time.perf_counter()
    try:
        while True:
            # read from stream
            cv2_read_st = time.perf_counter()
            ret, img = cam.read()
            cv2_read_ed = time.perf_counter()

            # inference
            forward_st = time.perf_counter()
            img, ratio = preprocess(img, input_shape, mean, std)
            img = torch.from_numpy(img).unsqueeze(0)
            with torch.no_grad():
                outputs = model(img)
                outputs = postprocess(outputs, num_classes, CONF_THRESHOLD, NMS_THRESHOLD)
            forward_ed = time.perf_counter()

            # record
            cv2_read_times.append(cv2_read_ed - cv2_read_st)
            inference_times.append(forward_ed - forward_st)
            total_times.append(forward_ed - cv2_read_st)

            # show
            elapsed_time = time.perf_counter()
            if elapsed_time - start_time > now_seconds:
                print("Read: {:.2f} ms, Inference: {:.2f} ms {:.2f} FPS, Total: {:.2f} FPS".format(
                    np.mean(cv2_read_times) * 1000,
                    np.mean(inference_times) * 1000,
                    1. / np.mean(inference_times),
                    1. / np.mean(total_times)
                ))
                cv2_read_times.clear()
                inference_times.clear()
                total_times.clear()
                now_seconds += 1

            if elapsed_time - start_time > TOTAL_SECONDS:
                break

    except KeyboardInterrupt:
        pass
    finally:
        cam.release()
        print("-----------END-----------")

On my device (ODROID-C4), YOLOX-tiny gets 0.75 FPS while YOLOX-nano gets only 0.32 FPS (far slower than YOLOX-tiny).

But when I test with onnxruntime, YOLOX-tiny gets 1.3 FPS and YOLOX-nano gets 3.2 FPS, so the result there looks fine.
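
For reference, my onnxruntime test is roughly the following (a simplified sketch; it assumes the checkpoints were exported to ONNX beforehand, e.g. to a hypothetical path like checkpoints/yolox_nano.onnx, and it only times the forward pass):

import time

import cv2
import numpy as np
import onnxruntime as ort

from yolox.data.data_augment import preproc as preprocess

# Hypothetical path; the checkpoint was exported to ONNX beforehand.
session = ort.InferenceSession("checkpoints/yolox_nano.onnx")
input_name = session.get_inputs()[0].name

cam = cv2.VideoCapture(0)
times = []
while len(times) < 100:
    ret, frame = cam.read()
    if not ret:
        break
    # same 416x416 preprocessing as the PyTorch script
    img, ratio = preprocess(frame, (416, 416), (0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
    st = time.perf_counter()
    outputs = session.run(None, {input_name: img[None, :, :, :]})
    times.append(time.perf_counter() - st)
cam.release()

print("Inference: {:.2f} FPS".format(1.0 / np.mean(times)))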

I can confirm that preprocessing and postprocessing take a similar amount of time in both cases, so the FPS gap comes from model inference.

Maybe this is a device-specific problem? (Some expensive operation in YOLOX-nano?)

Joker316701882 commented 3 years ago

Yeah, this is a device-specific problem. The depth-wise convolutions in YOLOX-nano are not friendly to devices with low memory bandwidth. It may also be caused by the suboptimal DWConv implementation in PyTorch. Maybe you can try TorchScript.
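
Something like this (a rough sketch, assuming the same fused, eval-mode `model` and a preprocessed `img` tensor from your script; untested on my side):

import torch

# Rough sketch: trace the model once with a dummy 1x3x416x416 input,
# then reuse the traced graph in the benchmark loop.
dummy = torch.randn(1, 3, 416, 416)
with torch.no_grad():
    traced_model = torch.jit.trace(model, dummy)

# In the loop, call the traced module instead of the original one:
with torch.no_grad():
    outputs = traced_model(img)
    outputs = postprocess(outputs, num_classes, CONF_THRESHOLD, NMS_THRESHOLD)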

james77777778 commented 3 years ago

Thanks for the quick reply!

I think I will stick with onnxruntime for the faster inference speed.