chequanghuy / TwinLiteNet

Performance bottlenecks on CPU #9

Closed: harrylal closed this issue 9 months ago

harrylal commented 11 months ago

A big shoutout to the amazing folks who brought the TwinLiteNet model to life. I'm genuinely impressed by what you've accomplished here. Thanks a million for your outstanding contribution! 🙌👍

I'd like to discuss a matter where I could use your insights. I've been running the TwinLiteNet model on an Intel i9 CPU, and it delivers about 0.5 frames per second (fps). By comparison, YOLOv8n, despite having significantly more parameters, achieves around 15 fps on the same CPU.

Current Behavior:

The TwinLiteNet model performs at approximately 0.5 fps on the specified CPU configuration.

Expected Behavior:

I'm looking for insight into the likely bottlenecks, along with any tips or tweaks that could improve TwinLiteNet's CPU inference speed, ideally bringing it closer to (or beyond) what YOLOv8n achieves despite its much larger parameter count.
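(Editorial aside, not part of the original report: before digging into the architecture, a few generic PyTorch CPU-inference settings are worth ruling out. The sketch below shows thread pinning, inference_mode, and an optional channels-last conversion; the thread count and whether channels-last helps TwinLiteNet specifically are assumptions to verify on the target machine.)

import torch
from model import TwinLite as net

# Pin PyTorch to the physical core count; thread oversubscription can hurt CPU throughput.
torch.set_num_threads(8)  # assumption: adjust to the actual number of physical cores

model = net.TwinLiteNet()
model = torch.nn.DataParallel(model)
model.load_state_dict(torch.load('pretrained/best.pth', map_location='cpu'))
model = model.module.cpu().eval()

# Optional: channels-last memory format sometimes speeds up convolutions on CPU backends.
model = model.to(memory_format=torch.channels_last)

# Dummy input matching the 640x360 preprocessing in the script below.
img = torch.rand(1, 3, 360, 640).to(memory_format=torch.channels_last)

# inference_mode disables autograd bookkeeping more aggressively than no_grad.
with torch.inference_mode():
    out = model(img)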

Steps to Reproduce:

Modify and run test_image.py as shown below to run inference on CPU and log FPS. Reference: https://github.com/chequanghuy/TwinLiteNet/issues/2#issuecomment-1667666914

import os
import shutil
import time

import cv2
import numpy as np
import torch

from model import TwinLite as net


def Run(model, img):
    # Resize to the network input size (width 640, height 360)
    img = cv2.resize(img, (640, 360))
    img_rs = img.copy()

    # BGR -> RGB, HWC -> CHW
    img = img[:, :, ::-1].transpose(2, 0, 1)
    img = np.ascontiguousarray(img)
    img = torch.from_numpy(img)
    img = torch.unsqueeze(img, 0)  # add a batch dimension
    img = img.float() / 255.0

    with torch.no_grad():
        start_time = time.time()
        img_out = model(img)
        print("FPS: ", 1.0 / (time.time() - start_time))

    # Two heads: drivable-area (x0) and lane-line (x1) segmentation
    x0 = img_out[0]
    x1 = img_out[1]

    _, da_predict = torch.max(x0, 1)
    _, ll_predict = torch.max(x1, 1)

    DA = da_predict.byte().cpu().data.numpy()[0] * 255
    LL = ll_predict.byte().cpu().data.numpy()[0] * 255
    img_rs[DA > 100] = [255, 0, 0]  # drivable area in blue (BGR)
    img_rs[LL > 100] = [0, 255, 0]  # lane lines in green (BGR)

    return img_rs


model = net.TwinLiteNet()
model = torch.nn.DataParallel(model)
model.load_state_dict(torch.load('pretrained/best.pth', map_location='cpu'))
model = model.module.cpu()  # unwrap DataParallel and run on CPU
model.eval()

image_list = os.listdir('images')
shutil.rmtree('results', ignore_errors=True)  # avoid crashing if 'results' does not exist
os.mkdir('results')
for i, imgName in enumerate(image_list):
    img = cv2.imread(os.path.join('images', imgName))
    img = Run(model, img)
    cv2.imwrite(os.path.join('results', imgName), img)
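(Note on measurement, not part of the original script: the first forward pass on CPU typically includes one-time setup costs, so a warm-up plus averaged timing gives more stable FPS numbers than timing a single call. A minimal sketch, assuming img is the preprocessed (1, 3, 360, 640) tensor built in Run():)

import time
import torch

def benchmark_fps(model, img, warmup=5, iters=30):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(img)          # warm-up passes, excluded from timing
        start = time.perf_counter()
        for _ in range(iters):
            model(img)
        elapsed = time.perf_counter() - start
    return iters / elapsed      # averaged frames per second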
chequanghuy commented 11 months ago

@harrylal Thank you very much for your comments; I will try to find the cause soon. I would also be very grateful if you could share your feedback if you find the cause on your end.

harrylal commented 11 months ago

@chequanghuy Thank you for your quick response. I have done some thorough model profiling, and it appears that the encoder layer with 131 kernels may be contributing to the performance issue on CPU. I would highly value your insights on this. [Screenshot: PyTorch profiler output]
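(For reference, a minimal sketch of the kind of per-operator CPU profiling referred to above, using torch.profiler; the input shape matches the 640x360 preprocessing in the script earlier in this thread, and the sort key and row limit are just reasonable defaults, not taken from the original screenshot.)

import torch
from torch.profiler import profile, ProfilerActivity

from model import TwinLite as net

# Assumes the same checkpoint layout as the script above.
model = net.TwinLiteNet()
model = torch.nn.DataParallel(model)
model.load_state_dict(torch.load('pretrained/best.pth', map_location='cpu'))
model = model.module.cpu()
model.eval()

dummy = torch.randn(1, 3, 360, 640)  # NCHW input matching the 640x360 resize

with torch.no_grad():
    model(dummy)  # warm-up so one-time costs do not dominate the profile
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        model(dummy)

# Sort by total CPU time to surface the most expensive operators
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))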