Bad performance for some operations, like converting an Image to an NDArray #1278

I tested the processing time of some models on DJL and on libtorch via Python. I'm sure DJL matches C++ and Python if you count only the pure inference time. But some operations in DJL perform poorly. For example, Image.toNDArray costs nearly 0.01 s, while the pure inference time of yolov5s is only 0.008 s. The similar operation in Python (to_tensor from torchvision) costs only 0.005 s. Is there any way to improve the performance?

The operation mentioned above is not the only one with poor performance. For example, the YoloTranslator provided in this project is generally slower (0.027 s) than the full object detection process in Python (from OpenCV to NMS, 0.014 s).

The code used to test Image.toNDArray:

    public void Image2NDArrayTest() throws Exception {
        int inpNum = 1000;
        int height = 640, width = 640;

        // Scalar(0, 0, 0) is black, so this builds an all-black 640x640 test image.
        Scalar black = new Scalar(0, 0, 0);
        Image img = ImageUtils.mat2Image(new Mat(height, width, CvType.CV_8UC3, black));

        NDManager ndManager = NDManager.newBaseManager(Device.gpu());

        System.out.printf("start(%d times)\n", inpNum);
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < inpNum; i++) {
            // The sub-manager frees the NDArray created in each iteration.
            try (NDManager subManager = ndManager.newSubManager()) {
                // Operation under test.
                img.toNDArray(subManager);
            }
        }
        long endTime = System.currentTimeMillis();
        System.out.printf("time: %f s/img%n\n", (endTime - startTime) / 1000.0 / inpNum);
    }

The code used to test the translator:

    public <IN, OUT> void test(ZooModel<IN, OUT> model, IN inp, int inpNum, int warmupNum) throws Exception {

        try (Predictor<IN, OUT> predictor = model.newPredictor()) {
            if (warmupNum > 0) {
                System.out.printf("warming up(%dtimes)\n", warmupNum);
                for (int i = 0; i < warmupNum; i++) {
                    predictor.predict(inp);
                }
            }
            System.out.printf("testing(%dtimes)\n", inpNum);
            long startTime = System.currentTimeMillis();
            for (int i = 0; i < inpNum; i++) {
                // Full pipeline: pre-processing, inference, and post-processing.
                predictor.predict(inp);
            }
            long endTime = System.currentTimeMillis();
            System.out.printf("time: %f s/img%n\n", (endTime - startTime) / 1000.0 / inpNum);
        }
    }

The code used to test YOLOv5 in Python:

import time

import numpy as np


def yolov5_test():
    yolov5_weight = './weights/'
    device = 'cuda'
    imgs_num = 1000
    height, width = 640, 640
    detector = YoloV5Detector(yolov5_weight, device)
    test(detector.detect, lambda: np.zeros((height, width, 3), int), imgs_num)


def test(model, inp, inp_num, warmup_num=50):
    # inp is a callable that produces a fresh input for every call.
    if warmup_num > 0:
        for _ in range(warmup_num):
            model(inp())
    start_time = time.time()
    for _ in range(inp_num):
        model(inp())
    end_time = time.time()
    print(f"time: {(end_time - start_time) / inp_num} s/img")

The implementation of YoloV5Detector:

import torch
from torchvision import transforms


class YoloV5Detector:
    def __init__(self, weights, device):
        self.device = device
        self.model = torch.jit.load(weights).to(device)
        self.conf_thres = 0.35
        self.iou_thres = 0.45
        self.agnostic_nms = False
        self.max_det = 1000
        self.classes = [0]
        self.transformer = transforms.Compose([transforms.ToTensor()])
        # Warm up
        _ = self.model(torch.zeros(1, 3, 640, 480).to(self.device))

    def preprocess_img(self, img):
        # Reverse the channel order (BGR to RGB), convert to a tensor, and add a batch dimension.
        return self.transformer(img[:, :, ::-1].copy()).unsqueeze(0).to(self.device, dtype=torch.float32)

    def detect(self, img):
        # Pre-processing
        img = self.preprocess_img(img)
        # Detection
        pred = self.model(img)[0]
        # NMS
        pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, self.classes, self.agnostic_nms,
                                   max_det=self.max_det)
        pred = pred[0].detach().cpu()
        return pred

frankfliu commented 2 years ago

@hongyaohongyao Thanks for reporting this issue. We will take a look at the Image.toNDArray() performance issue.

I ran a further test today. The problem is java.awt.BufferedImage.getRGB(), which costs more than 0.006 s. I think some objects in DJL are over-encapsulated; it may be better to let programmers operate on NDList/NDArray, or on a custom lightweight intermediate data type, directly.
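To make the getRGB() cost concrete, here is a minimal JDK-only sketch that compares reading pixels one at a time through getRGB(x, y) with reading the backing byte array of the raster. It does not reproduce DJL's code: the class name PixelReadSketch, the per-pixel access pattern, and the TYPE_3BYTE_BGR image type are all assumptions chosen for illustration, and the timings will vary by machine and JVM.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

public class PixelReadSketch {

    // Slow path: one getRGB(x, y) call per pixel, unpacked to interleaved R, G, B values.
    static int[] readPerPixel(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        int[] rgb = new int[w * h * 3];
        int k = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int p = img.getRGB(x, y);
                rgb[k++] = (p >> 16) & 0xFF;
                rgb[k++] = (p >> 8) & 0xFF;
                rgb[k++] = p & 0xFF;
            }
        }
        return rgb;
    }

    // Fast path: read the backing byte array directly. Only valid for TYPE_3BYTE_BGR,
    // where the bytes are stored interleaved as B, G, R.
    static int[] readFromRaster(BufferedImage img) {
        if (img.getType() != BufferedImage.TYPE_3BYTE_BGR) {
            throw new IllegalArgumentException("expected TYPE_3BYTE_BGR");
        }
        byte[] bgr = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        int[] rgb = new int[bgr.length];
        for (int i = 0; i < bgr.length; i += 3) {
            rgb[i] = bgr[i + 2] & 0xFF;
            rgb[i + 1] = bgr[i + 1] & 0xFF;
            rgb[i + 2] = bgr[i] & 0xFF;
        }
        return rgb;
    }

    public static void main(String[] args) {
        BufferedImage img = new BufferedImage(640, 640, BufferedImage.TYPE_3BYTE_BGR);
        int runs = 100;

        // Warm up both paths so the JIT has a chance to compile them before timing.
        for (int i = 0; i < 20; i++) {
            readPerPixel(img);
            readFromRaster(img);
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            readPerPixel(img);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            readFromRaster(img);
        }
        long t2 = System.nanoTime();

        System.out.printf("getRGB per pixel: %.3f ms/img%n", (t1 - t0) / 1e6 / runs);
        System.out.printf("raster bytes:     %.3f ms/img%n", (t2 - t1) / 1e6 / runs);
    }
}
```

Both methods return the same values for a TYPE_3BYTE_BGR image, so the comparison isolates the access cost rather than any difference in the result.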

frankfliu commented 2 years ago

@hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory using a high-performance native implementation like OpenCV.

Thanks for the reply, I get it.

xwaeaewcrhomesysplug commented 2 years ago

Sorry, a bit curious, but how is the ImageFactory related to imageobj.toNDArray()? I thought it uses NDManager only. Also, why does he need to create a new sub-manager? I thought it would create more overhead, especially in a try block.

hongyaohongyao commented 2 years ago

> Sorry, a bit curious, but how is the ImageFactory related to imageobj.toNDArray()? I thought it uses NDManager only. Also, why does he need to create a new sub-manager? I thought it would create more overhead, especially in a try block.

I guess:

  1. You can see the implementation of BufferedImageFactory.
  2. The sub-manager inherits the environment from the parent manager, releases NDArrays automatically, and prevents memory leaks.

steinhae commented 2 years ago

Removing the parallel from BufferedImageFactory.fromNDArray:111 did the trick for me. It seems the overhead of the parallel operation is bigger than the gain here.
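A rough way to check this kind of trade-off on your own machine is sketched below. It assumes the parallel construct is a parallel IntStream over the pixel indices, which is a guess and not a copy of the DJL source; the class name ParallelOverheadSketch and the planar R, G, B input layout are likewise hypothetical. Whether the parallel version wins depends on the image size, the core count, and how busy the common fork-join pool is.

```java
import java.util.Random;
import java.util.stream.IntStream;

public class ParallelOverheadSketch {

    // Packs planar R, G, B bytes (all R, then all G, then all B) into ARGB ints,
    // using either a sequential or a parallel stream over the pixel indices.
    static int[] pack(byte[] chw, int width, int height, boolean parallel) {
        int area = width * height;
        int[] argb = new int[area];
        IntStream range = IntStream.range(0, area);
        if (parallel) {
            range = range.parallel();
        }
        // Each index writes to its own slot, so no synchronization is needed.
        range.forEach(i -> {
            int r = chw[i] & 0xFF;
            int g = chw[area + i] & 0xFF;
            int b = chw[2 * area + i] & 0xFF;
            argb[i] = 0xFF000000 | (r << 16) | (g << 8) | b;
        });
        return argb;
    }

    // Average wall-clock time of one pack() call, in milliseconds.
    static double millisPerCall(byte[] chw, int w, int h, boolean parallel, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            pack(chw, w, h, parallel);
        }
        return (System.nanoTime() - start) / 1e6 / runs;
    }

    public static void main(String[] args) {
        int w = 640;
        int h = 640;
        byte[] chw = new byte[3 * w * h];
        new Random(0).nextBytes(chw);

        // Warm up both variants before measuring.
        millisPerCall(chw, w, h, false, 50);
        millisPerCall(chw, w, h, true, 50);

        System.out.printf("sequential: %.3f ms/img%n", millisPerCall(chw, w, h, false, 200));
        System.out.printf("parallel:   %.3f ms/img%n", millisPerCall(chw, w, h, true, 200));
    }
}
```

The per-pixel work here is only a few bit operations, so the cost of splitting the range and scheduling tasks can easily outweigh it, which would be consistent with the observation above.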

frankfliu commented 2 years ago

