Closed: hongyaohongyao closed this issue 2 years ago
The operations I mentioned above are not the only ones with poor performance. For example, the YoloTranslator provided in this project is generally slower (0.027 s) than a full object detection pipeline in Python (from OpenCV to NMS, 0.014 s).
The code used to test Image.toNDArray:
@Test
public void Image2NDArrayTest() throws Exception {
    int inpNum = 1000;
    int height = 640, width = 640;
    // An all-zero scalar, so the test image is black
    Scalar black = new Scalar(0, 0, 0);
    Image img = ImageUtils.mat2Image(new Mat(height, width, CvType.CV_8UC3, black));
    NDManager ndManager = NDManager.newBaseManager(Device.gpu());
    System.out.printf("start(%d times)%n", inpNum);
    long startTime = System.currentTimeMillis();
    for (int i = 0; i < inpNum; i++) {
        // The sub-manager releases the NDArray created in this iteration
        try (NDManager subManager = ndManager.newSubManager()) {
            img.toNDArray(subManager);
        }
    }
    long endTime = System.currentTimeMillis();
    System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
}
The code used to test the translator:
public <IN, OUT> void test(ZooModel<IN, OUT> model, IN inp, int inpNum, int warmupNum) throws Exception {
    try (Predictor<IN, OUT> predictor = model.newPredictor()) {
        if (warmupNum > 0) {
            System.out.printf("warming up(%d times)%n", warmupNum);
            for (int i = 0; i < warmupNum; i++) {
                predictor.predict(inp);
            }
        }
        System.out.printf("testing(%d times)%n", inpNum);
        long startTime = System.currentTimeMillis();
        for (int i = 0; i < inpNum; i++) {
            predictor.predict(inp);
        }
        long endTime = System.currentTimeMillis();
        System.out.printf("time: %f s/img%n", (endTime - startTime) / 1000.0 / inpNum);
    }
}
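For reference, the warm-up-then-measure pattern used in the harness above can be written without any DJL types. This is only a sketch: the class and method names (TimingSketch, meanSeconds) are made up for illustration, and it uses System.nanoTime(), which is the JDK's monotonic clock for elapsed-time measurement, instead of System.currentTimeMillis().

```java
/** Hypothetical helper for illustration; not part of DJL. */
final class TimingSketch {
    /**
     * Runs the task `warmup` times without timing it, then runs it
     * `runs` times and returns the mean wall-clock seconds per call.
     */
    static double meanSeconds(Runnable task, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            task.run(); // untimed warm-up (JIT compilation, lazy initialization)
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1e9 / runs;
    }
}
```

A call such as TimingSketch.meanSeconds(() -> work(), 50, 1000) would mirror the 50 warm-up and 1000 timed iterations used in this thread, where work() stands for whatever operation is being measured.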
The code used to test YOLOv5 in Python:
import time

import numpy as np
import torch


def yolov5_test():
    yolov5_weight = './weights/yolov5s.torchscript.pt'
    device = 'cuda'

    imgs_num = 1000
    height, width = 640, 640
    detector = YoloV5Detector(yolov5_weight, device)
    test(detector.detect, lambda: np.zeros((height, width, 3), int), imgs_num)


def test(model, inp, inp_num, warmup_num=50):
    if warmup_num > 0:
        print(f"warming({warmup_num} times)")
        for _ in range(warmup_num):
            model(inp())
    print(f"testing({inp_num} times)")
    # Wait for pending GPU work so it is not counted in the timed loop
    torch.cuda.synchronize()
    start_time = time.time()

    for _ in range(inp_num):
        model(inp())

    # Wait for the GPU to finish before stopping the clock
    torch.cuda.synchronize()
    end_time = time.time()
    print(f"time: {(end_time - start_time) / inp_num} s/img")
The implementation of YoloV5Detector:
from torchvision import transforms

# non_max_suppression is not defined in this snippet; it is presumably
# the helper from the YOLOv5 repository.


class YoloV5Detector:
    def __init__(self, weights, device):
        self.device = device
        self.model = torch.jit.load(weights).to(device)
        self.conf_thres = 0.35
        self.iou_thres = 0.45
        self.agnostic_nms = False
        self.max_det = 1000
        self.classes = [0]
        self.transformer = transforms.Compose([transforms.ToTensor()])
        # Warm up
        _ = self.model(torch.zeros(1, 3, 640, 480).to(self.device))

    def preprocess_img(self, img):
        # Reverse the channel order (BGR -> RGB), convert to a tensor, add a batch dimension
        return self.transformer(img[:, :, ::-1].copy()).unsqueeze(0).to(self.device, dtype=torch.float32)

    def detect(self, img):
        # Preprocess
        img = self.preprocess_img(img)
        # Detect
        pred = self.model(img)[0]
        # NMS
        pred = non_max_suppression(pred, self.conf_thres, self.iou_thres, self.classes, self.agnostic_nms,
                                   max_det=self.max_det)
        pred = pred[0].detach().cpu()
        return pred
@hongyaohongyao Thanks for reporting this issue. We will take a look at the Image.toNDArray() performance issue.
I ran a further test today. The problem is java.awt.BufferedImage.getRGB(), which costs more than 0.006 s. I think some objects in DJL are over-encapsulated; it may be better to let programmers operate on NDList/NDArray, or on a lightweight custom intermediate format, directly.
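To make the getRGB() cost concrete, here is a small sketch of two ways to read pixels from a BufferedImage using only the JDK. It is not DJL's implementation; the class name PixelReadSketch and both method names are made up for illustration, and the direct path assumes a TYPE_3BYTE_BGR image with no row padding. Which one is faster on a given machine should be measured and not assumed.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;

/** Hypothetical comparison helper; not DJL code. */
final class PixelReadSketch {
    /** Reads each pixel through getRGB(x, y) and unpacks it into interleaved RGB bytes. */
    static byte[] perPixel(BufferedImage img) {
        int w = img.getWidth();
        int h = img.getHeight();
        byte[] out = new byte[w * h * 3];
        int k = 0;
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y); // packed as 0xAARRGGBB
                out[k++] = (byte) (rgb >> 16);
                out[k++] = (byte) (rgb >> 8);
                out[k++] = (byte) rgb;
            }
        }
        return out;
    }

    /** Reads the backing byte array directly and swaps BGR to RGB. */
    static byte[] fromRaster(BufferedImage img) {
        if (img.getType() != BufferedImage.TYPE_3BYTE_BGR) {
            throw new IllegalArgumentException("sketch only handles TYPE_3BYTE_BGR");
        }
        byte[] bgr = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        if (bgr.length != img.getWidth() * img.getHeight() * 3) {
            throw new IllegalArgumentException("unexpected row padding");
        }
        byte[] out = new byte[bgr.length];
        for (int i = 0; i < bgr.length; i += 3) {
            out[i] = bgr[i + 2];     // R
            out[i + 1] = bgr[i + 1]; // G
            out[i + 2] = bgr[i];     // B
        }
        return out;
    }
}
```

Both methods return the same interleaved RGB bytes for a TYPE_3BYTE_BGR image; the second one skips the per-pixel color-model conversion that getRGB() performs.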
@hongyaohongyao If BufferedImage is the bottleneck, you can consider creating your own ImageFactory using a high-performance native implementation such as OpenCV.
Thanks for the reply, I get it.
Sorry, a bit curious: how is ImageFactory related to imageobj.toNDArray()? I thought it only uses the NDManager. Also, why does he need to create a new sub-manager? I thought that would add overhead, especially inside a try block.
I guess:
- You can see the implementation of BufferedImageFactory.
- The sub-manager inherits the environment from the parent manager, releases NDArrays automatically, and prevents memory leaks.
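The second point can be illustrated with plain JDK code. The sketch below is a stand-in and not DJL's NDManager: the class ScopeSketch and its methods newSubScope, attach, and ownedCount are hypothetical names. It shows how a child scope opened in a try-with-resources block releases everything attached to it when the block exits, so nothing accumulates in the parent across loop iterations.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a resource-owning manager; not DJL's NDManager. */
final class ScopeSketch implements AutoCloseable {
    private final ScopeSketch parent;
    private final List<AutoCloseable> owned = new ArrayList<>();
    private boolean closed;

    ScopeSketch(ScopeSketch parent) {
        this.parent = parent;
        if (parent != null) {
            parent.owned.add(this); // the parent also cleans up forgotten children
        }
    }

    ScopeSketch newSubScope() {
        return new ScopeSketch(this);
    }

    /** Registers a resource that is released when this scope closes. */
    void attach(AutoCloseable resource) {
        owned.add(resource);
    }

    int ownedCount() {
        return owned.size();
    }

    @Override
    public void close() {
        if (closed) {
            return;
        }
        closed = true;
        // Copy first: closing a child scope removes it from this list
        List<AutoCloseable> toClose = new ArrayList<>(owned);
        owned.clear();
        for (AutoCloseable resource : toClose) {
            try {
                resource.close();
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
        if (parent != null) {
            parent.owned.remove(this);
        }
    }
}
```

Used as try (ScopeSketch sub = root.newSubScope()) { sub.attach(resource); }, the resource is closed at the end of every iteration. The cost is one small object per iteration, traded against resources that would otherwise stay attached to the long-lived parent.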
Removing the parallel from BufferedImageFactory.fromNDArray:111 did the trick for me. It seems the overhead of the parallel operation is bigger than the gain here.
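As a rough illustration of that trade-off, the sketch below runs the same per-pixel packing loop either sequentially or on a parallel IntStream. It is not the DJL code at the line mentioned above; the class ParallelPixelSketch, the method pack, and the planar float layout are assumptions made for the example. When the work per element is this small, thread coordination can outweigh the gain, so both variants should be timed on the target image size.

```java
import java.util.stream.IntStream;

/** Hypothetical example; not taken from BufferedImageFactory. */
final class ParallelPixelSketch {
    /**
     * Packs planar RGB floats in [0, 1] (all R values, then all G, then all B)
     * into one 0xRRGGBB int per pixel.
     */
    static int[] pack(float[] chw, int pixels, boolean parallel) {
        if (chw.length != pixels * 3) {
            throw new IllegalArgumentException("expected 3 * pixels values");
        }
        int[] out = new int[pixels];
        IntStream indices = IntStream.range(0, pixels);
        if (parallel) {
            indices = indices.parallel();
        }
        // Each index writes only its own slot, so no synchronization is needed
        indices.forEach(i -> {
            int r = Math.round(chw[i] * 255f) & 0xFF;
            int g = Math.round(chw[pixels + i] * 255f) & 0xFF;
            int b = Math.round(chw[2 * pixels + i] * 255f) & 0xFF;
            out[i] = (r << 16) | (g << 8) | b;
        });
        return out;
    }
}
```

Both modes produce identical output; only the scheduling differs.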
Now we have an OpenCV extension.
I tested the processing time of some models on DJL and on libtorch from Python. I'm sure DJL matches the performance of C++ or Python if we only count the pure inference time. But some operations in DJL perform badly. For example, Image.toNDArray costs nearly 0.01 s, while the pure inference time of yolov5s is only 0.008 s, and the similar operation in Python (to_tensor from torchvision) costs only 0.005 s. Is there any solution to improve the performance?