How to improve speed on CPU? C++ code slower than python [YOLOv4-tiny]

juacamgo commented 3 years ago

Hello all,

First of all, I want to thank @AlexeyAB for the great work done here.

I am just starting out with ML models. After a bit of research I ended up training a YOLOv4-tiny model using the Roboflow guide, and I implemented inference test code in both Python and C++.

I am planning to use the net to detect the position of a barcode in an image, running on a CPU. My model needs only one class, so I wonder whether even the tiny version of YOLOv4 is bigger than I need.

The CPU in my embedded system is an i.MX 6 Quad processor @ 1 GHz.

After training my model, I wrote some Python code (for initial inference testing on Windows) and got about 23 fps. I expected that implementing the same code in C++ would speed inference up a bit, but the opposite happened.

With the C++ implementation I get about 8-9 fps at the same input size (416x416), while with Python I get around 23-24 fps.

So, here are my questions:

1) Why is the Python code faster than the C++ code?
2) Can I modify the net in some way to make it faster? I ask because I am only working with one class.
3) Do you think it's possible to get that CPU running an ML net at 10-15 fps or more?

And the code implemented for Python:

import cv2
import time

CONFIDENCE_THRESHOLD = 0.2
NMS_THRESHOLD = 0.4
COLORS = [(255, 0, 0)]
IMAGE = cv2.imread("test/azteccode_test.png")

class_names = ["barcode"]

net = cv2.dnn.readNet("yolov4tiny/barcode-yolov4-tiny_final.weights", "yolov4tiny/barcode-yolov4-tiny-detector.cfg")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/255)

start = time.time()
classes, scores, boxes = model.detect(IMAGE, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
end = time.time()

start_drawing = time.time()
for (classid, score, box) in zip(classes, scores, boxes):
    color = COLORS[int(classid) % len(COLORS)]
    label = "%s : %f" % (class_names[classid[0]], score)
    cv2.rectangle(IMAGE, box, color, 2)
    cv2.putText(IMAGE, label, (box[0], box[1] - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)
end_drawing = time.time()

fps_label = "FPS: %.2f (excluding drawing time of %.2fms)" % (1 / (end - start), (end_drawing - start_drawing) * 1000)
cv2.putText(IMAGE, fps_label, (0, 450), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
cv2.imshow("Detections", IMAGE)
cv2.waitKey(0)
cv2.destroyAllWindows()

C++:

int main()
{
    classes.push_back("barcode");

    Net net = readNetFromDarknet(cfg, weights);
    net.setPreferableBackend(DNN_BACKEND_OPENCV);
    net.setPreferableTarget(DNN_TARGET_CPU);

    auto totalStart = std::chrono::steady_clock::now();
    DetectionModel model(net);
    model.setInputParams(1.0 / 255, Size(416, 416));

    auto dnnStart = std::chrono::steady_clock::now();
    std::vector<int> classIds;  // class ids are plain ints; MatShape is meant for tensor shapes
    std::vector<float> confidences;
    std::vector<Rect> boxes;
    model.detect(image, classIds, confidences, boxes, CONFIDENCE_THRESHOLD, NMS_THRESHOLD);
    auto dnnEnd = std::chrono::steady_clock::now();

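    for (size_t i = 0; i < boxes.size(); i++)
    {
        Rect box = boxes[i];
        // "Scalar color = (255, 0, 0)" applies the comma operator and yields Scalar(0).
        Scalar color(255, 0, 0);
        // Build the label with cv::format, using the class id of detection i.
        String label = cv::format("%s : %f", classes[classIds[i]].c_str(), confidences[i]);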
        rectangle(image, box, color, 2);
        putText(image, label, Point(box.x, box.y - 10), FONT_HERSHEY_COMPLEX, 0.5, color, 2);
    }

    auto totalEnd = std::chrono::steady_clock::now();

    float inferenceFps = 1000.0 / std::chrono::duration_cast<std::chrono::milliseconds>(dnnEnd - dnnStart).count();
    float totalFps = 1000.0 / std::chrono::duration_cast<std::chrono::milliseconds>(totalEnd - totalStart).count();

    Draw::draw(image, inferenceFps, totalFps);

    return 0;
}

IMPORTANT EDIT: for testing I was using a 32-bit build of OpenCV (because the final embedded system is 32-bit). I have now tried compiling OpenCV for 64-bit, and inference goes from 11 fps (32-bit) to 32 fps (64-bit), which is about 7-8 fps more than the Python version. I will look further into that performance difference.
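
Since the EDIT above points at 32-bit versus 64-bit builds, it may be worth confirming which of the two the Python test actually ran as. A minimal sketch using only the standard library; the helper name interpreter_bits is made up for illustration, and it reports only the interpreter's pointer size, not how OpenCV itself was compiled:

```python
import platform
import struct


def interpreter_bits() -> int:
    """Pointer size of the running Python interpreter, in bits."""
    # "P" is the struct format code for a native pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(f"Python runs as a {interpreter_bits()}-bit process on {platform.machine()}")
```

If this reports 64 while the C++ test linked a 32-bit OpenCV, the two measurements were not comparing like with like.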

stephanecharette commented 3 years ago

A brief look at the code suggests you're not using Darknet but the OpenCV dnn implementation. We have no idea how that is implemented; you'd have to ask the OpenCV folks. But since the Python interface calls the C++ interface under the covers, it is impossible that the Python side is faster. If you find that it is, then there must be something different about the calls you're making in C++. But again, you'd have to ask the OpenCV folks.
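
One thing that can differ between two such tests is the measurement itself. Both scripts above time a single detect call, and the first call to a network tends to include one-time setup, so a single sample can mislead. A minimal sketch of a fairer timing loop, assuming nothing about OpenCV: benchmark is a hypothetical helper, and the workload here is a stand-in for the real detect call.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call fn() `warmup` times untimed, then time `runs` further calls.

    Returns (median_seconds, fps_derived_from_median).
    """
    for _ in range(warmup):
        fn()  # discard: may include allocation and other one-time setup
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    median = statistics.median(samples)
    fps = 1.0 / median if median > 0 else float("inf")
    return median, fps


if __name__ == "__main__":
    # Stand-in workload. In the Python script above this would be something like
    # lambda: model.detect(IMAGE, CONFIDENCE_THRESHOLD, NMS_THRESHOLD)
    median, fps = benchmark(lambda: sum(i * i for i in range(50_000)))
    print(f"median {median * 1000:.2f} ms, about {fps:.1f} fps")
```

The same idea carries over to the C++ side: run a few untimed detections first, then average many timed ones, and measure in microseconds rather than whole milliseconds so that truncation does not distort the fps figure.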

juacamgo commented 3 years ago

Hello @stephanecharette, first of all thank you for the answer.

As pointed out in my edit, I found that using a 64-bit build of OpenCV "solves" the problem: the C++ code is then faster than the Python code.

Having found this, I've asked the OpenCV project about the performance mismatch between 32-bit and 64-bit builds.

Can I ask you about questions 2) and 3)?

I will now try the Darknet implementation. Thank you!

DylanStouls commented 3 years ago

Hi @juacamgo

I assume that you are new to ML, so I apologize if my answer is « too obvious »

You can modify the network to be both faster and more accurate; it depends on what you want to detect. If you create your own config file for an easy task by following the advice in the readme.md file, the result could be a drop in performance. A « bigger » DNN isn’t always better. Detecting a barcode in a constrained environment looks like an easy task. Maybe you could remove some filters or reduce the input to 320x320.
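
As a rough guide to what reducing the input from 416x416 to 320x320 might buy: in a convolutional network the amount of arithmetic grows roughly with the number of input pixels. A small sketch of that estimate, assuming the cost is dominated by the convolutions; relative_conv_cost is a hypothetical helper, and real speed-ups are usually somewhat smaller because of fixed per-frame overhead.

```python
def relative_conv_cost(new_size: int, old_size: int) -> float:
    """Ratio of input pixels (new / old) for square network inputs.

    For a fully convolutional network, multiply-accumulate work scales
    roughly in proportion to this ratio.
    """
    return (new_size * new_size) / (old_size * old_size)


if __name__ == "__main__":
    ratio = relative_conv_cost(320, 416)
    print(f"320x320 needs about {ratio:.0%} of the work of 416x416")
    print(f"ideal speed-up: about {1 / ratio:.2f}x")
```

Both 416 and 320 are multiples of 32, which is what the network resolution needs to be in Darknet.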

It’s hard to give you advice without your .cfg file, some examples of the desired output, and the graph of your training.

juacamgo commented 3 years ago

Hi @DylanStouls

Yes, I am new to ML and I am exploring this world for the first time, so any tip or advice will be welcome!

When you said "the result could lead to a performance drop", did you mean that I can improve speed by sacrificing some accuracy?

Reducing the input to 320x320 improves speed from 28-32 fps to 45 fps. I also thought that removing some convolutional filters could improve speed, but if I modify the cfg file I need to train again, right?

I've attached my .cfg file. I will read the readme.md carefully, modify my .cfg file, and test performance and accuracy.

Many thanks!

barcode-yolov4-tiny-detector.zip

DylanStouls commented 3 years ago

When you said "the result could lead to a performance drop", did you mean that I can improve speed by sacrificing some accuracy?

I mean that, depending on the use case, a bigger neural network could make your solution less accurate as well as slower. The loss of accuracy is called "overfitting". You can find a lot of information about it online.

Reducing the input to 320x320 improves speed from 28-32 fps to 45 fps. I also thought that removing some convolutional filters could improve speed, but if I modify the cfg file I need to train again, right?

You are right, you will have to train again after removing convolutional filters.

I've attached my .cfg file. I will read the readme.md carefully, modify my .cfg file, and test performance and accuracy.

There are many tips and explanations in the readme.md. For example, it says that the max_batches parameter shouldn't be less than 6000.
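
To make that kind of tip concrete, here is a small sketch of the custom-training rules as I read them in the readme.md: max_batches is classes*2000 but not less than the number of training images and not less than 6000, steps are 80% and 90% of max_batches, and the [convolutional] layer before each [yolo] layer uses filters=(classes+5)*3. The helper name and the image count below are made up for illustration; check the readme.md itself before relying on the numbers.

```python
def darknet_training_params(num_classes: int, num_train_images: int) -> dict:
    """Derive a few cfg values from the class count and training-set size."""
    max_batches = max(num_classes * 2000, num_train_images, 6000)
    return {
        "max_batches": max_batches,
        # Learning-rate steps at 80% and 90% of max_batches (integer arithmetic).
        "steps": (max_batches * 8 // 10, max_batches * 9 // 10),
        # filters value for the [convolutional] layer before each [yolo] layer.
        "filters_before_yolo": (num_classes + 5) * 3,
    }


if __name__ == "__main__":
    # One class ("barcode"); 1500 training images is a hypothetical figure.
    for key, value in darknet_training_params(1, 1500).items():
        print(f"{key} = {value}")
```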

Moreover, RGB images are sometimes unnecessary. By changing the channels parameter to 1, you can train a greyscale DNN. A greyscale DNN generally uses fewer filters, and in some cases it gives better accuracy. For a barcode detector, I think it's worth a try.
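
For reference, the input size and channel count live in the [net] section of the .cfg file. A minimal sketch of changing them from a script, assuming a plain key=value layout; set_net_options is a hypothetical helper and SAMPLE is a shortened stand-in, not the attached cfg. The network has to be retrained after such a change.

```python
def set_net_options(cfg_text: str, **options) -> str:
    """Return cfg_text with the given keys replaced inside the [net] section.

    Keys that do not already appear in [net] are left out rather than added.
    """
    lines, in_net = [], False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            in_net = stripped == "[net]"  # track which section we are in
        elif in_net and "=" in stripped and not stripped.startswith("#"):
            key = stripped.split("=", 1)[0].strip()
            if key in options:
                line = f"{key}={options[key]}"
        lines.append(line)
    return "\n".join(lines)


# Shortened, hypothetical excerpt of a Darknet cfg file.
SAMPLE = """[net]
batch=64
width=416
height=416
channels=3

[convolutional]
filters=32
"""

if __name__ == "__main__":
    print(set_net_options(SAMPLE, width=320, height=320, channels=1))
```

Only the [net] section is touched here; reducing filters in the [convolutional] layers would be a separate, manual edit.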

juacamgo commented 3 years ago

Hi @DylanStouls

Thank you for your clear explanations!

I've played a bit with the net: I removed some convolutional filters and trained at 320x320 with 1 channel (greyscale), and now the speed is about 60 fps.

To my surprise, as you said, the mAP now seems to be higher than it was with the bigger net, and the net still detects the barcodes very precisely.

I will now try to compile for 32-bit ARM systems and check performance on the embedded device.

Thank you very much!

AlexeyAB / darknet

How to improve speed on CPU? C++ code slower than python [YOLOv4-tiny] #7478