AlexeyAB / darknet

YOLOv4 / Scaled-YOLOv4 / YOLO - Neural Networks for Object Detection (Windows and Linux version of Darknet)
http://pjreddie.com/darknet/

Dlib Face Recognition + Yolov3 tracker #1030

Open dexception opened 6 years ago

dexception commented 6 years ago

@AlexeyAB

I am trying to combine the object tracking code you posted in the issue below with the Dlib face recognition code. I have already trained a face detector in YOLO, and it gives about 16 FPS with object tracking: https://github.com/AlexeyAB/darknet/issues/907

Dlib face recognition is slow primarily because of the face detector model it ships with. I want to replace that detector with YOLOv3 plus object tracking.

So, in the draw_boxes method inside the yolo_console.cpp file, I am trying to add the face recognition code from Dlib:

static dlib::rectangle openCVRectToDlib(cv::Rect r)
{
    return dlib::rectangle((long)r.tl().x, (long)r.tl().y, (long)r.br().x - 1, (long)r.br().y - 1);
}

void draw_boxes(cv::Mat mat_img, std::vector<bbox_t> result_vec, std::vector<std::string> obj_names, 
    int current_det_fps = -1, int current_cap_fps = -1)
{
    std::vector<matrix<rgb_pixel>> unknown_faces;
    std::vector<matrix<float, 0, 1>> unknown_face_descriptors;   // filled later by the face recognition net
    cv_image<bgr_pixel> dlib_cv_img(mat_img);                    // wrap the OpenCV frame for dlib
    matrix<rgb_pixel> dlib_matrix_img;
    assign_image(dlib_matrix_img, dlib_cv_img);

    int const colors[6][3] = { { 1,0,1 },{ 0,0,1 },{ 0,1,1 },{ 0,1,0 },{ 1,1,0 },{ 1,0,0 } };

    for (auto &i : result_vec)
    {
        cv::Scalar color = obj_id_to_color(i.obj_id);

        cv::Rect r(i.x, i.y, i.w, i.h);
        cv::rectangle(mat_img, r, color, 2);

        dlib::rectangle face = openCVRectToDlib(r);
        auto shape = sp(dlib_matrix_img, face);       // sp: dlib::shape_predictor loaded elsewhere
        matrix<rgb_pixel> face_chip;
        extract_image_chip(dlib_matrix_img, get_face_chip_details(shape, 150, 0.25), face_chip);
        unknown_faces.push_back(std::move(face_chip));

        if (obj_names.size() > i.obj_id)
        {
            std::string obj_name = obj_names[i.obj_id];
            if (i.track_id > 0) obj_name += " - " + std::to_string(i.track_id);
            cv::Size const text_size = getTextSize(obj_name, cv::FONT_HERSHEY_COMPLEX_SMALL, 1.2, 2, 0);
            int const max_width = (text_size.width > i.w + 2) ? text_size.width : (i.w + 2);
            cv::rectangle(mat_img, cv::Point2f(std::max((int)i.x - 1, 0), std::max((int)i.y - 30, 0)), 
                cv::Point2f(std::min((int)i.x + max_width, mat_img.cols-1), std::min((int)i.y, mat_img.rows-1)), 
                color, CV_FILLED, 8, 0);
            putText(mat_img, obj_name, cv::Point2f(i.x, i.y - 10), cv::FONT_HERSHEY_COMPLEX_SMALL, 1.2, cv::Scalar(0, 0, 0), 2);
        }
    }
    if (current_det_fps >= 0 && current_cap_fps >= 0) {
        std::string fps_str = "FPS detection: " + std::to_string(current_det_fps) + "   FPS capture: " + std::to_string(current_cap_fps);
        putText(mat_img, fps_str, cv::Point2f(10, 20), cv::FONT_HERSHEY_COMPLEX_SMALL, 1.2, cv::Scalar(50, 255, 0), 2);
    }
}
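
The descriptor comparison itself is not shown above; a minimal sketch of how it is typically done with Dlib's face recognition ResNet (assuming a net object of Dlib's anet_type and a known_face_descriptors gallery are loaded elsewhere - those names are placeholders):

// compute 128D descriptors for the cropped faces and compare them against known faces
std::vector<matrix<float, 0, 1>> descriptors = net(unknown_faces, 16);

for (size_t k = 0; k < descriptors.size(); ++k)
{
    int best_index = -1;
    double best_dist = 0.6;   // Dlib's usual matching threshold
    for (size_t j = 0; j < known_face_descriptors.size(); ++j)
    {
        double d = length(descriptors[k] - known_face_descriptors[j]);
        if (d < best_dist) { best_dist = d; best_index = (int)j; }
    }
    // best_index stays -1 if the face did not match anyone in the gallery
}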

The whole objective is to have a face recognition system capable of processing 25 FPS. I am just wondering whether the approach I am taking is good enough; I would like your opinion on this.

Thanks !

AlexeyAB commented 6 years ago

Fast object tracking is usually used when your detector can't achieve real-time FPS: detection is applied only every N frames (N depends on GPU performance), and tracking alone moves the detected bounding boxes on every frame.

dexception commented 6 years ago

@AlexeyAB I want to do it for N frames, i.e. for as long as we can track the face. The Dlib code for face recognition I am using calculates the distance between two descriptor vectors inside 2 for loops, so I am getting 2 FPS. I have compiled Dlib with CUDA. This is with the Dlib face detector + Dlib landmark + distance comparison.

I have put the face recognition code inside the draw_boxes method. With your tracking code I am getting 6-16 FPS with the YOLOv3 face detector + Dlib landmark + distance comparison.

The draw_boxes method is taking 40 ms per frame (conversion of cv::Rect to dlib::rectangle + Dlib landmark + distance comparison).

But that is still short of 25 FPS.

I don't understand your last point.

Do you have a personal ID where I can send the entire code?

AlexeyAB commented 6 years ago

The draw_boxes method is taking 40 ms per frame (conversion of cv::Rect to dlib::rectangle + Dlib landmark + distance comparison).

But that is still short of 25 FPS.

I don't understand your last point.

I.e. Capturing + Tracking + draw_boxes() + Saving_video runs for every frame: if the video stream from the IP camera is 30 FPS, then these functions will be launched 30 times per second.

Detection (Yolo) is launched only on every Nth frame, where N depends on GPU performance. I.e. if the video stream from the IP camera is 30 FPS but your GPU can process only 5 FPS, then Detection (Yolo) will be launched only on every 6th frame.

So, if you get 6-16 FPS but you want 25 FPS, you can either buy a new GPU or just run face recognition only on every 6th frame (for example).
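
A rough sketch of the frame-skipping idea (recognition_interval and recognize_faces() are placeholders, not functions from this repo):

// run the expensive face recognition only on every Nth frame;
// capture, tracking and drawing still run on every frame
const int recognition_interval = 6;   // placeholder - tune to your GPU/CPU
static int frame_counter = 0;

if (frame_counter % recognition_interval == 0)
{
    recognize_faces(cur_frame, result_vec);   // hypothetical helper holding your Dlib code
}
++frame_counter;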

You can attach the code here, or post a URL to Google Drive.

dexception commented 6 years ago

code: version2 https://drive.google.com/open?id=1oj8-GdJSSqF32A-nZbOQn-cMr3FHJOPY

version3 https://drive.google.com/file/d/15IDALtHAx2wAuu3RY1AinvSznLibBzVb

I am having a few problems editing the struct bbox_t.

So I created a new one:

struct bbox_face_map {
    int matchingBestIndex;   // index of the best-matching known face, -1 if none
    float x;
    float y;
};

The FPS varies a lot and is not smooth enough to be called stable.

dexception commented 6 years ago

@AlexeyAB I have not been able to modify struct bbox_t - the code crashes. I think there is more to it.

AlexeyAB commented 6 years ago

I have not been able to modify struct bbox_t - the code crashes. I think there is more to it.

After changing struct bbox_t you should recompile yolo_cpp_dll.sln (the DLL/SO library) and then recompile your own software.

So just try to move most of your own code from draw_boxes() to this place - before this line: https://github.com/AlexeyAB/darknet/blob/eff487ba3626a39e135d13929117e04bc4cf5823/src/yolo_console_dll.cpp#L360

Add a field int face_id to struct bbox_t and recompile the DLL library. Then set face_id = matchingBestIndex; for each corresponding face, or set face_id = -1 if the face isn't recognized.

Then in draw_boxes() just print the face-id number for each found face - add this after this line: https://github.com/AlexeyAB/darknet/blob/eff487ba3626a39e135d13929117e04bc4cf5823/src/yolo_console_dll.cpp#L191

obj_name += " - " + std::to_string(i.face_id);
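
For illustration, the extended struct could look roughly like this (a sketch - the exact field list depends on your version of yolo_v2_class.hpp; only face_id is new):

struct bbox_t {
    unsigned int x, y, w, h;    // top-left corner, width and height of the box
    float prob;                 // detection confidence
    unsigned int obj_id;        // class id
    unsigned int track_id;      // tracking id (0 - not tracked yet)
    int face_id;                // new field: index of the recognized face, -1 if unknown
};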

dexception commented 6 years ago

@AlexeyAB

After adding the code at

https://github.com/AlexeyAB/darknet/blob/eff487ba3626a39e135d13929117e04bc4cf5823/src/yolo_console_dll.cpp#L360

the matchingBestIndex is still -1 inside the draw_boxes method. But if I add the code below at

https://github.com/AlexeyAB/darknet/blob/eff487ba3626a39e135d13929117e04bc4cf5823/src/yolo_console_dll.cpp#L373

the matchingBestIndex gets updated.

Is it possible to call the doFaceRecognition method only when a new track_id is first generated, rather than at every detection? And if doFaceRecognition has already set a matchingBestIndex, can it skip updating it?

for (auto &i : result_vec)
{
    int _matchingBestIndex = doFaceRecognition(cur_frame, i);
    if (_matchingBestIndex != -1)
    {
        i.matchingBestIndex = _matchingBestIndex;
    }
}

void draw_boxes(cv::Mat mat_img, std::vector<bbox_t> result_vec, std::vector<std::string> obj_names,
    int current_det_fps = -1, int current_cap_fps = -1)
{
    int const colors[6][3] = { { 1,0,1 },{ 0,0,1 },{ 0,1,1 },{ 0,1,0 },{ 1,1,0 },{ 1,0,0 } };

    for (auto &i : result_vec)
    {
        cv::Scalar color = obj_id_to_color(i.obj_id);

        cv::Rect r(i.x, i.y, i.w, i.h);
        cv::rectangle(mat_img, r, color, 2);

        if (obj_names.size() > i.obj_id)
        {
            std::string obj_name = obj_names[i.obj_id];
            if (i.track_id > 0)
            {
                if (i.matchingBestIndex > 0)
                {
                    obj_name = file_list[i.matchingBestIndex];
                }
                else
                {
                    obj_name += " - " + std::to_string(i.track_id);
                }
            }

            cv::Size const text_size = getTextSize(obj_name, cv::FONT_HERSHEY_COMPLEX_SMALL, 1.2, 2, 0);

            int const max_width = (text_size.width > i.w + 2) ? text_size.width : (i.w + 2);
            cv::rectangle(mat_img, cv::Point2f(std::max((int)i.x - 1, 0), std::max((int)i.y - 30, 0)),
                cv::Point2f(std::min((int)i.x + max_width, mat_img.cols - 1), std::min((int)i.y, mat_img.rows - 1)),
                color, CV_FILLED, 8, 0);
            putText(mat_img, obj_name, cv::Point2f(i.x, i.y - 10), cv::FONT_HERSHEY_COMPLEX_SMALL, 1.2, cv::Scalar(0, 0, 0), 2);
        }
    }
    if (current_det_fps >= 0 && current_cap_fps >= 0) {
        std::string fps_str = "FPS detection: " + std::to_string(current_det_fps) + "   FPS capture: " + std::to_string(current_cap_fps);
        putText(mat_img, fps_str, cv::Point2f(10, 20), cv::FONT_HERSHEY_COMPLEX_SMALL, 1.2, cv::Scalar(50, 255, 0), 2);
    }
}

AlexeyAB commented 6 years ago

Is it possible to call the doFaceRecognition method only when a new track_id is first generated, rather than at every detection? And if doFaceRecognition has already set a matchingBestIndex, can it skip updating it?

Yes, just use:

for (auto &i : result_vec)
{
    // if it isn't tracked yet
    if (i.track_id == 0)
    {
        int _matchingBestIndex = doFaceRecognition(cur_frame, i);
        if (_matchingBestIndex != -1)
        {
            i.matchingBestIndex = _matchingBestIndex;
        }
    }
}

dexception commented 6 years ago

@AlexeyAB Thanks !

The values are not getting updated after adding the code at this location. https://github.com/AlexeyAB/darknet/blob/eff487ba3626a39e135d13929117e04bc4cf5823/src/yolo_console_dll.cpp#L360

dexception commented 6 years ago

@AlexeyAB

Can you please share the following?

After how many frames are you detecting objects?
After how many frames are you updating the tracker?
Would a correlation tracker be better suited for this?

Thanks.

AlexeyAB commented 6 years ago

@dexception

After how many frames are you detecting objects?

For yolov3.cfg - yolo_console_dll.exe data/coco.names yolov3.cfg yolov3.weights test.mp4 - on a GeForce GTX 970 GPU, the Yolo detector runs on every 6th frame: 49 FPS capture / 8 FPS detection ≈ 6.



After how many frames are you updating the tracker?

The tracker runs on every frame.

After the detector has given us results, we update the tracker asynchronously, catching up to the current frame (we feed the tracker all the saved frames starting from the frame where detection began). Then we just track each new frame until the detector gives us new results.
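
A rough sketch of that scheme (pseudocode-style C++; capture_frame(), detection_ready(), detection_result(), start_detection() and track_on_frame() are placeholders, not the actual functions in yolo_console_dll.cpp):

// detection runs asynchronously, tracking runs on every captured frame
std::queue<cv::Mat> saved_frames;        // frames captured since the last detection started
std::vector<bbox_t> boxes;               // current boxes, moved forward by the tracker

while (true)
{
    cv::Mat frame = capture_frame();
    saved_frames.push(frame);

    if (detection_ready())               // async Yolo result is available
    {
        boxes = detection_result();
        // catch up: replay tracking over every frame saved since detection began
        while (!saved_frames.empty())
        {
            boxes = track_on_frame(saved_frames.front(), boxes);
            saved_frames.pop();
        }
        start_detection(frame);          // kick off the next async detection
    }
    else
    {
        boxes = track_on_frame(frame, boxes);   // cheap per-frame tracking only
    }

    draw_boxes(frame, boxes, obj_names);
}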


Would a correlation tracker be better suited for this?

What tracker do you want to use?

The current OpenCV optical-flow tracker is a very fast multi-object tracker - up to ~1000 FPS on a GTX 970.

dexception commented 6 years ago

@AlexeyAB I am using the Dlib correlation tracker + Dlib CNN face detector + Dlib face landmarks. Currently I do face recognition and add new trackers for new objects every 10th frame, and update the trackers every 4th frame. The trackers are drawn on every frame.

The face detector I trained with YOLOv3 and combined with the Dlib landmarks didn't work. It was fast, but the problem is that Dlib expects the face box a little lower, near the eyebrows, while I trained my face detector on the CelebA dataset. So the landmark distance comparison was giving a lot of false positives.

AlexeyAB commented 6 years ago

@dexception As I see it, this is a single-object tracker: void start_track(const image_type& img, const drectangle& p);? http://dlib.net/dlib/image_processing/correlation_tracker_abstract.h.html Do you track only one face? And how fast does it work?

Currently I do face recognition and add new trackers for new objects every 10th frame, and update the trackers every 4th frame.

Did you hardcode these values (10, 4), or do they depend on GPU/algorithm performance?

dexception commented 6 years ago

@AlexeyAB I have modified the code to track multiple faces:

std::vector<dlib::correlation_tracker> trackers;

Yes, I hardcoded these values for testing, but they will change according to the FPS I am getting.

I am getting 25 FPS with 1 GB of GPU memory, and so far more accuracy than with my earlier model.
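
Roughly, the multi-face part looks like this (a sketch using the documented dlib::correlation_tracker calls, not my full code; the face rectangles are assumed to come from the detector):

#include <opencv2/core.hpp>
#include <dlib/image_processing.h>
#include <dlib/opencv.h>

std::vector<dlib::correlation_tracker> trackers;

// start a new tracker for a detected face (cv::Rect r) that no existing tracker covers yet
void add_tracker(const dlib::cv_image<dlib::bgr_pixel> &img, const cv::Rect &r)
{
    dlib::correlation_tracker t;
    t.start_track(img, dlib::rectangle(r.x, r.y, r.x + r.width - 1, r.y + r.height - 1));
    trackers.push_back(t);
}

// on every frame, advance all trackers and read back their positions
void update_trackers(const dlib::cv_image<dlib::bgr_pixel> &img)
{
    for (auto &t : trackers)
    {
        t.update(img);                             // returns a confidence score for this frame
        dlib::drectangle pos = t.get_position();   // current tracked rectangle, used for drawing
    }
}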

dexception commented 6 years ago

@AlexeyAB Have you compared different posing algorithms?

AlexeyAB commented 6 years ago

@dexception What do you mean by posing algorithms?

I only compared detection algorithms. I didn't spend a long time searching for tracking algorithms; I just compared several approaches from the main OpenCV repo: features + findHomography/estimateRigid, CamShift/MeanShift, Phase Correlation, Optical Flow, ... and Optical Flow is a good and fast multi-object tracker.

deimsdeutsch commented 6 years ago

@AlexeyAB Which method decides whether new trackers are needed and which trackers need to be erased? I have not been able to find it in the code.

aimhabo commented 6 years ago

@AlexeyAB You can try cv::Ptr<cv::Tracker> tracker = cv::TrackerTLD::create(/* "TLD" */); from #include <opencv2/tracking.hpp>, but the problem is that the TLD algorithm is completely encapsulated in OpenCV and already includes its own target detection. My previous attempt at a cpp-ization of darknet failed, so I don't know how to use yolov3 + TLD.

On the other hand, deep_sort is a good algorithm too, but it also has a problem: deep_sort uses a private deep-learning feature (the author said it was generated from the MOT dataset, and only an incomplete network design is given in the paper). Although it does a good job of following targets, the matching for missing targets has a high rate of false positives (though it is better than TLD and KCF).
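
A minimal sketch of that TLD usage (OpenCV 3.x contrib tracking API; the initial bounding box would come from the Yolo detection):

#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/tracking.hpp>   // opencv_contrib tracking module

// track one detected object with TLD across a sequence of frames (OpenCV 3.x API)
void track_with_tld(const std::vector<cv::Mat> &frames, cv::Rect2d detected_box)
{
    cv::Ptr<cv::Tracker> tracker = cv::TrackerTLD::create();
    tracker->init(frames[0], detected_box);

    for (size_t k = 1; k < frames.size(); ++k)
    {
        bool ok = tracker->update(frames[k], detected_box);   // box is updated in place
        if (!ok)
            break;   // target lost - fall back to re-detection with Yolo
    }
}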

dexception commented 6 years ago

@TaihuLight The code has been modified heavily by now. What exactly are you looking for?

dexception commented 6 years ago

@AlexeyAB I just saw one of the trackers, with face id 3, shift from one person to another person's face. I think there is some issue with the trackers. I can confirm this as a bug.

AlexeyAB commented 6 years ago

@TaihuLight

Is it stable and does it work fast for 10 - 100 objects?

AlexeyAB commented 6 years ago

@dexception

I just saw one of the trackers, with face id 3, shift from one person to another person's face. I think there is some issue with the trackers. I can confirm this as a bug.

Is it a bug, or just low tracking precision?

dexception commented 6 years ago

@AlexeyAB There are 2 issues I have noticed.

  1. Even after a face disappears from the frame, when someone else comes in from the opposite direction the same face id is given to that person, which is going to produce false counts.
  2. When 2 faces overlap, sometimes the face id gets swapped.

I would rate this as a certain bug.

The correlation tracker from DLIB does not suffer from these issues.

    //calculate the center position of the detected object
    int x_bar = x + 0.5 * w;
    int y_bar = y + 0.5 * h;

    //assume a new tracker is needed until an existing tracker is found that covers this object
    isNewTrackerNeeded = 1;

    for (int i = 0; i != trackers.size(); i++)
    {
        drectangle tracked_position = trackers[i].get_position();

        int t_x = (int)tracked_position.left();
        int t_y = (int)tracked_position.top();
        int t_w = (int)tracked_position.width();
        int t_h = (int)tracked_position.height();

        //calculate the center position of the tracked object
        int t_x_bar = t_x + 0.5 * t_w;
        int t_y_bar = t_y + 0.5 * t_h;

        //the detection center lies inside the tracked box and the tracked center inside the detection box
        if ((t_x <= x_bar && x_bar <= (t_x + t_w)) && (t_y <= y_bar && y_bar <= (t_y + t_h)) &&
            (x <= t_x_bar && t_x_bar <= (x + w)) && (y <= t_y_bar && t_y_bar <= (y + h)))
        {
            //an existing tracker already covers this object, no new tracker needed
            isNewTrackerNeeded = 0;
            break;
        }
    }

For erasing trackers:

        //inside the per-tracker update loop: flag low-confidence trackers for removal
        double tracking_quality = trackers[i].update(*dlibImageGray);

        if (tracking_quality < 7)
        {
            delete_list.push_back(i);
        }
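
And to actually remove the flagged trackers afterwards, a sketch (assuming delete_list is a std::vector<int>; erasing in reverse order keeps the remaining indices valid):

    //erase flagged trackers from back to front so earlier indices are not shifted
    for (auto it = delete_list.rbegin(); it != delete_list.rend(); ++it)
    {
        trackers.erase(trackers.begin() + *it);
    }
    delete_list.clear();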

AlexeyAB commented 6 years ago

@dexception

  1. Maybe it can be solved by changing these 2 lines: https://github.com/AlexeyAB/darknet/blob/b847f39f60eb6715325f3707e78667a0611811dd/src/yolo_v2_class.hpp#L73-L74 to these:

    YOLODLL_API std::vector<bbox_t> tracking_id(std::vector<bbox_t> cur_bbox_vec,
        bool const change_history = true, int const frames_story = 1, int const max_dist = 30);

  2. I think cases with occlusions can't be solved by using Optical Flow, so other trackers should be used there: correlation, TLD, KCF, ...


Maybe later I will add the correlation_tracker from Dlib to my example: http://dlib.net/dlib/image_processing/correlation_tracker_abstract.h.html

Or cv::TrackerTLD from OpenCV-contrib: https://github.com/opencv/opencv_contrib/blob/master/modules/tracking/include/opencv2/tracking/tracker.hpp#L1181 https://github.com/opencv/opencv_contrib/blob/master/modules/tracking/src/multiTracker.hpp#L52-L53

TaihuLight commented 6 years ago

@dexception Your code below is private, so is your permission needed for download? I have sent an access request. version2 https://drive.google.com/open?id=1oj8-GdJSSqF32A-nZbOQn-cMr3FHJOPY version3 https://drive.google.com/file/d/15IDALtHAx2wAuu3RY1AinvSznLibBzVb @AlexeyAB I can't get the code for object tracking, and I don't understand how to run it.

uday60 commented 6 years ago

Dear Alex,

I just tested the face detection and your tracker on a live crowd using a Hikvision PTZ camera. It went so badly that I had to switch to stream 3, and even configured at low resolution with 8 FPS it still crashed because of the delay. It looks like I would need a huge GPU to do real-time HD video, and 4K face recognition at 60 meters is totally impractical - the cost is going to be huge.

I had trained my model to detect faces, and it was detecting them, but the FPS was very low on live video.

dexception commented 6 years ago

This should help:

https://github.com/rafaelpadilla/Object-Detection-Metrics

AlexeyAB commented 6 years ago

@uday60

dexception commented 6 years ago

@AlexeyAB I have been wondering about this for a while now. I need your expert advice on it.

This guy is doing object detection on integers. What algorithm is this? The only thing that comes close that I can think of is LBPH (Local Binary Patterns with Histograms) in OpenCV. But it looks far more optimized: while I was running his demo it was using 12 threads, yet when I played a normal video in OpenCV it was consuming 30 threads. So is it even OpenCV, or some other computer vision library?

https://github.com/ShiqiYu/libfacedetection

AlexeyAB commented 6 years ago

@dexception

This guy is doing object detection on integers.

How did you find out that it uses integers?

I don't know what this algorithm is; it isn't published. It isn't a neural network, because it requires a sliding window (params: scale, window size). Yes, it is very optimized.

It is from Shenzhen University, by Shiqi Yu - it isn't very accurate, but it is very fast: 1500 FPS on a multicore CPU: http://vis-www.cs.umass.edu/fddb/results.html#rocunpub

dexception commented 6 years ago

@AlexeyAB I installed GPU-Z on the system, ran his code and looked at the results. LBP works on integers; it is not the most accurate, but it is the fastest object detection in the world on a CPU.

I ran the demo on grayscale images and it was decent enough to detect frontal faces. I will train it on my own faces and come back with results. This is truly scalable: you can run 10 of these on a normal i7 laptop without a GPU. Imagine how many streams you could handle on an 80-core server.

https://github.com/opencv/opencv/tree/master/data/lbpcascades
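
For reference, running one of these stock LBP cascades looks roughly like this (standard cv::CascadeClassifier API; the cascade file path depends on your OpenCV install):

#include <vector>
#include <opencv2/objdetect.hpp>
#include <opencv2/imgproc.hpp>

// detect frontal faces on a BGR frame with the stock LBP cascade
std::vector<cv::Rect> detect_faces_lbp(const cv::Mat &frame)
{
    static cv::CascadeClassifier cascade("lbpcascade_frontalface_improved.xml");  // path depends on your install

    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);

    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces, 1.1, 3, 0, cv::Size(40, 40));
    return faces;
}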

AlexeyAB commented 6 years ago

@dexception

It seems this is the optimal algorithm for this particular task - face detection. Did you find any face recognition algorithm that is faster or more accurate than Dlib?

You can run 10 of these on a normal i7 laptop without a GPU. Imagine how many streams you could handle on an 80-core server.

https://github.com/opencv/opencv/tree/master/data/lbpcascades

Do you mean that this library https://github.com/ShiqiYu/libfacedetection uses the default LBP models from OpenCV https://github.com/opencv/opencv/tree/master/data/lbpcascades ?

dexception commented 6 years ago

@AlexeyAB
The Chinese Whispers algorithm is faster than the default code I shared earlier.

This is the default implementation of LBP in OpenCV: https://github.com/opencv/opencv/tree/master/data/lbpcascades

No, he does not use the default version: https://github.com/ShiqiYu/libfacedetection

Isha8 commented 6 years ago

@dexception I am trying to do a similar thing for barcode recognition instead of faces, but I am having issues linking that library (ZBar) to darknet. Could you please tell me how you linked Dlib to darknet? Thanks

dexception commented 6 years ago

@Isha8

Convert the OpenCV rectangle to a Dlib rectangle:

static dlib::rectangle openCVRectToDlib(cv::Rect r)
{
    return dlib::rectangle((long)r.tl().x, (long)r.tl().y, (long)r.br().x - 1, (long)r.br().y - 1);
}

@AlexeyAB This is a faster queue implementation than the one currently used in the tracker: https://github.com/cameron314/concurrentqueue

Isha8 commented 6 years ago

@dexception I meant how you were able to use the Dlib resources within YOLO. I am not able to build darknet - what changes did you make in the Makefile?

Isha8 commented 6 years ago

It works now; I had added the flags to the compiler instead of the linker. Thanks.