ageitgey / face_recognition

The world's simplest facial recognition api for Python and the command line
MIT License
52.8k stars 13.42k forks

batch_face_locations performance using GPU with large datasets #688

Open fottofatto opened 5 years ago

fottofatto commented 5 years ago

Description

I have a bunch of photos for which I want to get the encodings and compute distances against new photos. I am trying to use the face detection feature with the batch functions. Using the cnn model on the GPU, it took 32 seconds to run face detection on 4096 photos, which corresponds to 128 photos per second, and I am sure dlib is using CUDA and the GPU: the nvidia-smi output is below and dlib.DLIB_USE_CUDA outputs True. I have removed the hog parts to simplify the code. The batch size is 32; if I increase it up to 128, there is almost no change. The only thing that changes is GPU memory usage, which grows with the batch size.

If I run multiple instances of this code simultaneously on different photo directories, say five copies (five separate processes in different terminal tabs), the total number of photos detected per second scales up to almost 300.

So my question is: is this the maximum performance I can get (128 photos per second for 240x320 photos with this setup)? How can I reach 300 photos per second while running only one process? In my opinion I should be able to reach that throughput with a single instance, but I don't know how. Which parameters should I change?
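For completeness, the five-copy experiment can also be driven from a single parent script with multiprocessing; this is still five processes, just scripted, so it is a workaround rather than the single-process answer I am looking for. A rough sketch, where the worker count, the imageio.imread stand-in for scipy.misc.imread, and the tuple conversion are my assumptions:

import multiprocessing as mp
import os

def detect_worker(photo_paths):
    # Each worker builds its own detector after it starts, since a CUDA
    # context cannot be shared across a fork.
    import dlib
    import face_recognition_models
    from imageio import imread  # stand-in for the deprecated scipy.misc.imread

    detector = dlib.cnn_face_detection_model_v1(
        face_recognition_models.cnn_face_detector_model_location())
    # All images in one batched call must have the same dimensions (240x320 here).
    images = [imread(p) for p in photo_paths]
    detections = detector(images, 1, batch_size=32)
    # Convert to plain tuples so results can be pickled back to the parent.
    return [[(d.rect.top(), d.rect.right(), d.rect.bottom(), d.rect.left())
             for d in dets] for dets in detections]

if __name__ == '__main__':
    mp.set_start_method('spawn')  # give every worker a fresh CUDA context
    paths = ['/home/photos/' + f for f in os.listdir('/home/photos/')]
    n_workers = 5  # hypothetical; tune to available GPU memory
    chunks = [paths[i::n_workers] for i in range(n_workers)]
    with mp.Pool(n_workers) as pool:
        results = pool.map(detect_worker, chunks)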

What I Did

Here is my code:

import dlib
import numpy as np
from timeit import default_timer as timer
import face_recognition_models
import dlib.cuda as cuda
from scipy.misc import imread  # removed in SciPy >= 1.2; imageio.imread is a drop-in replacement
import os

cuda.set_device(0)
face_detector = dlib.get_frontal_face_detector()

predictor_68_point_model = face_recognition_models.pose_predictor_model_location()
pose_predictor_68_point = dlib.shape_predictor(predictor_68_point_model)

cnn_face_detection_model = face_recognition_models.cnn_face_detector_model_location()
cnn_face_detector = dlib.cnn_face_detection_model_v1(cnn_face_detection_model)

face_recognition_model = face_recognition_models.face_recognition_model_location()
face_encoder = dlib.face_recognition_model_v1(face_recognition_model)

def _rect_to_css(rect):
    return rect.top(), rect.right(), rect.bottom(), rect.left()

def _css_to_rect(css):
    return dlib.rectangle(css[3], css[0], css[1], css[2])

def _trim_css_to_bounds(css, image_shape):
    return max(css[0], 0), min(css[1], image_shape[1]), min(css[2], image_shape[0]), max(css[3], 0)

def face_locations(img, number_of_times_to_upsample=0):
    # The single-image detector call takes no batch_size argument; that
    # keyword only applies to the batched (list-of-images) overload.
    return [_trim_css_to_bounds(_rect_to_css(face.rect), img.shape) for face in cnn_face_detector(img, number_of_times_to_upsample)]

def face_landmarks(face_image, known_face_locations=None):
    # Renamed the parameter: it previously shadowed the face_locations()
    # function above, so calling it when None was passed raised a TypeError.
    if known_face_locations is None:
        face_rects = [_css_to_rect(loc) for loc in face_locations(face_image)]
    else:
        face_rects = [_css_to_rect(loc) for loc in known_face_locations]

    pose_predictor = pose_predictor_68_point
    return [pose_predictor(face_image, face_rect) for face_rect in face_rects]

def face_encodings(face_image, known_face_locations=None, num_jitters=1):
    raw_landmarks = face_landmarks(face_image, known_face_locations)
    return [np.array(face_encoder.compute_face_descriptor(face_image, raw_landmark_set, num_jitters)) for raw_landmark_set in raw_landmarks]

def _raw_face_locations_batched(images, number_of_times_to_upsample=1, batch_size=32):
    x = cnn_face_detector(images, number_of_times_to_upsample, batch_size=batch_size)
    return x

def batch_face_locations(images, number_of_times_to_upsample=1, batch_size=32):
    def convert_cnn_detections_to_css(detections):
        # Assumes every image in the batch has the same shape, since only
        # images[0].shape is used when trimming boxes to image bounds.
        return [_trim_css_to_bounds(_rect_to_css(face.rect), images[0].shape) for face in detections]

    raw_detections_batched = _raw_face_locations_batched(images, number_of_times_to_upsample, batch_size)
    return list(map(convert_cnn_detections_to_css, raw_detections_batched))

files = os.listdir('/home/photos/')
# dlib's batched detector requires every image in the list to have the same
# dimensions; the round trip through np.array below is otherwise redundant.
images = [imread('/home/photos/' + i) for i in files]
images_list = list(np.array(images))

start = timer()
locations = batch_face_locations(images_list, number_of_times_to_upsample=1, batch_size=32)

elapsed_time = timer() - start
print("Face detection took %f seconds " % elapsed_time)

print(dlib.DLIB_USE_CUDA)

Code Output:

Face detection took 31.295752 seconds 
True

nvidia-smi Output:

[two nvidia-smi screenshots attached: nvidia-smi-1, nvidia-smi-2]

GPU memory usage starts at 195 MB, climbs to 2005 MB during detection, and then the process finishes.
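One parameter worth checking: number_of_times_to_upsample=1 makes dlib double each image dimension before detection (four times the pixels), so on 240x320 inputs dropping it to 0 should noticeably raise throughput, at the cost of missing small faces. A minimal A/B timing sketch, reusing images_list, timer, and batch_face_locations from the script above:

# Compare detection throughput with and without upsampling.
for upsample in (0, 1):
    start = timer()
    batch_face_locations(images_list, number_of_times_to_upsample=upsample, batch_size=32)
    print("upsample=%d: %.2f s for %d photos" % (upsample, timer() - start, len(images_list)))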

smiledfox commented 5 years ago

I think you need an SSD.

fottofatto commented 5 years ago

> I think you need an SSD.

Do you think there is a disk bottleneck? Could it be related to dlib? Are you saying that if I get an SSD, I would get the throughput of multiple processes while running only one? Because what I want is maximum utilization. Thanks for your help.

smiledfox commented 5 years ago

Small-file IO is roughly ten times faster on an SSD. You can check iowait with the "top" command; make sure it is not too high.
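A quick way to test the disk-bound hypothesis before buying hardware is to time the reads and the detection separately: if loading dominates, an SSD (or prefetching) will help; if detection dominates, the GPU itself is the limit. A sketch reusing imread, os, and batch_face_locations from the original script:

from timeit import default_timer as timer

# Time disk reads and GPU detection separately to see which dominates.
start = timer()
images = [imread('/home/photos/' + f) for f in os.listdir('/home/photos/')]
load_time = timer() - start

start = timer()
batch_face_locations(images, number_of_times_to_upsample=1, batch_size=32)
detect_time = timer() - start

print("loading: %.2f s, detection: %.2f s" % (load_time, detect_time))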