ageitgey / face_recognition

The world's simplest facial recognition api for Python and the command line
MIT License

Weird error: "RuntimeError: Error while calling cudnnConvolutionForward ... code: 7, reason: A call to cuDNN failed" #992

Open congphase opened 4 years ago

congphase commented 4 years ago

I tried another face detection method, which returns bounding box values as floats. I convert them with int() and feed them to face_recognition.face_encodings() (which I alias as get_face_encodings()). Then I get the error detailed below:

Traceback (most recent call last):
  File "predict_tensorrt_video.py", line 665, in <module>
    main()
  File "predict_tensorrt_video.py", line 88, in inner
    retval = fnc(*args, **kwargs)
  File "predict_tensorrt_video.py", line 659, in main
    run_inference(args.video_in, args.video_out, candidate_id, current_time)
  File "predict_tensorrt_video.py", line 549, in run_inference
    face_encoding = get_face_encodings(frame, css_type_face_location, 0)[0]
  File "/home/gate/.virtualenvs/lffd/lib/python3.6/site-packages/face_recognition/api.py", line 210, in face_encodings
    return [np.array(face_encoder.compute_face_descriptor(face_image, raw_landmark_set, num_jitters)) for raw_landmark_set in raw_landmarks]
  File "/home/gate/.virtualenvs/lffd/lib/python3.6/site-packages/face_recognition/api.py", line 210, in <listcomp>
    return [np.array(face_encoder.compute_face_descriptor(face_image, raw_landmark_set, num_jitters)) for raw_landmark_set in raw_landmarks]
RuntimeError: Error while calling cudnnConvolutionForward( context(), &alpha, descriptor(data), data.device(), (const cudnnFilterDescriptor_t)filter_handle, filters.device(), (const cudnnConvolutionDescriptor_t)conv_handle, (cudnnConvolutionFwdAlgo_t)forward_algo, forward_workspace, forward_workspace_size_in_bytes, &beta, descriptor(output), output.device()) in file /home/gate/dlib-19.17/dlib/cuda/cudnn_dlibapi.cpp:1007.
code: 7, reason: A call to cuDNN failed
cudaStreamDestroy() failed. Reason: invalid device ordinal
cudaFree() failed. Reason: invalid device pointer
cudaFreeHost() failed. Reason: invalid argument
cudaStreamDestroy() failed. Reason: unknown error
cudaFree() failed. Reason: invalid device pointer
cudaFreeHost() failed. Reason: invalid argument
cudaFree() failed. Reason: invalid device pointer
Segmentation fault (core dumped)

Anyone knows how to solve it? I've struggled with it all day. Thank you a lot!!

reddytocode commented 4 years ago

You say you're passing your bounding boxes from another detector, right? Is that detector using the GPU? Can you share the lines showing how you pass your bounding boxes?

congphase commented 4 years ago

@Reddyforcode Thank you for your attention,

I used the detector called LFFD with the sample code at link, and it uses the GPU. I modified it to run inference on my video. It runs detection properly, but when the first detection is passed on for getting the encoding, that error occurs.

Here's how I passed each bbox:

bbox = bboxes[0]
bb_conf = bbox[4]
bb_left = bbox[0]
bb_top = bbox[1]
bb_right = bbox[2]
bb_bottom = bbox[3]  # these are all floats

bb_left = int(bb_left)
bb_top = int(bb_top)
bb_right = int(bb_right)
bb_bottom = int(bb_bottom)  # converted to int to be used in face_recognition

css_type_face_location = [(bb_top, bb_right, bb_bottom, bb_left)]

face_encoding = get_face_encodings(frame, css_type_face_location, 0)[0] # at this line that error occurs
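
(For reference, face_recognition.face_encodings() expects a list of integer (top, right, bottom, left) tuples that lie inside the image. Below is a minimal sketch of the conversion from a float (x1, y1, x2, y2, score) box; the clamping helper bbox_to_css is illustrative only and not part of the original code.)

def bbox_to_css(bbox, frame_shape):
    """Convert a float (x1, y1, x2, y2, score) box to the integer
    (top, right, bottom, left) tuple that face_recognition expects,
    clamped to the frame bounds."""
    h, w = frame_shape[:2]
    left = max(0, int(bbox[0]))
    top = max(0, int(bbox[1]))
    right = min(w - 1, int(bbox[2]))
    bottom = min(h - 1, int(bbox[3]))
    return (top, right, bottom, left)

# usage: css_type_face_location = [bbox_to_css(bboxes[0], frame.shape)]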

I first suspected it was because I run inference on a pre-set smaller region of the original frame (1280x720). This is how I did it:

# the pre-set ROI region
SPREAD_ROI = (358, 277, 784, 568)
# the image cropped with the region
frame_spread_roi = frame[SPREAD_ROI[1]:SPREAD_ROI[3], SPREAD_ROI[0]:SPREAD_ROI[2]]
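
(A side note on the ROI: boxes detected on the crop are in ROI coordinates, so they need to be offset by the ROI origin before being used on the full frame, which is presumably what transform_coordinates does in the full script below. A small sketch of that mapping; roi_to_frame_coords is a hypothetical stand-in, not the actual helper.)

def roi_to_frame_coords(spread_roi, roi_bbox):
    """Map a (left, top, right, bottom) box from crop/ROI coordinates
    back to full-frame coordinates by adding the ROI origin."""
    roi_x1, roi_y1 = spread_roi[0], spread_roi[1]
    left, top, right, bottom = roi_bbox
    return (left + roi_x1, top + roi_y1, right + roi_x1, bottom + roi_y1)

# example with SPREAD_ROI = (358, 277, 784, 568):
# roi_to_frame_coords((358, 277, 784, 568), (100, 50, 180, 150)) -> (458, 327, 538, 427)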

But when I reset that region to the full original frame and ran inference, the same error still occurred. One notable difference: the error output contains everything I mentioned above except these lines:

cudaFree() failed. Reason: invalid device pointer
cudaFreeHost() failed. Reason: invalid argument
cudaStreamDestroy() failed. Reason: unknown error
cudaFree() failed. Reason: invalid device pointer
cudaFreeHost() failed. Reason: invalid argument
cudaFree() failed. Reason: invalid device pointer

(The line "Segmentation fault (core dumped)" is still there). I don't know why.

reddytocode commented 4 years ago

It says "core dumped", which means you are out of memory on your GPU. Try running this on the command line:

nvidia-smi 

It will show your GPU memory usage.
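
(If you want to watch memory continuously while the app runs, one option is to poll nvidia-smi from a small script. This is a sketch assuming nvidia-smi is available; on Jetson boards, where it is not, jtop/tegrastats serves the same purpose, as the poster does below.)

import subprocess
import time

# Poll GPU memory usage once per second while the app is running.
while True:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)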

I suspect you joined face_recognition and your external face detection code the wrong way. Possible errors, without seeing your complete project:

  1. You don't initialize the Inference_TensorRT object just once.
  2. You are not passing the bbox to face_recognition correctly.

congphase commented 4 years ago

@Reddyforcode I monitored memory with jtop from the time the program started until it ended. Memory never reached the maximum; it only consumed around 2.5 GB of 4 GB. GPU utilization periodically peaked at 99% and dropped back to 0%.

By the way, I am using a Jetson Nano, for which Adam Geitgey advises changing one line of dlib's source code before compiling to avoid a bug, mentioned in link (search for 'line 854' for quick reference). The file that was changed is the very file the Python interpreter points to in this weird error. I'm confused, because if the problem were in my dlib build, I shouldn't have been able to run face_recognition.face_encodings() successfully before (which I have). It must be caused by how I pass the bbox to face_encodings(), but I've checked that several times and it doesn't seem to be the problem.

I have re-checked my code against your two suggestions, but nothing seems to work :( Here's my code:

#################################################################
######################### RELEASE NOTES #########################
#################################################################
'''
 - Process order:
    RFID scanned --> This script run --> Verify the scanned RFID number against current face
        This order, compared to face-first-RFID-later or face-only, provides a more balanced approach:
        a compromise between the demands of actual use and the complexity of generalizing the algorithm
 - How to run:
    python DO_FACE_VERIFICATION.py --video_in <> --video_out <> --candidate <>

 - With SPREAD_ROI, detection time drops to around 90 ms/frame, at the cost of detection coverage
 -
'''

import argparse
import sys
#import dlib
import time
import cv2
import os
import numpy

import face_recognition

import api_dirs
import freq_cv
import rtsp_cam
import tnt_info

# from time import gmtime, strftime
from datetime import datetime
from datetime import date

import logging
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

####################################
## WHERE THE DEFINITIONS ARE LAID ##
####################################

# ROI FOR REDUCING COMPUTATION
SPREAD_ROI = rtsp_cam.SPREAD_ROI

# Messages
MSG_NHIN_TRUC_TIEP_VAO_CAM = "MSG_HAY_NHIN_TRUC_TIEP_VAO_CAMERA"
MSG_XAC_THUC_KHUON_MAT_THANH_CONG = "MSG_XAC_THUC_KHUON_MAT_THANH_CONG"
MSG_XAC_THUC_KHUON_MAT_KHONG_THANH_CONG = "MSG_XAC_THUC_KHUON_MAT_KHONG_THANH_CONG"
MSG_KHONG_TIM_THAY_KHUON_MAT_NAO = "MSG_KHONG_TIM_THAY_KHUON_MAT_NAO"
MSG_HAY_QUET_LAI_THE_VA_NHIN_THANG_VAO_CAMERA = "MSG_HAY_QUET_LAI_THE_VA_NHIN_THANG_VAO_CAMERA"

# VARIABLES

# The cnn version of flib face detection
#cnn_detector = dlib.cnn_face_detection_model_v1(api_dirs.face_detection_model_cnn)

# Window names
CAP_WINDOW_NAME = 'CameraDemo'
VERIF_WINDOW_NAME = 'Verification Window'

def parse_args():
    # Parse input arguments
    desc = 'Do face verification on video'
    parser = argparse.ArgumentParser(description=desc)
    parser.add_argument("--video_in", dest='video_in', required=True,
                        help='Video input')
    parser.add_argument("--video_out", dest='video_out', required=True,
                        help='Video output')
    parser.add_argument("--candidate_id", dest='candidate_id', required=True,
                        type=int,
                        help="Candidate index: [1, 2, 3, ...]")
    args = parser.parse_args()
    return args

import cProfile, pstats, io

def profile(fnc):
    """A decorator that uses cProfile to profile a function"""

    def inner(*args, **kwargs):
        pr = cProfile.Profile()
        pr.enable()
        retval = fnc(*args, **kwargs)
        pr.disable()
        s = io.StringIO()
        sortby = 'cumulative'
        ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
        ps.print_stats()
        print(s.getvalue())
        return retval

    return inner

logging.getLogger().setLevel(logging.DEBUG)

def NMS(boxes, overlap_threshold):
    '''

    :param boxes: numpy nx5, n is the number of boxes, 0:4->x1, y1, x2, y2, 4->score
    :param overlap_threshold:
    :return:
    '''
    if boxes.shape[0] == 0:
        return boxes

    # if the bounding boxes are integers, convert them to floats --
    # this is important since we'll be doing a bunch of divisions
    if boxes.dtype != numpy.float32:
        boxes = boxes.astype(numpy.float32)

    # initialize the list of picked indexes
    pick = []
    # grab the coordinates of the bounding boxes
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    sc = boxes[:, 4]
    widths = x2 - x1
    heights = y2 - y1

    # compute the area of the bounding boxes and sort the bounding
    # boxes by their scores
    area = heights * widths
    idxs = numpy.argsort(sc)

    # keep looping while some indexes still remain in the indexes list
    while len(idxs) > 0:
        # grab the last index in the indexes list and add the
        # index value to the list of picked indexes
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)

        # compare the picked box with the remaining (lower-scored) boxes
        xx1 = numpy.maximum(x1[i], x1[idxs[:last]])
        yy1 = numpy.maximum(y1[i], y1[idxs[:last]])
        xx2 = numpy.minimum(x2[i], x2[idxs[:last]])
        yy2 = numpy.minimum(y2[i], y2[idxs[:last]])

        # compute the width and height of the box
        w = numpy.maximum(0, xx2 - xx1 + 1)
        h = numpy.maximum(0, yy2 - yy1 + 1)

        # compute the ratio of overlap
        overlap = (w * h) / area[idxs[:last]]

        # delete all indexes from the index list whose overlap exceeds the threshold
        idxs = numpy.delete(idxs, numpy.concatenate(([last], numpy.where(overlap > overlap_threshold)[0])))

    # return only the bounding boxes that were picked using the
    # integer data type
    return boxes[pick]

# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

class Inference_TensorRT:
    def __init__(self, onnx_file_path,
                 receptive_field_list,
                 receptive_field_stride,
                 bbox_small_list,
                 bbox_large_list,
                 receptive_field_center_start,
                 num_output_scales):

        temp_trt_file = os.path.join('trt_file_cache/', os.path.basename(onnx_file_path).replace('.onnx', '.trt'))

        load_trt_flag = False
        if not os.path.exists(temp_trt_file):
            if not os.path.exists(onnx_file_path):
                logging.error('ONNX file does not exist!')
                sys.exit(1)
            logging.info('Init engine from ONNX file.')
        else:
            load_trt_flag = True
            logging.info('Init engine from serialized engine.')

        self.receptive_field_list = receptive_field_list
        self.receptive_field_stride = receptive_field_stride
        self.bbox_small_list = bbox_small_list
        self.bbox_large_list = bbox_large_list
        self.receptive_field_center_start = receptive_field_center_start
        self.num_output_scales = num_output_scales
        self.constant = [i / 2.0 for i in self.receptive_field_list]

        # init log
        TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
        self.engine = None
        if load_trt_flag:
            with open(temp_trt_file, 'rb') as fin, trt.Runtime(TRT_LOGGER) as runtime:
                self.engine = runtime.deserialize_cuda_engine(fin.read())
        else:
            # declare builder object
            logging.info('Create TensorRT builder.')
            builder = trt.Builder(TRT_LOGGER)

            # get network object via builder
            logging.info('Create TensorRT network.')
            network = builder.create_network()

            # create ONNX parser object
            logging.info('Create TensorRT ONNX parser.')
            parser = trt.OnnxParser(network, TRT_LOGGER)

            with open(onnx_file_path, 'rb') as onnx_fin:
                parser.parse(onnx_fin.read())

            # print possible errors
            num_error = parser.num_errors
            if num_error != 0:
                logging.error('Errors occur while parsing the ONNX file!')
                for i in range(num_error):
                    temp_error = parser.get_error(i)
                    print(temp_error.desc())
                sys.exit(1)

            # create engine via builder
            builder.max_batch_size = 1
            builder.average_find_iterations = 2
            logging.info('Create TensorRT engine...')
            engine = builder.build_cuda_engine(network)

            # serialize engine
            if not os.path.exists('trt_file_cache/'):
                os.makedirs('trt_file_cache/')
            logging.info('Serialize the engine for fast init.')
            with open(os.path.join('trt_file_cache/', os.path.basename(onnx_file_path).replace('.onnx', '.trt')), 'wb') as fout:
                fout.write(engine.serialize())
            self.engine = engine

        self.output_shapes = []
        self.input_shapes = []
        for binding in self.engine:
            if self.engine.binding_is_input(binding):
                self.input_shapes.append(tuple([self.engine.max_batch_size] + list(self.engine.get_binding_shape(binding))))
            else:
                self.output_shapes.append(tuple([self.engine.max_batch_size] + list(self.engine.get_binding_shape(binding))))
        if len(self.input_shapes) != 1:
            logging.error('Only one input data is supported.')
            sys.exit(1)
        self.input_shape = self.input_shapes[0]
        logging.info('The required input size: %d, %d, %d' % (self.input_shape[2], self.input_shape[3], self.input_shape[1]))

        # create executor
        self.executor = self.engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = self.__allocate_buffers(self.engine)

    def __allocate_buffers(self, engine):
        inputs = []
        outputs = []
        bindings = []
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings

    def do_inference(self, image, score_threshold=0.4, top_k=10000, NMS_threshold=0.4, NMS_flag=True, skip_scale_branch_list=[]):

        if image.ndim != 3 or image.shape[2] != 3:
            print('Only RGB images are supported.')
            return None
        input_height = self.input_shape[2]
        input_width = self.input_shape[3]
        if image.shape[0] != input_height or image.shape[1] != input_width:
            logging.info('The size of input image is not %dx%d.\nThe input image will be resized keeping the aspect ratio.' % (input_height, input_width))

        input_batch = numpy.zeros((1, input_height, input_width, self.input_shape[1]), dtype=numpy.float32)
        left_pad = 0
        top_pad = 0
        if image.shape[0] / image.shape[1] > input_height / input_width:
            resize_scale = input_height / image.shape[0]
            input_image = cv2.resize(image, (0, 0), fx=resize_scale, fy=resize_scale)
            left_pad = int((input_width - input_image.shape[1]) / 2)
            input_batch[0, :, left_pad:left_pad + input_image.shape[1], :] = input_image
        else:
            resize_scale = input_width / image.shape[1]
            input_image = cv2.resize(image, (0, 0), fx=resize_scale, fy=resize_scale)
            top_pad = int((input_height - input_image.shape[0]) / 2)
            input_batch[0, top_pad:top_pad + input_image.shape[0], :, :] = input_image

        input_batch = input_batch.transpose([0, 3, 1, 2])
        input_batch = numpy.array(input_batch, dtype=numpy.float32, order='C')
        self.inputs[0].host = input_batch

        [cuda.memcpy_htod(inp.device, inp.host) for inp in self.inputs]
        self.executor.execute(batch_size=self.engine.max_batch_size, bindings=self.bindings)
        [cuda.memcpy_dtoh(output.host, output.device) for output in self.outputs]
        outputs = [out.host for out in self.outputs]
        outputs = [numpy.squeeze(output.reshape(shape)) for output, shape in zip(outputs, self.output_shapes)]

        bbox_collection = []
        for i in range(self.num_output_scales):
            if i in skip_scale_branch_list:
                continue

            score_map = numpy.squeeze(outputs[i * 2])

            # show feature maps-------------------------------
            # score_map_show = score_map * 255
            # score_map_show[score_map_show < 0] = 0
            # score_map_show[score_map_show > 255] = 255
            # cv2.imshow('score_map' + str(i), cv2.resize(score_map_show.astype(dtype=numpy.uint8), (0, 0), fx=2, fy=2))
            # cv2.waitKey()

            bbox_map = numpy.squeeze(outputs[i * 2 + 1])

            RF_center_Xs = numpy.array([self.receptive_field_center_start[i] + self.receptive_field_stride[i] * x for x in range(score_map.shape[1])])
            RF_center_Xs_mat = numpy.tile(RF_center_Xs, [score_map.shape[0], 1])
            RF_center_Ys = numpy.array([self.receptive_field_center_start[i] + self.receptive_field_stride[i] * y for y in range(score_map.shape[0])])
            RF_center_Ys_mat = numpy.tile(RF_center_Ys, [score_map.shape[1], 1]).T

            x_lt_mat = RF_center_Xs_mat - bbox_map[0, :, :] * self.constant[i]
            y_lt_mat = RF_center_Ys_mat - bbox_map[1, :, :] * self.constant[i]
            x_rb_mat = RF_center_Xs_mat - bbox_map[2, :, :] * self.constant[i]
            y_rb_mat = RF_center_Ys_mat - bbox_map[3, :, :] * self.constant[i]

            x_lt_mat = x_lt_mat
            x_lt_mat[x_lt_mat < 0] = 0
            y_lt_mat = y_lt_mat
            y_lt_mat[y_lt_mat < 0] = 0
            x_rb_mat = x_rb_mat
            x_rb_mat[x_rb_mat > input_width] = input_width
            y_rb_mat = y_rb_mat
            y_rb_mat[y_rb_mat > input_height] = input_height

            select_index = numpy.where(score_map > score_threshold)
            for idx in range(select_index[0].size):
                bbox_collection.append((x_lt_mat[select_index[0][idx], select_index[1][idx]] - left_pad,
                                        y_lt_mat[select_index[0][idx], select_index[1][idx]] - top_pad,
                                        x_rb_mat[select_index[0][idx], select_index[1][idx]] - left_pad,
                                        y_rb_mat[select_index[0][idx], select_index[1][idx]] - top_pad,
                                        score_map[select_index[0][idx], select_index[1][idx]]))

        # NMS
        bbox_collection = sorted(bbox_collection, key=lambda item: item[-1], reverse=True)
        if len(bbox_collection) > top_k:
            bbox_collection = bbox_collection[0:top_k]
        bbox_collection_numpy = numpy.array(bbox_collection, dtype=numpy.float32)
        bbox_collection_numpy = bbox_collection_numpy / resize_scale

        if NMS_flag:
            final_bboxes = NMS(bbox_collection_numpy, NMS_threshold)
            final_bboxes_ = []
            for i in range(final_bboxes.shape[0]):
                final_bboxes_.append((final_bboxes[i, 0], final_bboxes[i, 1], final_bboxes[i, 2], final_bboxes[i, 3], final_bboxes[i, 4]))

            return final_bboxes_
        else:
            return bbox_collection_numpy

def draw_border(img, pt1, pt2, color, thickness, r, d):
    x1, y1 = pt1
    x2, y2 = pt2

    line = cv2.line
    ellipse = cv2.ellipse

    # Top left
    line(img, (x1 + r, y1), (x1 + r + d, y1), color, thickness)
    line(img, (x1, y1 + r), (x1, y1 + r + d), color, thickness)
    ellipse(img, (x1 + r, y1 + r), (r, r), 180, 0, 90, color, thickness)

    # Top right
    line(img, (x2 - r, y1), (x2 - r - d, y1), color, thickness)
    line(img, (x2, y1 + r), (x2, y1 + r + d), color, thickness)
    ellipse(img, (x2 - r, y1 + r), (r, r), 270, 0, 90, color, thickness)

    # Bottom left
    line(img, (x1 + r, y2), (x1 + r + d, y2), color, thickness)
    line(img, (x1, y2 - r), (x1, y2 - r - d), color, thickness)
    ellipse(img, (x1 + r, y2 - r), (r, r), 90, 0, 90, color, thickness)

    # Bottom right
    line(img, (x2 - r, y2), (x2 - r - d, y2), color, thickness)
    line(img, (x2, y2 - r), (x2, y2 - r - d), color, thickness)
    ellipse(img, (x2 - r, y2 - r), (r, r), 0, 0, 90, color, thickness)

def run_inference(video_in, video_out, candidate_id, current_time):
    """
    :param video_in: input video needed to be processed, or, rtsp video stream feed
    :param video_out: output video of the process, used for re-checking
    :param candidate_id: id passed by RF
    :param current_time: 'd' or 'n', which is day or night -- the time this script is run
    :return: not yet known
    """

    # Initialize some frequently called methods to reduce time
    get_face_encodings = face_recognition.face_encodings
    compare_faces = face_recognition.compare_faces
    imshow = cv2.imshow
    rectangle = cv2.rectangle
    cvtColor = cv2.cvtColor
    COLOR_BGR2RGB = cv2.COLOR_BGR2RGB
    get_time = time.time

    GREEN = freq_cv.GREEN
    transform_coordinates = freq_cv.transform_coordinates
    get_smpl_encs = tnt_info.get_smpl_encs

    log_info = logging.info
    log_debug = logging.debug
    log_warning = logging.warning
    log_error = logging.error

    '''
    ********************************************************************************************************************
        VIDEO FILE/VIDEO STREAM HANDLING 
    ********************************************************************************************************************
    '''

    # Initialize video stuff
    fourcc = cv2.VideoWriter_fourcc(*'XVID')
    input_movie = cv2.VideoCapture(video_in)
    output_movie = cv2.VideoWriter(video_out, fourcc, 24, (rtsp_cam.WIDTH, rtsp_cam.HEIGHT))

    movie_read = input_movie.read
    movie_isOpened = input_movie.isOpened
    movie_write = output_movie.write

    '''
    ********************************************************************************************************************
        DETECTION MODEL STUFF INITIALIZATION
    ********************************************************************************************************************
    '''
    import sys
    sys.path.append('..')
    from config_farm import configuration_10_320_20L_5scales_v2 as cfg
    #from config_farm import configuration_10_560_25L_8scales_v1 as cfg

    onnx_file_path = './onnx_files/v2.onnx'
    myInference = Inference_TensorRT(
        onnx_file_path=onnx_file_path,
        receptive_field_list=cfg.param_receptive_field_list,
        receptive_field_stride=cfg.param_receptive_field_stride,
        bbox_small_list=cfg.param_bbox_small_list,
        bbox_large_list=cfg.param_bbox_large_list,
        receptive_field_center_start=cfg.param_receptive_field_center_start,
        num_output_scales=cfg.param_num_output_scales)

    do_inference = myInference.do_inference

    process_this_frame = True

    log_debug(f'spread roi used: {SPREAD_ROI}')
    # Handling the video
    while movie_isOpened():
        ret, frame = movie_read()
        # Bail out when the video file ends
        if not ret:
            log_info('video file ends')
            break

        #frame = cvtColor(frame, COLOR_BGR2RGB)
        frame_spread_roi = frame
        #frame_spread_roi = frame[SPREAD_ROI[1]:SPREAD_ROI[3], SPREAD_ROI[0]:SPREAD_ROI[2]]

        # Only process every other frame of video to save time
        if process_this_frame:
            start = get_time()
            bboxes = do_inference(frame_spread_roi, score_threshold=0.6, top_k=1000, NMS_threshold=0.2, NMS_flag=True)
            end = get_time()

            log_debug(f'detection takes: {(end-start):.3f}s')
            log_debug(f'len(bboxes) = {len(bboxes)}')

            if len(bboxes) == 0:
                log_info('detected: 0 face')

                imshow('Video', frame)
                movie_write(frame)
                continue
            elif len(bboxes) == 1:
                log_info('detected: 1 face')

                # get the (only) face location of this frame
                bbox = bboxes[0]
                bb_conf = bbox[4]
                bb_left = int(bbox[0])
                bb_top = int(bbox[1])
                bb_right = int(bbox[2])
                bb_bottom = int(bbox[3])

                log_debug(f'(conf | left, top, right, bottom) = ({bb_conf:.2f}, '
                          f'{bb_left}, {bb_top}, {bb_right}, {bb_bottom})')

                spread_roi_bb_coords = (bb_left, bb_top, bb_right, bb_bottom)

                # Convert the coordinates and update the bb values
                bb_left, bb_top, bb_right, bb_bottom = transform_coordinates(SPREAD_ROI,
                                                                             spread_roi_bb_coords)

                # if the scale used is spread-roi-scaled, anchor confidence is lower
                anchor_conf = 0.6
                if max(SPREAD_ROI) == 1280:
                    anchor_conf = 1.6

                # if detected bb has confidence lower than anchor confidence, don't do recognition
                if bb_conf < anchor_conf:
                    draw_border(frame, (bb_left, bb_top), (bb_right, bb_bottom), GREEN, 2, 5, 10)
                    imshow('Video', frame)
                    movie_write(frame)

                    log_info('confidence is low. Skipping ... ')
                    continue

                # confidence meets the requirement for doing recognition
                # css_type is needed for face_recognition.face_encodings
                css_type_face_location = [(bb_top, bb_right, bb_bottom, bb_left)]

                log_debug(f'css_type_face_location: {css_type_face_location}')

                # conversion to RGB is needed for face_recognition
                frame = cvtColor(frame, COLOR_BGR2RGB)

                '''
                import dlib
                shape_predictor = dlib.shape_predictor(api_dirs.shape_predictor_small)
                face_rec = dlib.face_recognition_model_v1(api_dirs.face_recognition)
                dlib_rectangle_face_location = dlib.rectangle(left=bb_left, top=bb_top,
                                                              right=bb_right, bottom=bb_bottom)
                shape = shape_predictor(frame, dlib_rectangle_face_location)
                face_chip = dlib.get_face_chip(frame, shape)
                start = get_time()
                face_encoding = face_rec.compute_face_descriptor(face_chip)
                end = get_time()
                log_info(f'calculating encoding takes: {(end - start):.3f}s')
                log_debug(f'face_encoding: {face_encoding}')
                log_debug(f'face_encoding type: {type(face_encoding)}')
                '''

                # get encoding for the face detected
                start = get_time()
                face_encoding = get_face_encodings(frame, css_type_face_location, 0)[0]
                end = get_time()

                log_debug(f'calculating encoding takes: {(end-start):.4f}s')

                # See if the face is a match for the known face(s) according to candidate id
                candidate_known_face_encodings = get_smpl_encs(candidate_id, current_time)
                start = get_time()
                matches = compare_faces(candidate_known_face_encodings, face_encoding, 0.5)
                end = get_time()

                log_debug(f'comparing encodings takes: {(end-start):.4f}s')
                log_debug(f'matches: {matches}')
                name = "Unknown"

                # If num of matches is over 50%, then it's it
                log_debug(f'True/total: {matches.count(True)}/{len(matches)}')

                if matches.count(True) > (len(matches) / 2):
                    name = tnt_info.tnt_name_tup[candidate_id].split()[-1]

                # Or instead, use the known face with the smallest distance to the new face
                # face_distances = face_recognition.face_distance(candidate_known_face_encodings, face_encoding)
                # best_match_index = np.argmin(face_distances)
                # if matches[best_match_index]:
                #     name = known_face_names[best_match_index]

                # Draw a box around the face
                draw_border(frame, (bb_left, bb_top), (bb_right, bb_bottom), GREEN, 2, 5, 10)
                freq_cv.stick_name(frame, (bb_left, bb_top, bb_right, bb_bottom), name)

                # Display the resulting image
                imshow('Video', frame)
                movie_write(frame)

                # Hit 'q' on the keyboard to quit!
                if cv2.waitKey(20) & 0xFF == ord('q'):
                    break
            elif len(bboxes) > 1:
                log_info('detected: > 1 face')

                # get the left-most bbox
                log_debug(f'bboxes looks like: {bboxes}')
                log_debug('this is to know at what index x1 stays. Take a look at it')
                for i, bbox in enumerate(bboxes):
                    log_debug(f'bboxes[{i}] = {bbox}')
                    log_debug(f'spread_roi_coords: ({bbox[0]}, {bbox[1]}, {bbox[2]}, {bbox[3]})')
                    spread_roi_bb_coords = (bbox[0], bbox[1], bbox[2], bbox[3])
                    #bb_left, bb_top, bb_right, bb_bottom = transform_coordinates(SPREAD_ROI, spread_roi_bb_coords)
                    rectangle(frame, (bb_left, bb_top),
                              (bb_right, bb_bottom), GREEN, 2)

                log_debug(f'min of x1 values = {min((bboxes[0:])[0])}')

            else:
                log_error('num of faces detected < 0')
                exit()

        log_debug(f'max(frame.shape[:2]) = {max(frame.shape[:2])}')

        if max(frame.shape[:2]) > 1440:
            scale = 1440 / max(frame.shape[:2])
            frame = cv2.resize(frame, (0, 0), fx=scale, fy=scale)
        cv2.imshow('Video', frame)
        cv2.waitKey(10)

        process_this_frame = not process_this_frame

"""
def verify_face(to_be_verified, smpl, candidate_id):
    # check to determine which one is larger to be the anchor img size
"""

@profile
def main():
    args = parse_args()
    print('Called with args:')
    print(args)

    # assign the variables
    candidate_id = args.candidate_id

    # get current time when this script is called
    now = datetime.now()
    today_night_time = now.replace(hour=18, minute=0, second=0, microsecond=0)
    print("[log  ] now: {}".format(now))
    if now < today_night_time:
        current_time = 'd'
    else:
        current_time = 'n'

    # freq_cv.open_window(CAP_WINDOW_NAME, rtsp_cam.WIDTH, rtsp_cam.HEIGHT, "Captured")

    run_inference(args.video_in, args.video_out, candidate_id, current_time)

    cv2.destroyAllWindows()

if __name__ == '__main__':
    main()

Can you take a look at it or come up with any more ideas?

congphase commented 4 years ago

@Reddyforcode

I've recently narrowed down the scope of this issue. The problem is this line:

import pycuda.autoinit

What I did: I wrote a script that takes an image containing an obvious face and follows this workflow: perform detection, then get the face encoding and print it. The script has two parts. The first part performs the workflow with dlib used for detection and face_recognition (which is also dlib to some extent) used for getting the face encoding; the second part performs the same workflow, the only difference being that LFFD is used for detection. On the first run I commented out the second part, and that ugly error still showed up when the face_encodings function executed. Next, I uncommented the second part and commented out the first; the error still showed up at the same function. I then tried commenting out each of the import statements, and it turned out that import pycuda.autoinit is the culprit: when I commented it out, part 1 ran beautifully with no errors. Having this line causes dlib's cudnnConvolutionForward() function to fail. My code (just look for the eye-catching "FIRST PART" and "WHERE SECOND PART ACTUALLY RUNS" markers):

import dlib
import cv2
import os
import logging
import numpy
import api_dirs
import freq_cv
import pycuda.driver as cuda

import pycuda.autoinit

import tensorrt as trt
import face_recognition

import my_api

test_img = cv2.imread("/home/gate/lffd-dir/A-Light-and-Fast-Face-Detector-for-Edge-Devices/face_detection/deploy_tensorrt/high_conf_446.jpg")
get_face_encodings = face_recognition.face_encodings

######################################################################
# FIRST PART                                                         #                              
######################################################################
# dlib

cnn_detector = dlib.cnn_face_detection_model_v1("/home/gate/Downloads/mmod_human_face_detector.dat")
test_img_dlib = cv2.cvtColor(test_img, cv2.COLOR_BGR2RGB)

# just for debugging
cv2.imwrite('test_img_dlib.jpg', test_img_dlib)
face_locations = cnn_detector(test_img_dlib, 0)

print(f'[dlib] face_locations = {face_locations}')

face_location = face_locations[0]
bb_conf = face_location.confidence
bb_left = face_location.rect.left()
bb_top = face_location.rect.top()
bb_right = face_location.rect.right()
bb_bottom = face_location.rect.bottom()

print(f'[dlib] (l, t, r, b, c) = ({bb_left}, {bb_top}, {bb_right}, {bb_bottom}, {bb_conf})')

test_img_dlib = cv2.rectangle(test_img_dlib, (bb_left, bb_top), (bb_right, bb_bottom), (0, 255, 0), 2)
cv2.imwrite(f'[dlib] detected_dlib.jpg', test_img_dlib)

print(f'[dlib] detected_dlib.jpg written!')
print(f'[dlib] doing face recognition ... ')

css_type_face_location = [(bb_top, bb_right, bb_bottom, bb_left)]

print(f'[dlib] css = {css_type_face_location}')

face_encoding = get_face_encodings(test_img_dlib, css_type_face_location, 0)[0] # error showed up at this line

print(f'[dlib] face_encoding:\n{face_encoding}')

######################################################################
# SECOND PART                                                        #
######################################################################
# lffd

logging.getLogger().setLevel(logging.DEBUG)

def NMS(boxes, overlap_threshold):
    '''

    :param boxes: numpy nx5, n is the number of boxes, 0:4->x1, y1, x2, y2, 4->score
    :param overlap_threshold:
    :return:
    '''
    if boxes.shape[0] == 0:
        return boxes

    # if the bounding boxes are integers, convert them to floats --
    # this is important since we'll be doing a bunch of divisions
    if boxes.dtype != numpy.float32:
        boxes = boxes.astype(numpy.float32)

    # initialize the list of picked indexes
    pick = []
    # grab the coordinates of the bounding boxes
    x1 = boxes[:, 0]
    y1 = boxes[:, 1]
    x2 = boxes[:, 2]
    y2 = boxes[:, 3]
    sc = boxes[:, 4]
    widths = x2 - x1
    heights = y2 - y1

    # compute the area of the bounding boxes and sort the bounding
    # boxes by their scores
    area = heights * widths
    idxs = numpy.argsort(sc)

    # keep looping while some indexes still remain in the indexes list
    while len(idxs) > 0:
        # grab the last index in the indexes list and add the
        # index value to the list of picked indexes
        last = len(idxs) - 1
        i = idxs[last]
        pick.append(i)

        # compare the picked box with the remaining (lower-scored) boxes
        xx1 = numpy.maximum(x1[i], x1[idxs[:last]])
        yy1 = numpy.maximum(y1[i], y1[idxs[:last]])
        xx2 = numpy.minimum(x2[i], x2[idxs[:last]])
        yy2 = numpy.minimum(y2[i], y2[idxs[:last]])

        # compute the width and height of the box
        w = numpy.maximum(0, xx2 - xx1 + 1)
        h = numpy.maximum(0, yy2 - yy1 + 1)

        # compute the ratio of overlap
        overlap = (w * h) / area[idxs[:last]]

        # delete all indexes from the index list whose overlap exceeds the threshold
        idxs = numpy.delete(idxs, numpy.concatenate(([last], numpy.where(overlap > overlap_threshold)[0])))

    # return only the bounding boxes that were picked using the
    # integer data type
    return boxes[pick]

# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

class Inference_TensorRT:
    def __init__(self, onnx_file_path,
                 receptive_field_list,
                 receptive_field_stride,
                 bbox_small_list,
                 bbox_large_list,
                 receptive_field_center_start,
                 num_output_scales):

        temp_trt_file = os.path.join('trt_file_cache/', os.path.basename(onnx_file_path).replace('.onnx', '.trt'))

        load_trt_flag = False
        if not os.path.exists(temp_trt_file):
            if not os.path.exists(onnx_file_path):
                logging.error('ONNX file does not exist!')
                sys.exit(1)
            logging.info('Init engine from ONNX file.')
        else:
            load_trt_flag = True
            logging.info('Init engine from serialized engine.')

        self.receptive_field_list = receptive_field_list
        self.receptive_field_stride = receptive_field_stride
        self.bbox_small_list = bbox_small_list
        self.bbox_large_list = bbox_large_list
        self.receptive_field_center_start = receptive_field_center_start
        self.num_output_scales = num_output_scales
        self.constant = [i / 2.0 for i in self.receptive_field_list]

        # init log
        TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)
        self.engine = None
        if load_trt_flag:
            with open(temp_trt_file, 'rb') as fin, trt.Runtime(TRT_LOGGER) as runtime:
                self.engine = runtime.deserialize_cuda_engine(fin.read())
        else:
            # declare builder object
            logging.info('Create TensorRT builder.')
            builder = trt.Builder(TRT_LOGGER)

            # get network object via builder
            logging.info('Create TensorRT network.')
            network = builder.create_network()

            # create ONNX parser object
            logging.info('Create TensorRT ONNX parser.')
            parser = trt.OnnxParser(network, TRT_LOGGER)

            with open(onnx_file_path, 'rb') as onnx_fin:
                parser.parse(onnx_fin.read())

            # print possible errors
            num_error = parser.num_errors
            if num_error != 0:
                logging.error('Errors occur while parsing the ONNX file!')
                for i in range(num_error):
                    temp_error = parser.get_error(i)
                    print(temp_error.desc())
                sys.exit(1)

            # create engine via builder
            builder.max_batch_size = 1
            builder.average_find_iterations = 2
            logging.info('Create TensorRT engine...')
            engine = builder.build_cuda_engine(network)

            # serialize engine
            if not os.path.exists('trt_file_cache/'):
                os.makedirs('trt_file_cache/')
            logging.info('Serialize the engine for fast init.')
            with open(os.path.join('trt_file_cache/', os.path.basename(onnx_file_path).replace('.onnx', '.trt')), 'wb') as fout:
                fout.write(engine.serialize())
            self.engine = engine

        self.output_shapes = []
        self.input_shapes = []
        for binding in self.engine:
            if self.engine.binding_is_input(binding):
                self.input_shapes.append(tuple([self.engine.max_batch_size] + list(self.engine.get_binding_shape(binding))))
            else:
                self.output_shapes.append(tuple([self.engine.max_batch_size] + list(self.engine.get_binding_shape(binding))))
        if len(self.input_shapes) != 1:
            logging.error('Only one input data is supported.')
            sys.exit(1)
        self.input_shape = self.input_shapes[0]
        logging.info('The required input size: %d, %d, %d' % (self.input_shape[2], self.input_shape[3], self.input_shape[1]))

        # create executor
        self.executor = self.engine.create_execution_context()
        self.inputs, self.outputs, self.bindings = self.__allocate_buffers(self.engine)

    def __allocate_buffers(self, engine):
        inputs = []
        outputs = []
        bindings = []
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings

    def do_inference(self, image, score_threshold=0.4, top_k=10000, NMS_threshold=0.4, NMS_flag=True, skip_scale_branch_list=[]):

        if image.ndim != 3 or image.shape[2] != 3:
            print('Only RGB images are supported.')
            return None
        input_height = self.input_shape[2]
        input_width = self.input_shape[3]
        if image.shape[0] != input_height or image.shape[1] != input_width:
            logging.info('The size of input image is not %dx%d.\nThe input image will be resized keeping the aspect ratio.' % (input_height, input_width))

        input_batch = numpy.zeros((1, input_height, input_width, self.input_shape[1]), dtype=numpy.float32)
        left_pad = 0
        top_pad = 0
        if image.shape[0] / image.shape[1] > input_height / input_width:
            resize_scale = input_height / image.shape[0]
            input_image = cv2.resize(image, (0, 0), fx=resize_scale, fy=resize_scale)
            left_pad = int((input_width - input_image.shape[1]) / 2)
            input_batch[0, :, left_pad:left_pad + input_image.shape[1], :] = input_image
        else:
            resize_scale = input_width / image.shape[1]
            input_image = cv2.resize(image, (0, 0), fx=resize_scale, fy=resize_scale)
            top_pad = int((input_height - input_image.shape[0]) / 2)
            input_batch[0, top_pad:top_pad + input_image.shape[0], :, :] = input_image

        input_batch = input_batch.transpose([0, 3, 1, 2])
        input_batch = numpy.array(input_batch, dtype=numpy.float32, order='C')
        self.inputs[0].host = input_batch

        [cuda.memcpy_htod(inp.device, inp.host) for inp in self.inputs]
        self.executor.execute(batch_size=self.engine.max_batch_size, bindings=self.bindings)
        [cuda.memcpy_dtoh(output.host, output.device) for output in self.outputs]
        outputs = [out.host for out in self.outputs]
        outputs = [numpy.squeeze(output.reshape(shape)) for output, shape in zip(outputs, self.output_shapes)]

        bbox_collection = []
        for i in range(self.num_output_scales):
            if i in skip_scale_branch_list:
                continue

            score_map = numpy.squeeze(outputs[i * 2])

            # show feature maps-------------------------------
            # score_map_show = score_map * 255
            # score_map_show[score_map_show < 0] = 0
            # score_map_show[score_map_show > 255] = 255
            # cv2.imshow('score_map' + str(i), cv2.resize(score_map_show.astype(dtype=numpy.uint8), (0, 0), fx=2, fy=2))
            # cv2.waitKey()

            bbox_map = numpy.squeeze(outputs[i * 2 + 1])

            RF_center_Xs = numpy.array([self.receptive_field_center_start[i] + self.receptive_field_stride[i] * x for x in range(score_map.shape[1])])
            RF_center_Xs_mat = numpy.tile(RF_center_Xs, [score_map.shape[0], 1])
            RF_center_Ys = numpy.array([self.receptive_field_center_start[i] + self.receptive_field_stride[i] * y for y in range(score_map.shape[0])])
            RF_center_Ys_mat = numpy.tile(RF_center_Ys, [score_map.shape[1], 1]).T

            x_lt_mat = RF_center_Xs_mat - bbox_map[0, :, :] * self.constant[i]
            y_lt_mat = RF_center_Ys_mat - bbox_map[1, :, :] * self.constant[i]
            x_rb_mat = RF_center_Xs_mat - bbox_map[2, :, :] * self.constant[i]
            y_rb_mat = RF_center_Ys_mat - bbox_map[3, :, :] * self.constant[i]

            x_lt_mat = x_lt_mat
            x_lt_mat[x_lt_mat < 0] = 0
            y_lt_mat = y_lt_mat
            y_lt_mat[y_lt_mat < 0] = 0
            x_rb_mat = x_rb_mat
            x_rb_mat[x_rb_mat > input_width] = input_width
            y_rb_mat = y_rb_mat
            y_rb_mat[y_rb_mat > input_height] = input_height

            select_index = numpy.where(score_map > score_threshold)
            for idx in range(select_index[0].size):
                bbox_collection.append((x_lt_mat[select_index[0][idx], select_index[1][idx]] - left_pad,
                                        y_lt_mat[select_index[0][idx], select_index[1][idx]] - top_pad,
                                        x_rb_mat[select_index[0][idx], select_index[1][idx]] - left_pad,
                                        y_rb_mat[select_index[0][idx], select_index[1][idx]] - top_pad,
                                        score_map[select_index[0][idx], select_index[1][idx]]))

        # NMS
        bbox_collection = sorted(bbox_collection, key=lambda item: item[-1], reverse=True)
        if len(bbox_collection) > top_k:
            bbox_collection = bbox_collection[0:top_k]
        bbox_collection_numpy = numpy.array(bbox_collection, dtype=numpy.float32)
        bbox_collection_numpy = bbox_collection_numpy / resize_scale

        if NMS_flag:
            final_bboxes = NMS(bbox_collection_numpy, NMS_threshold)
            final_bboxes_ = []
            for i in range(final_bboxes.shape[0]):
                final_bboxes_.append((final_bboxes[i, 0], final_bboxes[i, 1], final_bboxes[i, 2], final_bboxes[i, 3], final_bboxes[i, 4]))

            return final_bboxes_
        else:
            return bbox_collection_numpy

######################################################################
# WHERE SECOND PART ACTUALLY RUNS                                   #
######################################################################

import sys
sys.path.append('..')
from config_farm import configuration_10_320_20L_5scales_v2 as cfg

onnx_file_path = './onnx_files/v2.onnx'
myInference = Inference_TensorRT(
    onnx_file_path=onnx_file_path,
    receptive_field_list=cfg.param_receptive_field_list,
    receptive_field_stride=cfg.param_receptive_field_stride,
    bbox_small_list=cfg.param_bbox_small_list,
    bbox_large_list=cfg.param_bbox_large_list,
    receptive_field_center_start=cfg.param_receptive_field_center_start,
    num_output_scales=cfg.param_num_output_scales)

do_inference = myInference.do_inference

test_img_lffd = test_img

# just for debugging
cv2.imwrite('test_img_lffd.jpg', test_img_lffd)

bboxes = do_inference(test_img_lffd, score_threshold=0.6, top_k=1000, NMS_threshold=0.2, NMS_flag=True)

bbox = bboxes[0]
bb_conf = bbox[4]
bb_left = bbox[0]
bb_top = bbox[1]
bb_right = bbox[2]
bb_bottom = bbox[3]

print(f'[lffd](l, t, r, b, c) = ({bb_left}, {bb_top}, {bb_right}, {bb_bottom}, {bb_conf})')

test_img_lffd = cv2.rectangle(test_img_lffd, (bb_left, bb_top), (bb_right, bb_bottom), freq_cv.GREEN, 2)

cv2.imwrite(f'[lffd] detected_lffd.jpg', test_img_lffd)

print(f'[lffd] detected_lffd.jpg written!')
print(f'[lffd] doing face recognition ... ')

# convert to int because LFFD returns numpy.float32 coordinates
css_type_face_location = [(int(bb_top), int(bb_right), int(bb_bottom), int(bb_left))]

print(f'[lffd]css = {css_type_face_location}')

face_encoding = get_face_encodings(test_img_lffd, css_type_face_location, 0)[0] # error showed up at this line

print(f'[lffd]face_encoding:\n{face_encoding}')

"The module pycuda.autoinit, when imported, automatically performs all the steps necessary to get CUDA ready for submission of compute kernels": link

I think dlib and pycuda.autoinit have different memory-handling mechanisms that conflict with each other, or one of them has a silent bug. The easiest way out is to sacrifice one of them, but I want to use both LFFD (for detection), which needs pycuda.autoinit, and dlib (for recognition), which hates pycuda.autoinit, so I have to somehow "synchronize" them. I haven't figured out how, because my C++ is bad; looking at cudnn_dlibapi.cpp makes me nearly blind.
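
(One possible way to "synchronize" them, sketched under the assumption that the TensorRT/LFFD side only needs a current CUDA context during its own calls: drop import pycuda.autoinit and manage the pycuda context explicitly, pushing it around inference and popping it afterwards so dlib is left alone. This is a hypothetical sketch, not a verified fix.)

import pycuda.driver as cuda

# Explicit context management instead of `import pycuda.autoinit`.
cuda.init()
trt_ctx = cuda.Device(0).make_context()  # created and made current here
trt_ctx.pop()                            # leave no pycuda context current by default

def run_trt_inference(do_inference, image, **kwargs):
    """Make the pycuda context current only for the duration of the
    TensorRT call, so dlib/face_recognition runs outside of it."""
    trt_ctx.push()
    try:
        return do_inference(image, **kwargs)
    finally:
        trt_ctx.pop()

# at shutdown: trt_ctx.detach()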

Can you help me, or suggest any alternative ideas? I would really really appreciate it :(

paddy2013 commented 4 years ago

I solved this by changing the import order of the libraries: keep import pycuda.autoinit before import face_recognition everywhere in the program. I put import pycuda.autoinit in my startup file and the problem was solved. Here is the test program:

from PIL import Image;
import numpy as np;
import face_recognition as fr;
import pycuda.autoinit

image = Image.open("/home/nvidia/image5.jpg")
(width,height)=image.size;
image_np = np.array(image.getdata()).reshape((height,width,3)).astype(np.uint8)
results = fr.face_encodings(image_np)
print(results)

The program above produces this error:

RuntimeError: Error while calling cudnnConvolutionForward( context(), &alpha, descriptor(data), data.device(), (const cudnnFilterDescriptor_t)filter_handle, filters.device(), (const cudnnConvolutionDescriptor_t)conv_handle, (cudnnConvolutionFwdAlgo_t)forward_algo, forward_workspace, forward_workspace_size_in_bytes, &beta, descriptor(output), output.device()) in file /tmp/pip-install-rooy8wlc/dlib/dlib/cuda/cudnn_dlibapi.cpp:1004. code: 7, reason: A call to cuDNN failed

When I change the import order, the error disappears and the face encoding prints. Here is the code:

from PIL import Image;
import numpy as np;
import pycuda.autoinit
import face_recognition as fr;

image = Image.open("/home/nvidia/image5.jpg")
(width,height)=image.size;
image_np = np.array(image.getdata()).reshape((height,width,3)).astype(np.uint8)
results = fr.face_encodings(image_np)
print(results)

I think the key may be that pycuda.autoinit must be imported before face_recognition everywhere in the program, so you should check the import order in all your files.

This is my environment: Python 3.6.9, CUDA 10.0, TensorRT 6.0.1.10.

daynial132 commented 4 years ago

I got the same error message, but it appears only after the apps have been running for a long time. I have created two apps running on the same PC doing different work. Please help.

ikros98 commented 2 years ago

I got the same error message, but it appears only after the apps have been running for a long time. I have created two apps running on the same PC doing different work. Please help.

Have you managed to solve it?