SCZwangxiao commented 11 months ago

Background

I was processing a large-scale human-talking dataset (~10M images) and found that GPU utilization was very low (below 10%), even when using the batch API.

As discussed in #343, profiling the code showed that the bottleneck is the unparallelized get_predictions().

I've solved this issue with a parallelized implementation. A detailed explanation follows.

Explanation

Below is the original code. It is slow because it has three nested loops, so we should parallelize one of them.

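from typing import List

import numpy as np

# decode() is the repository's existing helper that turns regression offsets and priors into boxes.
def get_predictions_original(olist: List[np.ndarray], batch_size: int) -> np.ndarray: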
    bboxlists = []
    variances = [0.1, 0.2]
    for j in range(batch_size):
        bboxlist = []
        for i in range(len(olist) // 2):
            ocls, oreg = olist[i * 2], olist[i * 2 + 1]
            stride = 2**(i + 2)    # 4,8,16,32,64,128
            poss = zip(*np.where(ocls[:, 1, :, :] > 0.05))
            for Iindex, hindex, windex in poss:
                axc, ayc = stride / 2 + windex * stride, stride / 2 + hindex * stride
                score = ocls[j, 1, hindex, windex]
                loc = oreg[j, :, hindex, windex].copy().reshape(1, 4)
                priors = np.array([[axc / 1.0, ayc / 1.0, stride * 4 / 1.0, stride * 4 / 1.0]])
                box = decode(loc, priors, variances)
                x1, y1, x2, y2 = box[0]
                bboxlist.append([x1, y1, x2, y2, score])

        bboxlists.append(bboxlist)

    bboxlists = np.array(bboxlists)
    return bboxlists

Note that the batch index j is used only inside the innermost loop body, so we can swap the loop order to make this clearer:

def get_predictions_v1(olist: List[np.ndarray], batch_size: int) -> np.ndarray:
    bboxlists = [[] for _ in range(batch_size)]  # Changed
    variances = [0.1, 0.2]
    for i in range(len(olist) // 2):  # Changed
        ocls, oreg = olist[i * 2], olist[i * 2 + 1]
        stride = 2**(i + 2)
        poss = zip(*np.where(ocls[:, 1, :, :] > 0.05))
        for Iindex, hindex, windex in poss:
            axc, ayc = stride / 2 + windex * stride, stride / 2 + hindex * stride
            priors = np.array([[axc / 1.0, ayc / 1.0, stride * 4 / 1.0, stride * 4 / 1.0]])
            for j in range(batch_size):  # Changed
                score = ocls[j, 1, hindex, windex]
                loc = oreg[j, :, hindex, windex].copy().reshape(1, 4)
                box = decode(loc, priors, variances)
                x1, y1, x2, y2 = box[0]
                bboxlists[j].append([x1, y1, x2, y2, score])  # Changed

    bboxlists = np.array(bboxlists)
    return bboxlists

Finally, the loop over batch_size can be parallelized by vectorizing over the batch dimension:

def get_predictions_v2(olist: List[np.ndarray], batch_size: int) -> np.ndarray:
    bboxlists = []
    variances = [0.1, 0.2]
    for i in range(len(olist) // 2):
        ocls, oreg = olist[i * 2], olist[i * 2 + 1]
        stride = 2**(i + 2)
        poss = zip(*np.where(ocls[:, 1, :, :] > 0.05))
        for Iindex, hindex, windex in poss:
            axc, ayc = stride / 2 + windex * stride, stride / 2 + hindex * stride
            priors = np.array([[axc / 1.0, ayc / 1.0, stride * 4 / 1.0, stride * 4 / 1.0]])
            #### Below are the changes ####
            score = ocls[:, 1, hindex, windex][:, None]
            loc = oreg[:, :, hindex, windex].copy()
            boxes = decode(loc, priors, variances)  # decode() function is luckily suitable for new code
            bboxlists.append(np.concatenate((boxes, score), axis=1))

    if len(bboxlists) == 0:
        # Here for the consistency with the original code when no face is detected.
        bboxlists = np.array([[] for _ in range(batch_size)])
    else:
        bboxlists = np.stack(bboxlists, axis=1)
    return bboxlists
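
The equivalence between the per-item loop in v1 and the batched slice in v2 can be checked on random data. The sketch below does this for a single feature-map location. It is only an illustration: decode_standin is a hypothetical stand-in modeled on the common SSD-style box decoding (the repository's own decode() may differ in detail), and the helper names, shapes, and values are made up.

```python
import numpy as np

def decode_standin(loc, priors, variances):
    """Hypothetical stand-in for decode(): SSD-style offsets -> corner boxes."""
    centers = priors[:, :2] + loc[:, :2] * variances[0] * priors[:, 2:]
    sizes = priors[:, 2:] * np.exp(loc[:, 2:] * variances[1])
    top_left = centers - sizes / 2
    return np.concatenate((top_left, top_left + sizes), axis=1)

def per_item(ocls, oreg, priors, variances, h, w):
    # v1 style: decode one batch item at a time.
    rows = []
    for j in range(ocls.shape[0]):
        score = ocls[j, 1, h, w]
        loc = oreg[j, :, h, w].copy().reshape(1, 4)
        x1, y1, x2, y2 = decode_standin(loc, priors, variances)[0]
        rows.append([x1, y1, x2, y2, score])
    return np.array(rows)

def batched(ocls, oreg, priors, variances, h, w):
    # v2 style: slice the whole batch and decode in one call.
    score = ocls[:, 1, h, w][:, None]
    loc = oreg[:, :, h, w].copy()
    boxes = decode_standin(loc, priors, variances)
    return np.concatenate((boxes, score), axis=1)

rng = np.random.default_rng(0)
batch_size, height, width = 4, 8, 8
ocls = rng.random((batch_size, 2, height, width))
oreg = rng.standard_normal((batch_size, 4, height, width))

stride, h, w = 4, 3, 5
axc, ayc = stride / 2 + w * stride, stride / 2 + h * stride
priors = np.array([[axc, ayc, stride * 4.0, stride * 4.0]])
variances = [0.1, 0.2]

loop_result = per_item(ocls, oreg, priors, variances, h, w)
batch_result = batched(ocls, oreg, priors, variances, h, w)
same = np.allclose(loop_result, batch_result)
```

The batched call works here because the single-row priors array broadcasts against the (batch_size, 4) offsets, so each batch item gets its own decoded box.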

SCZwangxiao commented 11 months ago

The unit test test/facealignment_test.py failed, but it passes in my environment. That's strange.

emlcpfx commented 10 months ago

Hi @SCZwangxiao, I got a 10% performance boost using V1. V2 throws an error for me about thr.

TypeError: get_predictions() missing 1 required positional argument: 'thr'

Any thoughts on how I can get that working?

SCZwangxiao commented 10 months ago

Sorry for the typo. I've updated the V2 code with the correct version.

thr refers to the 0.05 in poss = zip(*np.where(ocls[:, 1, :, :] > 0.05)). We use thr to filter low-confidence candidates in our private project.
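
As a rough illustration of that threshold, the sketch below wraps the same np.where expression in a small function with a thr argument. The helper name candidate_positions and the toy score values are hypothetical and not part of the repository.

```python
import numpy as np

def candidate_positions(ocls, thr=0.05):
    """Hypothetical helper: (batch, row, col) indices whose face score exceeds thr."""
    return list(zip(*np.where(ocls[:, 1, :, :] > thr)))

# Toy classification output: batch of 2, 2 classes, 3x3 feature map.
ocls = np.zeros((2, 2, 3, 3))
ocls[0, 1, 1, 2] = 0.9   # confident candidate
ocls[1, 1, 0, 0] = 0.04  # below the default threshold
ocls[1, 1, 2, 1] = 0.3   # passes 0.05 but not 0.5

default_hits = candidate_positions(ocls)
strict_hits = candidate_positions(ocls, thr=0.5)
```

Raising thr shrinks the candidate list, which also reduces the number of iterations of the position loop.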

emlcpfx commented 10 months ago

Thanks. It works now! I'm getting slightly faster results with v1 than v2. They're both faster than the original.
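
Which variant is faster may depend on the batch size and on how many candidates pass the threshold, so it can be worth measuring on your own data. Below is a minimal timing sketch; best_time and the two toy functions are hypothetical and only contrast a per-item loop with a batched slice, not the full get_predictions() variants.

```python
import time
import numpy as np

def best_time(fn, *args, repeats=5):
    """Hypothetical helper: smallest wall-clock time over several runs of fn(*args)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)

def scores_loop(ocls, h, w):
    # Read the score of each batch item one at a time.
    return np.array([ocls[j, 1, h, w] for j in range(ocls.shape[0])])

def scores_batched(ocls, h, w):
    # Read the scores of all batch items with one slice.
    return ocls[:, 1, h, w]

ocls = np.random.default_rng(0).random((8, 2, 64, 64))
loop_seconds = best_time(scores_loop, ocls, 10, 20)
batched_seconds = best_time(scores_batched, ocls, 10, 20)
```

To compare the real variants, pass get_predictions_v1 and get_predictions_v2 with the same olist and batch_size instead of the toy functions.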

1adrianb commented 10 months ago

Thanks for your contribution, @SCZwangxiao, looks good! I will check what is going on with the test; it seems to be fine locally indeed.

1adrianb / face-alignment

Speedup face detection moudle by parallelizing `get_predictions()` #347
