Problem

Hi, I am using the FaceAlignment module to pre-process the VoxCeleb2 dataset. However, GPU utilization is low (below 10%), and the estimated running time is extremely long (about a month on eight 3090 GPUs).

Potential cause

After profiling the code with pprofile, I found that the bottleneck is the post-processing in the batch_detect function (the code after net(img_batch)). That post-processing accounts for 51.35% of the total running time; the details are below.
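For reference, the 51.35% figure can be reproduced from the numbers in the profiling tables of this report. The snippet below is only a small arithmetic sketch; it assumes the percentages in the tables are shares of the total run time and that the 2146 hits on the per-image loop in batch_detect correspond to 2146 processed images.

```python
# Figures copied from the pprofile tables in this report.
batch_detect_s, batch_detect_pct = 344.793, 52.19   # whole batch_detect call
net_forward_s, net_forward_pct = 5.53178, 0.84      # net(img_batch.float())
images = 2146  # assumed: hits on the per-image loop body in batch_detect

# Post-processing is everything in batch_detect except the forward pass.
post_pct = round(batch_detect_pct - net_forward_pct, 2)
post_s = batch_detect_s - net_forward_s

print(f"post-processing share of total run time: {post_pct}%")
print(f"post-processing share of batch_detect:   {post_s / batch_detect_s:.1%}")
print(f"batch_detect time per image:             {batch_detect_s / images * 1e3:.0f} ms")
print(f"detector forward pass per image:         {net_forward_s / images * 1e3:.1f} ms")
```

One caveat: CUDA kernels are launched asynchronously, so part of the forward-pass cost may be attributed to the later .cpu() call (0.28% in the table) rather than to net(...) itself. Even with that added, the forward pass stays small compared with the post-processing.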

Appendix of the profiling results

Profiling results of the get_landmarks_from_batch function in the FaceAlignment module (note that self.face_detector.detect_from_batch(image_batch) and self.face_alignment_net(inp)[-1].detach() are the most time-consuming calls, at 52.50% and 36.73% of the total time respectively):

Line #|      Hits|         Time| Time per hit|      %|Source code
181|        76|  0.000495672|    6.522e-06|  0.00%|    @torch.no_grad()
(call)|         1|  1.69277e-05|  1.69277e-05|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/autograd/grad_mode.py:114 __init__
182|         1|  1.45435e-05|  1.45435e-05|  0.00%|    def get_landmarks_from_batch(self, image_batch, detected_faces=None):
(call)|         1|  0.000118494|  0.000118494|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/autograd/grad_mode.py:20 __call__
183|         0|            0|            0|  0.00%|        """Predict the landmarks for each face present in the image.
184|         0|            0|            0|  0.00%|
185|         0|            0|            0|  0.00%|        This function predicts a set of 68 2D or 3D images, one for each image in a batch in parallel.
186|         0|            0|            0|  0.00%|        If detect_faces is None the method will also run a face detector.
187|         0|            0|            0|  0.00%|
188|         0|            0|            0|  0.00%|         Arguments:
189|         0|            0|            0|  0.00%|            image_batch {torch.tensor} -- The input images batch
190|         0|            0|            0|  0.00%|
191|         0|            0|            0|  0.00%|        Keyword Arguments:
192|         0|            0|            0|  0.00%|            detected_faces {list of numpy.array} -- list of bounding boxes, one for each face found
193|         0|            0|            0|  0.00%|            in the image (default: {None})
194|         0|            0|            0|  0.00%|        """
195|         0|            0|            0|  0.00%|
196|        75|  0.000388145|  5.17527e-06|  0.00%|        if detected_faces is None:
197|        75|   0.00160074|  2.13432e-05|  0.00%|            detected_faces = self.face_detector.detect_from_batch(image_batch)
(call)|        75|      346.853|       4.6247| 52.50%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/detection/sfd/sfd_detector.py:46 detect_from_batch
198|         0|            0|            0|  0.00%|
199|        75|  0.000437737|  5.83649e-06|  0.00%|        if len(detected_faces) == 0:
200|         0|            0|            0|  0.00%|            print("Warning: No faces were detected.")
201|         0|            0|            0|  0.00%|            return None
202|         0|            0|            0|  0.00%|
203|        75|  0.000319719|  4.26292e-06|  0.00%|        landmarks = []
204|         0|            0|            0|  0.00%|        # A batch for each frame
205|      2205|   0.00754595|   3.4222e-06|  0.00%|        for i, faces in enumerate(detected_faces):
206|      2131|   0.00637937|   2.9936e-06|  0.00%|            landmark_set = []
207|      4264|    0.0141294|  3.31365e-06|  0.00%|            for face in faces:
208|      2134|   0.00639439|  2.99643e-06|  0.00%|                center = torch.FloatTensor(
209|      2134|    0.0290866|  1.36301e-05|  0.00%|                    [(face[2] + face[0]) / 2.0,
210|      2134|    0.0272794|  1.27832e-05|  0.00%|                     (face[3] + face[1]) / 2.0])
211|         0|            0|            0|  0.00%|
212|      2134|    0.0523539|  2.45332e-05|  0.01%|                center[1] = center[1] - (face[3] - face[1]) * 0.12
213|      2134|    0.0326109|  1.52816e-05|  0.00%|                scale = (face[2] - face[0] + face[3] - face[1]) / self.face_detector.reference_scale
(call)|      2134|    0.0116174|  5.44397e-06|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/detection/sfd/sfd_detector.py:57 reference_scale
214|      2134|     0.386902|  0.000181304|  0.06%|                image = image_batch[i].cpu().numpy()
215|         0|            0|            0|  0.00%|
216|      2134|    0.0204592|  9.58724e-06|  0.00%|                image = image.transpose(1, 2, 0)
217|         0|            0|            0|  0.00%|
218|      2134|    0.0412238|  1.93176e-05|  0.01%|                inp = crop(image, center, scale)
(call)|      2134|      2.17745|   0.00102036|  0.33%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/utils.py:98 crop
219|      2134|     0.205835|   9.6455e-05|  0.03%|                inp = torch.from_numpy(inp.transpose((2, 0, 1))).float()
220|         0|            0|            0|  0.00%|
221|      2134|     0.300562|  0.000140844|  0.05%|                inp = inp.to(self.device)
222|      2134|    0.0684037|  3.20542e-05|  0.01%|                inp.div_(255.0).unsqueeze_(0)
223|         0|            0|            0|  0.00%|
224|      2133|    0.0530906|  2.48901e-05|  0.01%|                out = self.face_alignment_net(inp)[-1].detach()
(call)|      2133|      242.676|     0.113772| 36.73%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/nn/modules/module.py:866 _call_impl
225|      2133|   0.00884414|  4.14634e-06|  0.00%|                if self.flip_input:
226|         0|            0|            0|  0.00%|                    out += flip(self.face_alignment_net(flip(inp))
227|         0|            0|            0|  0.00%|                                [-1].detach(), is_label=True)  # patched inp_batch undefined variable error
228|      2133|     0.351749|  0.000164908|  0.05%|                out = out.cpu()
229|      2133|    0.0427301|  2.00329e-05|  0.01%|                pts, pts_img = get_preds_fromhm(out, center, scale)
(call)|      2133|      29.1648|    0.0136731|  4.41%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/utils.py:138 get_preds_fromhm
230|         0|            0|            0|  0.00%|
231|         0|            0|            0|  0.00%|                # Added 3D landmark support
232|      2133|    0.0112429|  5.27092e-06|  0.00%|                if self.landmarks_type == LandmarksType._3D:
233|         0|            0|            0|  0.00%|                    pts, pts_img = pts.view(68, 2) * 4, pts_img.view(68, 2)
234|         0|            0|            0|  0.00%|                    heatmaps = np.zeros((68, 256, 256), dtype=np.float32)
235|         0|            0|            0|  0.00%|                    for i in range(68):
236|         0|            0|            0|  0.00%|                        if pts[i, 0] > 0:
237|         0|            0|            0|  0.00%|                            heatmaps[i] = draw_gaussian(
238|         0|            0|            0|  0.00%|                                heatmaps[i], pts[i], 2)
239|         0|            0|            0|  0.00%|                    heatmaps = torch.from_numpy(
240|         0|            0|            0|  0.00%|                        heatmaps).unsqueeze_(0)
241|         0|            0|            0|  0.00%|
242|         0|            0|            0|  0.00%|                    heatmaps = heatmaps.to(self.device)
243|         0|            0|            0|  0.00%|                    depth_pred = self.depth_prediciton_net(
244|         0|            0|            0|  0.00%|                        torch.cat((inp, heatmaps), 1)).data.cpu().view(68, 1)
245|         0|            0|            0|  0.00%|                    pts_img = torch.cat(
246|         0|            0|            0|  0.00%|                        (pts_img, depth_pred * (1.0 / (256.0 / (200.0 * scale)))), 1)
247|         0|            0|            0|  0.00%|                else:
248|      2133|    0.0328186|  1.53861e-05|  0.00%|                    pts, pts_img = pts.view(-1, 68, 2) * 4, pts_img.view(-1, 68, 2)
249|      2133|    0.0128355|  6.01758e-06|  0.00%|                landmark_set.append(pts_img.numpy())
250|      2130|   0.00785041|  3.68564e-06|  0.00%|            if 0 != len(landmark_set):
251|      2130|    0.0218399|  1.02535e-05|  0.00%|                landmark_set = np.concatenate(landmark_set, axis=0)
(call)|      2130|    0.0651903|  3.06058e-05|  0.01%|# <__array_function__ internals>_0:2 concatenate
252|      2130|   0.00722432|   3.3917e-06|  0.00%|            landmarks.append(landmark_set)
253|        74|  0.000219822|  2.97057e-06|  0.00%|        return landmarks
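The table above shows face_alignment_net being called once per face (2133 calls, 36.73% of the total time). One possible direction, sketched below, is to collect the crops first and run the network once per batch of crops. This is a NumPy-only illustration so that it runs without a GPU: run_in_batches and the model callable are hypothetical names, not part of face-alignment, and in the real code the stacking would presumably be done with torch tensors on self.device. Whether this helps in practice depends on the GPU and batch size.

```python
import numpy as np

def run_in_batches(crops, model, batch_size=32):
    """Call `model` once per batch of crops instead of once per crop.

    crops: list of (C, H, W) float arrays, all with the same shape.
    model: hypothetical callable mapping an (N, C, H, W) array to an
           array whose first dimension is N.
    """
    outputs = []
    for start in range(0, len(crops), batch_size):
        # Stack up to batch_size crops into a single (N, C, H, W) array.
        batch = np.stack(crops[start:start + batch_size])
        outputs.append(model(batch))
    if not outputs:
        return np.empty((0,))
    return np.concatenate(outputs, axis=0)

# Stand-in "network" for demonstration only: per-channel mean of each crop.
demo_crops = [np.random.rand(3, 8, 8) for _ in range(70)]
demo_out = run_in_batches(demo_crops, lambda b: b.mean(axis=(2, 3)))
print(demo_out.shape)  # one output row per crop
```

With 70 crops and a batch size of 32, the stand-in model is called three times instead of 70.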

Profiling results of the detect_from_batch function (batch_detect takes 52.19% of the total time):

Line #|      Hits|         Time| Time per hit|      %|Source code
46|        76|  0.000217915|   2.8673e-06|  0.00%|    def detect_from_batch(self, tensor):
47|        75|     0.223546|   0.00298061|  0.03%|        bboxlists = batch_detect(self.face_detector, tensor, device=self.device)
(call)|        75|      344.793|      4.59724| 52.19%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/detection/sfd/detect.py:30 batch_detect
48|         0|            0|            0|  0.00%|
49|        75|  0.000767469|  1.02329e-05|  0.00%|        new_bboxlists = []
50|      2221|   0.00572896|  2.57945e-06|  0.00%|        for i in range(bboxlists.shape[0]):
51|      2146|   0.00584149|  2.72204e-06|  0.00%|            bboxlist = bboxlists[i]
52|      2146|    0.0157394|  7.33432e-06|  0.00%|            bboxlist = self._filter_bboxes(bboxlist)
(call)|      2146|      1.80209|  0.000839742|  0.27%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/detection/sfd/sfd_detector.py:30 _filter_bboxes
53|      2146|   0.00558567|  2.60283e-06|  0.00%|            new_bboxlists.append(bboxlist)
54|         0|            0|            0|  0.00%|
55|        75|  0.000172138|  2.29518e-06|  0.00%|        return new_bboxlists

Profiling results of the batch_detect function (note that net(img_batch.float()) takes almost no time, 0.84%; the bottleneck is the post-processing):

Line #|      Hits|         Time| Time per hit|      %|Source code
30|        76|  0.000566721|  7.45685e-06|  0.00%|def batch_detect(net, img_batch, device):
31|         0|            0|            0|  0.00%|    """
32|         0|            0|            0|  0.00%|    Inputs:
33|         0|            0|            0|  0.00%|        - img_batch: a torch.Tensor of shape (Batch size, Channels, Height, Width)
34|         0|            0|            0|  0.00%|    """
35|         0|            0|            0|  0.00%|
36|        75|  0.000463724|  6.18299e-06|  0.00%|    if 'cuda' in device:
37|        75|   0.00108337|   1.4445e-05|  0.00%|        torch.backends.cudnn.benchmark = True
(call)|        75|   0.00174975|  2.33301e-05|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/backends/__init__.py:34 __set__
38|         0|            0|            0|  0.00%|
39|        75|   0.00083518|  1.11357e-05|  0.00%|    BB, CC, HH, WW = img_batch.size()
40|         0|            0|            0|  0.00%|
41|        75|   0.00114942|  1.53255e-05|  0.00%|    with torch.no_grad():
(call)|        75|   0.00131607|  1.75476e-05|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/autograd/grad_mode.py:114 __init__
(call)|        75|   0.00128579|  1.71439e-05|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/autograd/grad_mode.py:119 __enter__
42|        75|   0.00860119|  0.000114683|  0.00%|        olist = net(img_batch.float())  # patched uint8_t overflow error
(call)|        75|      5.53178|    0.0737571|  0.84%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/nn/modules/module.py:866 _call_impl
(call)|        75|   0.00190973|  2.54631e-05|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/autograd/grad_mode.py:123 __exit__
43|         0|            0|            0|  0.00%|
44|       525|   0.00243711|  4.64212e-06|  0.00%|    for i in range(len(olist) // 2):
45|       450|   0.00485539|  1.07898e-05|  0.00%|        olist[i * 2] = F.softmax(olist[i * 2], dim=1)
(call)|       450|     0.010896|  2.42133e-05|  0.00%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/nn/functional.py:1553 softmax
46|         0|            0|            0|  0.00%|
47|        75|  0.000318766|  4.25021e-06|  0.00%|    bboxlists = []
48|         0|            0|            0|  0.00%|
49|      1125|       1.8725|   0.00166444|  0.28%|    olist = [oelem.data.cpu() for oelem in olist]
(call)|        75|      1.87026|    0.0249368|  0.28%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/detection/sfd/detect.py:49 <listcomp>
50|         0|            0|            0|  0.00%|
51|      2221|   0.00759554|  3.41987e-06|  0.00%|    for j in range(BB):
52|      2146|   0.00711918|  3.31742e-06|  0.00%|        bboxlist = []
53|     15022|    0.0514858|  3.42736e-06|  0.01%|        for i in range(len(olist) // 2):
54|     12876|    0.0456545|  3.54571e-06|  0.01%|            ocls, oreg = olist[i * 2], olist[i * 2 + 1]
55|     12876|    0.0559468|  4.34505e-06|  0.01%|            FB, FC, FH, FW = ocls.size()  # feature map size
56|     12876|    0.0459166|  3.56606e-06|  0.01%|            stride = 2**(i + 2)    # 4,8,16,32,64,128
57|     12876|    0.0413544|  3.21174e-06|  0.01%|            anchor = stride * 4
58|     12876|     0.838079|  6.50885e-05|  0.13%|            poss = zip(*np.where(ocls[:, 1, :, :] > 0.05))
(call)|     12876|      1.25016|  9.70926e-05|  0.19%|# <__array_function__ internals>_2:2 where
59|   1415573|      5.17609|  3.65654e-06|  0.78%|            for Iindex, hindex, windex in poss:
60|   1402697|       13.566|  9.67136e-06|  2.05%|                axc, ayc = stride / 2 + windex * stride, stride / 2 + hindex * stride
61|   1402697|      11.5839|  8.25834e-06|  1.75%|                score = ocls[j, 1, hindex, windex]
62|   1402697|      19.2976|  1.37575e-05|  2.92%|                loc = oreg[j, :, hindex, windex].contiguous().view(1, 4)
63|   1402697|       10.743|  7.65881e-06|  1.63%|                priors = torch.Tensor([[axc / 1.0, ayc / 1.0, stride * 4 / 1.0, stride * 4 / 1.0]])
64|   1402697|      4.73283|   3.3741e-06|  0.72%|                variances = [0.1, 0.2]
65|   1402697|      12.8419|  9.15512e-06|  1.94%|                box = decode(loc, priors, variances)
(call)|   1402697|      108.447|  7.73133e-05| 16.41%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/face_alignment/detection/sfd/bbox.py:90 decode
66|   1402697|      21.5078|  1.53332e-05|  3.26%|                x1, y1, x2, y2 = box[0] * 1.0
(call)|   1402697|      22.9357|  1.63511e-05|  3.47%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/tensor.py:575 __iter__
67|   1402697|      6.08699|  4.33949e-06|  0.92%|                bboxlist.append([x1, y1, x2, y2, score])
68|         0|            0|            0|  0.00%|
69|      2146|   0.00730658|  3.40474e-06|  0.00%|        bboxlists.append(bboxlist)
70|         0|            0|            0|  0.00%|
71|        75|      41.9516|     0.559355|  6.35%|    bboxlists = np.array(bboxlists)
(call)|   7013485|      56.1279|  8.00286e-06|  8.50%|# /mnt3/xiao.wang/miniconda3/envs/deep3d/lib/python3.7/site-packages/torch/tensor.py:617 __array__
72|        75|    0.0020597|  2.74626e-05|  0.00%|    return bboxlists
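Most of the time in the table above is spent in the innermost loop, which runs about 1.4 million times and builds small tensors, calls decode, and unpacks the result for every candidate position; the final np.array(bboxlists) then converts each tensor element individually. A possible way to remove that per-element overhead is to decode all candidates of one image and one scale with array operations. The sketch below is not the library's code. It assumes that decode in bbox.py is the usual SSD-style box decoding (its body is not shown in this report), that each element of olist has been converted to a NumPy array (for example with .numpy()), and decode_scale is a hypothetical helper name. It also thresholds each image on its own scores, which may differ from the profiled loop, where the candidate positions come from np.where over the whole batch.

```python
import numpy as np

def decode_scale(ocls, oreg, j, stride, thresh=0.05, variances=(0.1, 0.2)):
    """Decode all candidate boxes of image j at one scale in one shot.

    ocls: (B, 2, H, W) softmax scores; oreg: (B, 4, H, W) box offsets.
    Returns an (N, 5) array with rows [x1, y1, x2, y2, score].
    """
    scores = ocls[j, 1]                           # (H, W) face scores
    hidx, widx = np.nonzero(scores > thresh)      # candidate positions
    # Anchor centres and size, as in the loop: stride / 2 + index * stride.
    cx = stride / 2 + widx * stride
    cy = stride / 2 + hidx * stride
    size = np.full(cx.shape, stride * 4.0)
    loc = oreg[j][:, hidx, widx].T                # (N, 4) offsets
    # Assumed SSD-style decoding: shift the centre, scale width and height.
    bx = cx + loc[:, 0] * variances[0] * size
    by = cy + loc[:, 1] * variances[0] * size
    bw = size * np.exp(loc[:, 2] * variances[1])
    bh = size * np.exp(loc[:, 3] * variances[1])
    x1, y1 = bx - bw / 2, by - bh / 2
    return np.stack([x1, y1, x1 + bw, y1 + bh, scores[hidx, widx]], axis=1)

# Demonstration on random data standing in for one detector scale.
rng = np.random.default_rng(0)
demo_cls = rng.random((2, 2, 5, 6))
demo_reg = rng.normal(size=(2, 4, 5, 6))
print(decode_scale(demo_cls, demo_reg, j=0, stride=8, thresh=0.5).shape)
```

In batch_detect this would be called once per image and per scale, with stride = 2 ** (i + 2), and the per-scale arrays concatenated with np.concatenate, so no Python-level work is done per candidate box.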

1adrianb / face-alignment

Is there any way to parallelize the post-processing code in `batch_detect`? #343
