zhongyi-zhou opened this issue 5 years ago
The result from --profile is:

```
det time: 0.058 | pose time: 0.08 | post processing: 0.0616
```
Hi, cropping every single person out of the image uses a lot of CPU. How many people are in each image in your case?
3 people. By "cropping", do you mean the person bbox detection part?
Also, I noticed that the first several iterations report 0 det time, which is a bit strange.
It's a bit weird. Are you running with --sp?
@Fang-Haoshu Yes. Otherwise, errors occur.
Are you running under Windows? If under Linux, what error do you get?
No, I am under Linux. The error is the same as discussed in this issue: https://github.com/MVIG-SJTU/AlphaPose/issues/101 I notice there is also one person, @GuoHaiYang123, in that issue who has the same problem as me.
I guess the latest pytorch branch has fixed it? Are you running with the latest code?
I can't read Chinese unfortunately, but I am experiencing the same issue. A 9700K (8 cores running at 100%) is bottlenecking a GTX 1070 (70% GPU utilization). I also have another setup where a weaker mobile CPU (4 cores running at 100%) is bottlenecking a 2070 eGPU (30% utilization). Windows in both cases.
This is the observed behaviour whether I use the webcam or the video processing script. With one human in the frame I get 13 fps on the first setup and 6 fps on the second.
Here is a visualization of the CPU usage, using snakeviz:

```
python -m cProfile -o temp.dat video_demo.py --conf 0.5 --nms 0.45 --inp_dim 480 --sp --video p1.mp4
```
and the profile figures:

```
det time: 0.018 | pose time: 0.06 | post processing: 0.0029
```
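(If you don't have snakeviz handy, the same temp.dat can be inspected with the standard-library pstats module — a minimal sketch:)

```python
# Minimal sketch: print the top CPU consumers from the cProfile dump above.
# "temp.dat" is the file written by `python -m cProfile -o temp.dat ...`.
import pstats

stats = pstats.Stats("temp.dat")
stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time
```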
Hi @hmexx, if this problem happens for video, I guess it's related to video decoding. Perhaps you need to install a video decoder; decoding can consume a lot of CPU.
Hi @Fang-Haoshu
This cannot be a video decoding issue. It happens both for video files and for the webcam, and the video is an uncompressed 480p file that can be decoded with less than 1% of the available CPU power. Also, the profile figures show that most of the time is spent in pose time.
Any other ideas?
Thanks
@Fang-Haoshu Could it be that AlphaPose is CPU-bound when there are few people in each frame (e.g. 1)? Are all your test runs with many people (e.g. the 4 mentioned in README.md)?
All our videos have 1 person.
Hi, both video and webcam use cv2.VideoCapture: https://github.com/MVIG-SJTU/AlphaPose/blob/pytorch/dataloader_webcam.py#L41 I think that's the one thing they share. AlphaPose should use less CPU with fewer people; it only uses more CPU resources when there are many people, for cropping the people out of the images.
On my laptop, it consumes few CPU resources when running the webcam demo. Thus I still suspect it's related to cv2.VideoCapture.
It's not VideoCapture. I've measured it and it takes very little CPU. Is your laptop Linux or Windows? How many people are in each frame you are testing with?
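(The decoding measurement can be done in isolation — a minimal sketch, assuming a local test file p1.mp4, the video used above:)

```python
# Minimal sketch: decode frames with cv2.VideoCapture and do nothing else,
# to see how much time/CPU decoding alone actually takes.
import time
import cv2

cap = cv2.VideoCapture("p1.mp4")  # assumption: the test video mentioned above
n, t0 = 0, time.time()
while True:
    ret, frame = cap.read()
    if not ret:
        break
    n += 1
cap.release()
elapsed = time.time() - t0
print(f"decoded {n} frames in {elapsed:.2f}s ({n / elapsed:.1f} fps)")
```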
We've managed to get a 3x fps increase by optimizing the cropBox method, where most of the CPU time is spent. I'm including it below; I've removed the warpAffine call. Feel free to reuse it, and do let me know if you think this alternative implementation will break pose estimation in certain cases. It seems to work OK for us.
Thanks
```python
import cv2
import numpy as np
import torch

# torch_to_im / im_to_torch are AlphaPose's CxHxW-tensor <-> HxWxC-array helpers

def cropBox_fast(img, ul, br, resH, resW):
    ul = ul.int()
    br = (br - 1).int()
    # target crop size, keeping the resH:resW aspect ratio
    lenH = max((br[1] - ul[1]).item(), (br[0] - ul[0]).item() * resH / resW)
    lenW = lenH * resW / resH
    if img.dim() == 2:
        img = img[np.newaxis, :]
    crop_img = img[:, ul[1]:br[1], ul[0]:br[0]]
    # pad the crop to the target aspect ratio instead of warping it
    pad_h = lenH - crop_img.shape[1]
    pad_w = lenW - crop_img.shape[2]
    pad_img = cv2.copyMakeBorder(torch_to_im(crop_img),
                                 int(pad_h / 2), int(pad_h / 2),
                                 int(pad_w / 2), int(pad_w / 2),
                                 cv2.BORDER_CONSTANT, value=(0, 0, 0))
    sized_img = cv2.resize(pad_img, (resW, resH), interpolation=cv2.INTER_NEAREST)
    return im_to_torch(torch.Tensor(sized_img))
```
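(To sanity-check the function, here's a rough micro-benchmark sketch. The torch_to_im/im_to_torch below are minimal stand-ins for AlphaPose's helpers, not the project's exact implementations, and 320x256 is an assumed input resolution:)

```python
# Micro-benchmark sketch for cropBox_fast, with stand-in conversion helpers.
import time
import torch

def torch_to_im(t):
    return t.permute(1, 2, 0).contiguous().numpy()  # CxHxW tensor -> HxWxC array

def im_to_torch(t):
    return t.permute(2, 0, 1).float()               # HxWxC tensor -> CxHxW tensor

img = torch.rand(3, 480, 640)         # fake 480p frame
ul = torch.tensor([100.0, 80.0])      # fake person box, upper-left (x, y)
br = torch.tensor([300.0, 400.0])     # fake person box, bottom-right (x, y)

t0 = time.time()
for _ in range(100):
    out = cropBox_fast(img, ul, br, 320, 256)  # 320x256: assumed inputResH/inputResW
print(f"{(time.time() - t0) * 10:.2f} ms/call, output shape {tuple(out.shape)}")
```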
@hmexx @Fang-Haoshu What about taking a random video from YouTube, testing together on that fixed video, and sharing the numbers here?
Oh I see. Thanks @hmexx! Good idea, Joey. https://drive.google.com/file/d/1CNXFEvB6X68eUNpIGo3F_tugB7gGyiCg/view?usp=sharing How about trying this video?
Oh wait, I found that it also consumes a lot of CPU on my side 😂 Sorry for my wrong statements before 😂 I did not notice because I was using a server with 56 CPU cores. Now it consumes about 1000% CPU.
I guess we will need to optimize the CPU usage after the CVPR deadline...
@Fang-Haoshu Good luck with CVPR! I'd be glad to help with this if you need. Let me know when you start the optimization.
Here's how to further speed up cropping. Another 2x speed-up:
```python
import numpy as np
import torch
import torch.nn.functional as F

from opt import opt  # AlphaPose options: opt.inputResH, opt.inputResW


def cropBox_fast(img, ul, br, resH, resW):
    ul = ul.int()
    br = (br - 1).int()
    lenH = max((br[1] - ul[1]).item(), (br[0] - ul[0]).item() * resH / resW)
    lenW = lenH * resW / resH
    if img.dim() == 2:
        img = img[np.newaxis, :]
    crop_img = img[:, ul[1]:br[1], ul[0]:br[0]]
    pad_h = lenH - crop_img.shape[1]
    pad_w = lenW - crop_img.shape[2]
    # pad and resize as tensor ops instead of cv2.copyMakeBorder/cv2.resize,
    # so the whole crop stays on the GPU
    pad_img = F.pad(crop_img,
                    (int(pad_w / 2), int(pad_w / 2), int(pad_h / 2), int(pad_h / 2)),
                    'constant', 0)
    sized_img = F.interpolate(pad_img.unsqueeze(0), size=(resH, resW)).squeeze(0)
    return sized_img


def crop_from_dets(img, boxes, inps, pt1, pt2):
    '''
    Crop humans from the original image according to detection results
    '''
    imght = img.size(1)
    imgwidth = img.size(2)
    tmp_img = img.cuda()
    # normalize channels in place
    tmp_img[0].add_(-0.406)
    tmp_img[1].add_(-0.457)
    tmp_img[2].add_(-0.480)
    for i, box in enumerate(boxes):
        upLeft = torch.Tensor((float(box[0]), float(box[1])))
        bottomRight = torch.Tensor((float(box[2]), float(box[3])))

        ht = bottomRight[1] - upLeft[1]
        width = bottomRight[0] - upLeft[0]

        # expand the box by scaleRate, clamped to the image bounds
        scaleRate = 0.3
        upLeft[0] = max(0, upLeft[0] - width * scaleRate / 2)
        upLeft[1] = max(0, upLeft[1] - ht * scaleRate / 2)
        bottomRight[0] = max(
            min(imgwidth - 1, bottomRight[0] + width * scaleRate / 2), upLeft[0] + 5)
        bottomRight[1] = max(
            min(imght - 1, bottomRight[1] + ht * scaleRate / 2), upLeft[1] + 5)

        try:
            c = tmp_img.clone()
            inps[i] = cropBox_fast(c, upLeft, bottomRight, opt.inputResH, opt.inputResW)
        except IndexError:
            print(tmp_img.shape)
            print(upLeft)
            print(bottomRight)
            print('===')
        pt1[i] = upLeft
        pt2[i] = bottomRight

    return inps, pt1, pt2
```
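(A rough way to time the tensor-only version — a sketch, assuming a CUDA device and the imports above. Synchronize around the loop for honest GPU timings:)

```python
# Timing sketch: cropBox_fast now runs entirely on the GPU, so synchronize
# to measure real execution time rather than kernel launch time.
import time
import torch

img = torch.rand(3, 480, 640, device="cuda")  # fake frame already on the GPU
ul = torch.tensor([100.0, 80.0])
br = torch.tensor([300.0, 400.0])

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    out = cropBox_fast(img, ul, br, 320, 256)
torch.cuda.synchronize()
print(f"{(time.time() - t0) * 10:.2f} ms/call, output shape {tuple(out.shape)}")
```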
There's also some code that speeds up the YOLO NMS process slightly if anyone is interested.
@hmexx Please post the speed-up code.
@hmexx Hi! Many thanks! We are now actively working on a new version of AlphaPose and would like to include the speed-up part. Would you mind sharing the code with us? Many thanks!
Hi there. Sorry, just saw this. Most of the speed-up is the code above. There's a tiny bit more in the NMS we changed. Do you want that bit?
@hmexx Hi, can you post the NMS change code? It would be very useful. Thanks.
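(The NMS change itself was never posted in this thread. One standard way to take NMS off the CPU — not hmexx's code, just a sketch using torchvision.ops.nms — looks like this:)

```python
# Sketch: GPU NMS via torchvision.ops.nms instead of a pure-Python loop.
# boxes are (x1, y1, x2, y2); names here are illustrative, not AlphaPose's API.
import torch
from torchvision.ops import nms

def gpu_nms(boxes, scores, iou_thresh=0.45):
    keep = nms(boxes, scores, iou_thresh)  # indices kept, in descending score order
    return boxes[keep], scores[keep]

# usage: detections that are already on the GPU stay there
boxes = torch.rand(100, 4, device="cuda") * 400
boxes[:, 2:] += boxes[:, :2]               # ensure x2 > x1 and y2 > y1
scores = torch.rand(100, device="cuda")
kept_boxes, kept_scores = gpu_nms(boxes, scores)
print(kept_boxes.shape, kept_scores.shape)
```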
Is there any update on moving the CPU load to the GPU? @Fang-Haoshu
I'm getting this issue too on demo_inference.py with --video (which I assume replaced video_demo.py). --detbatch and --posebatch don't help. --sp helps, but my OS grinds to a halt eventually on longer videos. I'm running 12 GB RAM and an 8-core CPU with an RTX 2070 Super.
Happy to help get this resolved if I can in any way. I'm new-ish to Python, but not to coding, so I'm sure I can contribute something of use. Great project.
Any update on this issue? When I run demo_inference.py it uses about 12% of the GPU (RTX 3090) and 100% of the CPU.
> Here's how to further speed up cropping. Another 2x speed-up: [quoting @hmexx's cropBox_fast / crop_from_dets code above]
@hmexx Can you please tell me in which files you changed those functions? I found something similar in transform.py, but the arguments are a bit different.
@samymdihi Hi dude. Sorry, it was so long ago that I have no idea. I've moved on to different projects.
Hi @hmexx, thanks for your answer. If you still have the project and could check where you changed it, I would be so grateful. It would help me a lot.
Hello guys, check out this C++ library: https://github.com/samylee/SamyleePoseApi-OpenLibrary
> Hi, both video and webcam use cv2.VideoCapture: https://github.com/MVIG-SJTU/AlphaPose/blob/pytorch/dataloader_webcam.py#L41 [...]
I am trying to find where AlphaPose gets the webcam in the pytorch version. I wonder where the master (default) version gets the webcam? I found that webcam_detector.py only runs once, and the other cv2.VideoCapture(0) calls are in the YOLO files.
Running video_demo.py uses 100% of my 10-core CPU but only about 50% of my 2080 Ti, and GPU usage usually fluctuates (20%~80%). I guess there is some heavy burden on the CPU, which dramatically slows down computation.
Any idea on how to fix this?