zhongyi-zhou opened this issue 5 years ago
The result from --profile is:

```
det time: 0.058 | pose time: 0.08 | post processing: 0.0616
```
Hi, cropping every single person out of the image uses a lot of CPU. How many people are in each image in your case?
3 people. By "cropping", do you mean the person bbox detection part?
Also, I noticed that the first several iterations report 0 det time, which is a bit strange.
It's a bit weird. Are you running with --sp?
@Fang-Haoshu Yes. Otherwise, errors occur.
Are you running under Windows? If under Linux, what error do you get?
No, I am under Linux. The error is the same as discussed in this issue: https://github.com/MVIG-SJTU/AlphaPose/issues/101 I notice there is also one person, @GuoHaiYang123, in that issue who has the same problem as me.
I guess the latest pytorch branch has fixed it? Are you running with the latest code?
I can't read Chinese unfortunately, but I am experiencing the same issue. A 9700K (8 cores running at 100%) is bottlenecking a GTX 1070 (70% GPU utilization). I also have another setup where a weaker mobile CPU (4 cores running at 100%) is bottlenecking a 2070 eGPU (30% utilization). Windows in both cases.
This is the observed behaviour whether I use the webcam or the video processing script. With one human in the frame I get 13 fps on the first setup and 6 fps on the second.
Here is a visualization of the CPU usage, using snakeviz:

```
python -m cProfile -o temp.dat video_demo.py --conf 0.5 --nms 0.45 --inp_dim 480 --sp --video p1.mp4
```
and the profile figures:

```
det time: 0.018 | pose time: 0.06 | post processing: 0.0029
```
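(If you don't have snakeviz handy, the same temp.dat can be inspected with the standard-library pstats module — a minimal sketch:)

```python
# Minimal sketch: print the top CPU consumers from the cProfile dump above.
# "temp.dat" is the file written by `python -m cProfile -o temp.dat ...`.
import pstats

stats = pstats.Stats("temp.dat")
stats.sort_stats("cumulative").print_stats(20)  # top 20 functions by cumulative time
```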
Hi @hmexx, if this problem happens for video, I guess it's related to video decoding. Perhaps you need to install a video decoder; decoding can consume a lot of CPU.
Hi @Fang-Haoshu
This cannot be a video decoding issue. It happens both for video files and for the webcam, and the video is an uncompressed 480p file that can be decoded with less than 1% of the available CPU power. Also, the profile figures show that most of the time is spent in pose time.
Any other ideas?
Thanks
@Fang-Haoshu Could it be that AlphaPose is CPU-bound when there are few people in each frame (e.g. 1)? Are all your test runs with many people (e.g. the 4 mentioned in README.md)?
All our videos have 1 person.
Hi, both video and webcam use cv2.VideoCapture: https://github.com/MVIG-SJTU/AlphaPose/blob/pytorch/dataloader_webcam.py#L41 I think that's the one thing they share. AlphaPose should use less CPU with fewer people; it only uses more CPU resources when there are many people, for cropping the people out of the images.
On my laptop, it consumes few CPU resources when running the webcam demo. Thus I still suspect it's related to cv2.VideoCapture.
It's not VideoCapture. I've measured it and it takes very little CPU. Is your laptop Linux or Windows? How many people are in each frame you are testing with?
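(The decoding measurement can be done in isolation — a minimal sketch, assuming a local test file p1.mp4, the video used above:)

```python
# Minimal sketch: decode frames with cv2.VideoCapture and do nothing else,
# to see how much time/CPU decoding alone actually takes.
import time
import cv2

cap = cv2.VideoCapture("p1.mp4")  # assumption: the test video mentioned above
n, t0 = 0, time.time()
while True:
    ret, frame = cap.read()
    if not ret:
        break
    n += 1
cap.release()
elapsed = time.time() - t0
print(f"decoded {n} frames in {elapsed:.2f}s ({n / elapsed:.1f} fps)")
```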
We've managed to get a 3x fps increase by optimizing the cropBox method, where most of the CPU time is spent. I'm including it below; I've removed the warpAffine call. Feel free to reuse it, and do let me know if you think this alternative implementation will break pose estimation in certain cases. It seems to work OK for us.
Thanks
```python
import cv2
import numpy as np
import torch

# torch_to_im / im_to_torch are AlphaPose's CxHxW-tensor <-> HxWxC-array helpers

def cropBox_fast(img, ul, br, resH, resW):
    ul = ul.int()
    br = (br - 1).int()
    # target crop size, keeping the resH:resW aspect ratio
    lenH = max((br[1] - ul[1]).item(), (br[0] - ul[0]).item() * resH / resW)
    lenW = lenH * resW / resH
    if img.dim() == 2:
        img = img[np.newaxis, :]
    crop_img = img[:, ul[1]:br[1], ul[0]:br[0]]
    # pad the crop to the target aspect ratio instead of warping it
    pad_h = lenH - crop_img.shape[1]
    pad_w = lenW - crop_img.shape[2]
    pad_img = cv2.copyMakeBorder(torch_to_im(crop_img),
                                 int(pad_h / 2), int(pad_h / 2),
                                 int(pad_w / 2), int(pad_w / 2),
                                 cv2.BORDER_CONSTANT, value=(0, 0, 0))
    sized_img = cv2.resize(pad_img, (resW, resH), interpolation=cv2.INTER_NEAREST)
    return im_to_torch(torch.Tensor(sized_img))
```
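(To sanity-check the function, here's a rough micro-benchmark sketch. The torch_to_im/im_to_torch below are minimal stand-ins for AlphaPose's helpers, not the project's exact implementations, and 320x256 is an assumed input resolution:)

```python
# Micro-benchmark sketch for cropBox_fast, with stand-in conversion helpers.
import time
import torch

def torch_to_im(t):
    return t.permute(1, 2, 0).contiguous().numpy()  # CxHxW tensor -> HxWxC array

def im_to_torch(t):
    return t.permute(2, 0, 1).float()               # HxWxC tensor -> CxHxW tensor

img = torch.rand(3, 480, 640)         # fake 480p frame
ul = torch.tensor([100.0, 80.0])      # fake person box, upper-left (x, y)
br = torch.tensor([300.0, 400.0])     # fake person box, bottom-right (x, y)

t0 = time.time()
for _ in range(100):
    out = cropBox_fast(img, ul, br, 320, 256)  # 320x256: assumed inputResH/inputResW
print(f"{(time.time() - t0) * 10:.2f} ms/call, output shape {tuple(out.shape)}")
```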
@hmexx @Fang-Haoshu What about taking a random video from YouTube, testing together on that fixed video, and sharing the numbers here?
Oh I see. Thanks @hmexx! Good idea, Joey. https://drive.google.com/file/d/1CNXFEvB6X68eUNpIGo3F_tugB7gGyiCg/view?usp=sharing How about trying this video?
Oh wait, I found that it also consumes a lot of CPU on my side 😂 Sorry for my wrong statements before 😂 I did not notice because I was using a server with 56 CPU cores. Now it consumes about 1000% CPU.
I guess we will need to optimize the CPU usage after the CVPR deadline...
@Fang-Haoshu Good luck with CVPR! I'd be glad to help with this if you need. Let me know when you start the optimization.
Here's how to further speed up cropping. Another 2x speed-up:
```python
import numpy as np
import torch
import torch.nn.functional as F

from opt import opt  # AlphaPose options: opt.inputResH, opt.inputResW


def cropBox_fast(img, ul, br, resH, resW):
    ul = ul.int()
    br = (br - 1).int()
    lenH = max((br[1] - ul[1]).item(), (br[0] - ul[0]).item() * resH / resW)
    lenW = lenH * resW / resH
    if img.dim() == 2:
        img = img[np.newaxis, :]
    crop_img = img[:, ul[1]:br[1], ul[0]:br[0]]
    pad_h = lenH - crop_img.shape[1]
    pad_w = lenW - crop_img.shape[2]
    # pad and resize as tensor ops instead of cv2.copyMakeBorder/cv2.resize,
    # so the whole crop stays on the GPU
    pad_img = F.pad(crop_img,
                    (int(pad_w / 2), int(pad_w / 2), int(pad_h / 2), int(pad_h / 2)),
                    'constant', 0)
    sized_img = F.interpolate(pad_img.unsqueeze(0), size=(resH, resW)).squeeze(0)
    return sized_img


def crop_from_dets(img, boxes, inps, pt1, pt2):
    '''
    Crop humans from the original image according to detection results
    '''
    imght = img.size(1)
    imgwidth = img.size(2)
    tmp_img = img.cuda()
    # normalize channels in place
    tmp_img[0].add_(-0.406)
    tmp_img[1].add_(-0.457)
    tmp_img[2].add_(-0.480)
    for i, box in enumerate(boxes):
        upLeft = torch.Tensor((float(box[0]), float(box[1])))
        bottomRight = torch.Tensor((float(box[2]), float(box[3])))

        ht = bottomRight[1] - upLeft[1]
        width = bottomRight[0] - upLeft[0]

        # expand the box by scaleRate, clamped to the image bounds
        scaleRate = 0.3
        upLeft[0] = max(0, upLeft[0] - width * scaleRate / 2)
        upLeft[1] = max(0, upLeft[1] - ht * scaleRate / 2)
        bottomRight[0] = max(
            min(imgwidth - 1, bottomRight[0] + width * scaleRate / 2), upLeft[0] + 5)
        bottomRight[1] = max(
            min(imght - 1, bottomRight[1] + ht * scaleRate / 2), upLeft[1] + 5)

        try:
            c = tmp_img.clone()
            inps[i] = cropBox_fast(c, upLeft, bottomRight, opt.inputResH, opt.inputResW)
        except IndexError:
            print(tmp_img.shape)
            print(upLeft)
            print(bottomRight)
            print('===')
        pt1[i] = upLeft
        pt2[i] = bottomRight

    return inps, pt1, pt2
```
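(A rough way to time the tensor-only version — a sketch, assuming a CUDA device and the imports above. Synchronize around the loop for honest GPU timings:)

```python
# Timing sketch: cropBox_fast now runs entirely on the GPU, so synchronize
# to measure real execution time rather than kernel launch time.
import time
import torch

img = torch.rand(3, 480, 640, device="cuda")  # fake frame already on the GPU
ul = torch.tensor([100.0, 80.0])
br = torch.tensor([300.0, 400.0])

torch.cuda.synchronize()
t0 = time.time()
for _ in range(100):
    out = cropBox_fast(img, ul, br, 320, 256)
torch.cuda.synchronize()
print(f"{(time.time() - t0) * 10:.2f} ms/call, output shape {tuple(out.shape)}")
```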
There's also some code that speeds up the YOLO NMS process slightly if anyone is interested.
@hmexx Please post the speed-up code.
@hmexx Hi! Many thanks! We are now actively working on a new version of AlphaPose and would like to include the speed-up part. Would you mind sharing the code with us? Many thanks!
Hi there. Sorry, just saw this. Most of the speed-up is the code above. There's a tiny bit more in the NMS we changed. Do you want that bit?
@hmexx Hi, can you post the NMS change code? It would be very useful. Thanks.
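(The NMS change itself was never posted in this thread. One standard way to take NMS off the CPU — not hmexx's code, just a sketch using torchvision.ops.nms — looks like this:)

```python
# Sketch: GPU NMS via torchvision.ops.nms instead of a pure-Python loop.
# boxes are (x1, y1, x2, y2); names here are illustrative, not AlphaPose's API.
import torch
from torchvision.ops import nms

def gpu_nms(boxes, scores, iou_thresh=0.45):
    keep = nms(boxes, scores, iou_thresh)  # indices kept, in descending score order
    return boxes[keep], scores[keep]

# usage: detections that are already on the GPU stay there
boxes = torch.rand(100, 4, device="cuda") * 400
boxes[:, 2:] += boxes[:, :2]               # ensure x2 > x1 and y2 > y1
scores = torch.rand(100, device="cuda")
kept_boxes, kept_scores = gpu_nms(boxes, scores)
print(kept_boxes.shape, kept_scores.shape)
```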
Is there any update on moving the CPU load to the GPU? @Fang-Haoshu
I'm getting this issue too on demo_inference.py with --video (which I assume replaced video_demo.py). --detbatch and --posebatch don't help. --sp helps, but my OS grinds to a halt eventually on longer videos. I'm running 12 GB RAM and an 8-core CPU with an RTX 2070 Super.
Happy to help get this resolved if I can in any way. I'm new-ish to Python, but not to coding, so I'm sure I can contribute something of use. Great project.
Any update on this issue? When I run demo_inference.py it uses about 12% of the GPU (RTX 3090) and 100% of the CPU.
> Here's how to further speed up cropping. Another 2x speed-up: [quoting @hmexx's cropBox_fast / crop_from_dets code above]
@hmexx Can you please tell me in which files you changed those functions? I found something similar in transform.py, but the arguments are a bit different.
@samymdihi Hi dude. Sorry, it was so long ago that I have no idea. I've moved on to different projects.
Hi @hmexx, thanks for your answer. If you still have the project and could check where you changed it, I would be so grateful. It would help me a lot.
Hello guys, check out this C++ library: https://github.com/samylee/SamyleePoseApi-OpenLibrary
> Hi, both video and webcam use cv2.VideoCapture: https://github.com/MVIG-SJTU/AlphaPose/blob/pytorch/dataloader_webcam.py#L41 [...]
I am trying to find where AlphaPose gets the webcam in the pytorch version. I wonder where the master (default) version gets the webcam? I found that webcam_detector.py only runs once, and the other cv2.VideoCapture(0) calls are in the YOLO files.
Running video_demo.py uses 100% of my 10-core CPU but only about 50% of my 2080 Ti, and GPU usage usually fluctuates (20%~80%). I guess there is some heavy burden on the CPU, which dramatically slows down computation.
Any idea on how to fix this?