I noted in the following readme that multiple GPUs are for training only. Is that still the case?
--> That readme is from a third-party library (Caffe), and it refers only to that library.
OpenPose does use multiple GPUs, and it should run approximately twice as fast with 2 GPUs as with 1.
Are you using a remote server and sending the image over the Internet? That is normally the main cause of OpenPose slowing down in a way where adding more GPUs does not affect speed.
Yes, I'm using a remote server (a Google Compute Engine instance running Ubuntu), but I'm timing just the process itself by running openpose.bin on the command line, so the measurement does not include the round-trip request/response to the server. I haven't tried 3 GPUs yet, but I'm definitely not getting 2x speed from 2 GPUs.
Please check the time and the speedup between 1 and 2 GPUs when also using the flag --no_display.
I have been using --no_display. But here are 3 runs again: first without --num_gpu, and then with 1 and 2 specified:
Num gpus not specified
./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0
Starting pose estimation demo. Auto-detecting GPUs... Detected 2 GPU(s), using them all. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.138351 seconds.
1 GPU specified
./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0 --num_gpu 1
Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.469052 seconds.
2 GPUs specified
./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0 --num_gpu 2
Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.180299 seconds.
So there is only about a 0.3-second improvement with 2 GPUs, sometimes a little more or less.
Note again that there is only one image in the --image_dir. I've tried scaling it down to 640x480 (an 88 KB file) to see if that helps, but that doesn't seem to have a noticeable effect.
Thanks for your help, btw!
As an aside, I did a little searching on GPU performance on Google Compute Engine instances, and there seems to be some talk that speed there is noticeably slower. (See link below.) I want to compare with a local desktop next week, and possibly an AWS instance. (Having a cloud option is desirable for a number of reasons.)
But OpenPose tries to use all the GPU computation (since Caffe does), so I would say it is the server that is somehow limiting the computation. I do not use remote servers myself, but let me know if you find a solution so I can add it to the docs. Thanks!
Will do! I should know more this week.
Update: I'm now running openpose.bin on an AWS EC2 instance with 8 NVIDIA Tesla K80s. I do not see a performance gain from running more than one GPU, but overall it is running much faster on AWS than on GCP:
Single image. 1 GPU. No image rendering or display.
./build/examples/openpose/openpose.bin --image_dir images --write_images render --write_keypoint render -write_keypoint_format xml --keypoint_scale 3 --no_display --num_gpu 1
Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 2.823287 seconds.
2 GPUs
./build/examples/openpose/openpose.bin --image_dir images --write_keypoint render --write_keypoint_format xml --no_display --render_pose 0 --num_gpu 2
Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 3.188718 seconds.
3 GPUs
./build/examples/openpose/openpose.bin --image_dir images --write_keypoint render --write_keypoint_format xml --no_display --render_pose 0 --num_gpu 3
Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.052522 seconds.
Is there a significant startup time relative to the number of GPUs? Maybe this is offsetting overall performance? Or maybe a better test would be leveraging the core library rather than openpose.bin?
At any rate, for my current purposes, < 3 secs is not bad. Cheers!
There is a startup time per GPU, so more GPUs means a longer initial startup time.
But once it has started, n GPUs should give roughly an n-times speedup (at least up to 4 GPUs; I've never tried more than that, but I assume the speedup should stay linear up to 6-7 GPUs).
If I detect pose on a video or an image dir, how do I get the running time per frame? Another question: can I choose to show only certain keypoints of the body or hand? For example, I don't need to see keypoints like the knee, ankle, and hip. I find that --part_to_show can only choose a heatmap.
Hello, did you use a video to test the speed? I use one 1080 GPU and the speed is about 8 fps; will it be accelerated when I use 2 GPUs?
@guiyuliu I can only get the total time with --no_display; how do I get the FPS?
@guiyuliu Actually you are right, it is not possible to get the FPS without visualization unless you change the source code.
You can still do this: run a 10-image folder, then run a 110-image folder. Then:
1 / FPS = time per image = (T_110_images - T_10_images) / 100
This way you remove the time of opening and closing the GPUs. Repeat the operation 2-3 times; if the times change a lot between runs, your server is definitely running other things in the background, and that is the bottleneck.
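In code, that measurement could look something like this (a rough sketch; the folder names are illustrative, and the flags are the same ones used elsewhere in this thread):

import subprocess
import time

def time_run(image_dir):
    # Times one full openpose.bin run over a folder,
    # including the fixed GPU startup/shutdown cost.
    start = time.time()
    subprocess.run([
        "./build/examples/openpose/openpose.bin",
        "--image_dir", image_dir,
        "--no_display", "--render_pose", "0",
    ], check=True)
    return time.time() - start

# The 110-image time minus the 10-image time cancels the fixed
# startup/shutdown cost, leaving 100 images of pure processing.
t_10 = time_run("images_10")    # illustrative folder with 10 images
t_110 = time_run("images_110")  # illustrative folder with 110 images
time_per_image = (t_110 - t_10) / 100
print("FPS: {0:.2f}".format(1.0 / time_per_image))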
@gineshidalgo99 You mean that if I want to get 2 times the speed, besides using 2 GPUs, I should also change the source code? Doesn't this source code support multiple GPUs? BTW, I'm dealing with long videos, not images, so I don't care about the time of opening and closing the GPUs. Looking forward to your answer.
@guiyuliu No! I explained it badly. I meant that there is no code in the library to check the FPS without visualizing it. As an alternative, you can use the small math formulas I typed before.
The code does include GPU parallelization, so n GPUs should be around n times faster (at least up to 4-5 GPUs). I found that this is not the case on some Amazon servers, maybe because Amazon somehow limits the maximum GPU/CPU computation.
Given that it's an old topic, and the problem was most probably the server rather than OpenPose, I'll close this. Feel free to keep posting.
Yes, apologies for not replying recently! I was able to take the asynchronous example and create a background process that monitors a folder for new images and then processes them into an output folder. New images are cleared out once processed.
This allows the initial load to occur once and the rest is strictly OpenPose processing, which indeed is fast and seems to take advantage of multiple GPUs. I've tested this on an AWS EC2 instance and the results are good. Thank you for your help!
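The core of it is just a polling loop, roughly like this (a simplified sketch, not the actual code; the paths and the process_image helper are illustrative):

import os
import time

INPUT_DIR = "input"    # illustrative paths
OUTPUT_DIR = "output"

def process_image(src, out_dir):
    # Placeholder for the OpenPose call made through the asynchronous API;
    # the real version pushes the image into the already-running wrapper.
    raise NotImplementedError

# OpenPose is initialized once before this loop, so the per-GPU startup
# cost is paid a single time; everything after that is pure processing.
while True:
    for name in sorted(os.listdir(INPUT_DIR)):
        src = os.path.join(INPUT_DIR, name)
        process_image(src, OUTPUT_DIR)  # process one image to the output folder
        os.remove(src)                  # clear the image once processed
    time.sleep(0.1)                     # poll interval between folder scans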
Smart idea, thanks for the feedback!
@appleweed Any chance you could share the background process wrapper that you've written? I'm trying to prototype a similar wrapper.
@makeitraina Here ya go: https://github.com/appleweed/OpenPose-Background-Process
The README describes everything. Let me know if you have any questions.
@appleweed Is it possible to do that with Python? My question is, would the performance be the same?
I am trying to run OpenPose body keypoint detection through the Python API using its built version. The code runs without any error, but the FPS achieved using the Python code is low compared to the FPS achieved using the OpenPose demo executable. I am running both on the same video, the same OS (Windows 10), and the same GPU hardware (GTX 1080 Ti, 11 GB memory). FPS using the demo executable is ~24 FPS and using the Python code is ~17 FPS. Is this expected behaviour?
Please @gineshidalgo99 , reopen this thread and look at the comment of @aditya15081990 (https://github.com/CMU-Perceptual-Computing-Lab/openpose/issues/213#issuecomment-607717388). I have the same issue. Thank you!
@moncio Python will be slower if you use asynchronous input and feed in one image at a time, because it serialises the whole read / OpenPose pipeline / output-keypoints process for each image. Depending on your use case, you might be able to use synchronous input, which should be faster; that is either the Synchronous or the AsynchronousOut mode. I made a patch to enable AsynchronousOut here: https://github.com/CMU-Perceptual-Computing-Lab/openpose/pull/1593
Hello @frankier, first of all, thank you so much for your reply and explanation. I cloned your repo; do you have some code demo using this synchronous mode? I'd need it for [https://github.com/frankier/openpose/blob/python-api-async-out/examples/tutorial_api_python/05_keypoints_from_images_multi_gpu.py]; this is my use case. I want to compare the current solution from the OpenPose authors with yours.
Thank you!
Pass ThreadManagerMode.AsynchronousOut to WrapperPython and then pass the image_dir flag rather than passing in the images manually via datums. You could start from this: https://github.com/frankier/skelshop/blob/7f289605994ab6a10e41caf80030428cf6eebd6e/skeldump/openpose.py#L64 and then add in the multi-GPU flag. It should improve the FPS to the same as openpose.bin, but I haven't tested it with multiple GPUs, so please let me know either way; and if my PR helps, please let the OpenPose authors know in the PR discussion!
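A minimal version of that pattern would look something like this (an untested sketch; it assumes pyopenpose was built from the patched branch in the PR above, and the model and image paths are illustrative):

import pyopenpose as op

params = {
    "model_folder": "models/",       # adjust to your install
    "image_dir": "images/",          # OpenPose reads the images itself
    "num_gpu": op.get_gpu_number(),  # use every detected GPU
}

# AsynchronousOut: OpenPose owns the input side (reading images and
# scheduling them onto the GPUs); we only pop processed datums off the output.
op_wrapper = op.WrapperPython(op.ThreadManagerMode.AsynchronousOut)
op_wrapper.configure(params)
op_wrapper.start()

while True:
    datums = op.VectorDatum()
    if not op_wrapper.waitAndPop(datums):
        break  # the input directory has been exhausted
    print(datums[0].poseKeypoints)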
Sorry, but when I launch my script, it returns the following:
Error: Not available for this ThreadManagerMode.
Coming from: line 50, in start op_wrapper.waitAndEmplace(op.VectorDatum([datum]))
This is my script:
import time

import cv2
import pyopenpose as op  # imports restored; they were omitted from the snippet

class PoseEstimator:  # class header restored; the name is illustrative
    def __init__(self, source):
        self.source = source
        self.op_params = self.set_openpose_params()
        self.num_gpus = self.op_params["num_gpu"] \
            if "num_gpu" in self.op_params and \
               self.op_params["num_gpu"] != -1 \
            else op.get_gpu_number()

    def start(self):
        start = time.time()
        images_path = op.get_images_on_directory(self.source)
        # Starting OpenPose
        op_wrapper = op.WrapperPython(op.ThreadManagerMode.AsynchronousOut)
        op_wrapper.configure(self.op_params)
        op_wrapper.start()
        # Process and display images
        for image_base_id in range(0, len(images_path), self.num_gpus):
            images = []
            # Read and push images into the OpenPose wrapper
            for gpuId in range(0, self.num_gpus):
                image_id = image_base_id + gpuId
                if image_id < len(images_path):
                    frame = cv2.imread(images_path[image_base_id + gpuId])
                    datum = op.Datum()
                    images.append(frame)
                    datum.cvInputData = images[-1]
                    op_wrapper.waitAndEmplace(op.VectorDatum([datum]))
            # Retrieve processed results from the OpenPose wrapper
            for gpuId in range(0, self.num_gpus):
                image_id = image_base_id + gpuId
                if image_id < len(images_path):
                    datums = op.VectorDatum()
                    op_wrapper.waitAndPop(datums)
                    datum = datums[0]
                    print("Extracted pose keypoints of frame number: " + str(datum.id))
        print("***********************************************************************")
        print("Total time processing: {0:.2f}".format(time.time() - start))

    def set_openpose_params(self):
        params = dict()
        params["model_folder"] = "models_path"
        params["model_pose"] = "BODY_25"
        params["number_people_max"] = 4
        params["render_threshold"] = 0.5
        params["face"] = True
        params["image_dir"] = self.source
        return params
So, is it okay?
In this mode, OpenPose manages taking images from the dir and assigning them to GPUs by itself, so you don't need to place them in its queue, i.e. delete:
# Read and push images into the OpenPose wrapper
for gpuId in range(0, self.num_gpus):
    image_id = image_base_id + gpuId
    if image_id < len(images_path):
        frame = cv2.imread(images_path[image_base_id + gpuId])
        datum = op.Datum()
        images.append(frame)
        datum.cvInputData = images[-1]
        op_wrapper.waitAndEmplace(op.VectorDatum([datum]))
Yes, removing this part of the code, I observe that the FPS improves (~2 fps higher than the normal version) and the total processing time decreases (to more or less the same time as the demo). Thank you so much; I think this is a great advance for the community, and congrats to you for this amazing work. I really recommend that the authors (@gineshidalgo99) take a look at this thread and approve the pull request as a new feature of the library.
One minimal question, @frankier: how do you load the input image into the current Datum if you don't emplace the object into the vector?
OpenPose should be loading it itself using the image_dir parameter.
Yes, but what about the case where you want to process a video or a stream? I know we have the video and ip_camera params for those. But what if you need to do some simple preprocessing on the input images of the video or stream? How do you handle that?
Just add stuff to the params dict in the same format as you would pass to openpose.bin.
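For example (a sketch; the file name and stream URL are illustrative):

params = {
    "model_folder": "models/",
    "video": "input.mp4",             # same as --video for openpose.bin
    # or: "ip_camera": "<stream URL>" # same as --ip_camera
}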
Oh right. In that case you're out of luck as far as I understand. You can of course output a directory of images beforehand and then pass that in, but otherwise there's always the possibility of bottlenecking it at the input.
Issue summary
Executing ./build/examples/openpose/openpose.bin does not seem to take advantage of multiple GPUs, even though multiple GPUs are detected, as indicated by the output message: "Auto-detecting GPUs... Detected 2 GPU(s), using them all."
Regardless of whether one GPU or two GPUs are used, the processing time for a single image in the image_dir is the same: approximately 4.1-4.2 seconds.
I noted in the following readme that multiple GPUs are for training only. Is that still the case?
https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/3rdparty/caffe/docs/multigpu.md
Note: I'm using a Google Compute Engine instance with NVIDIA Tesla K80 GPUs.
Executed command (if any)
./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0
I've also tried specifying '--num_gpu 2' versus '--num_gpu 1'.
(I'm only retrieving the keypoint data and not generating an output image.)
OpenPose output (if any)
Starting pose estimation demo. Auto-detecting GPUs... Detected 2 GPU(s), using them all. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.150646 seconds.
Type of issue
Your system configuration
Operating system (lsb_release -a in Ubuntu): Ubuntu 16.04.2 LTS (xenial); no LSB modules available.
CUDA version (cat /usr/local/cuda/version.txt in most cases): CUDA Version 8.0.61
cuDNN version (cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2): 5.1.10
#define CUDNN_MAJOR 5
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 10
GPU model (nvidia-smi in Ubuntu): 2x Tesla K80, driver version 375.66:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
| N/A   29C    P8    28W / 149W |     15MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:05.0     Off |                    0 |
| N/A   29C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1976     G  /usr/lib/xorg/Xorg                              15MiB |
+-----------------------------------------------------------------------------+
Caffe version: Default from OpenPose.
OpenCV version: 2.4.9.1, installed with apt-get install libopencv-dev (Ubuntu)
Generation mode (only for Ubuntu): Makefile + Makefile.config (default, Ubuntu)
Compiler (gcc --version in Ubuntu): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609