CMU-Perceptual-Computing-Lab / openpose

OpenPose: Real-time multi-person keypoint detection library for body, face, hands, and foot estimation
https://cmu-perceptual-computing-lab.github.io/openpose

openpose.bin does not seem to take advantage of multiple GPU's #213

Closed appleweed closed 7 years ago

appleweed commented 7 years ago

Issue summary

Executing ./build/examples/openpose/openpose.bin does not seem to take advantage of multiple GPUs, even though multiple GPUs are detected, as indicated by the output message: "Auto-detecting GPUs... Detected 2 GPU(s), using them all."

Regardless of whether one or two GPUs are used, the processing time for a single image in the image_dir is the same: approximately 4.1-4.2 seconds.

I noted in the following readme that multiple GPUs are for training only. Is that still the case?

Currently Multi-GPU is only supported via the C/C++ paths and only for training.

https://github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/3rdparty/caffe/docs/multigpu.md

Note: I'm using a Google Compute Engine instance with NVIDIA Tesla K80 GPUs.

Executed command (if any)

./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0

I've also tried specifying '--num_gpu 2' versus '--num_gpu 1'.

(I'm only retrieving the keypoint data and not generating an output image.)

OpenPose output (if any)

Starting pose estimation demo. Auto-detecting GPUs... Detected 2 GPU(s), using them all. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.150646 seconds.

Type of issue

Your system configuration

Operating system (lsb_release -a in Ubuntu):

    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 16.04.2 LTS
    Release:        16.04
    Codename:       xenial

CUDA version (cat /usr/local/cuda/version.txt in most cases): CUDA Version 8.0.61

cuDNN version (cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2):

    #define CUDNN_MAJOR 5
    #define CUDNN_MINOR 1
    #define CUDNN_PATCHLEVEL 10

GPU model (nvidia-smi in Ubuntu):

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000:00:04.0     Off |                    0 |
    | N/A   29C    P8    28W / 149W |     15MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla K80           Off  | 0000:00:05.0     Off |                    0 |
    | N/A   29C    P8    28W / 149W |      0MiB / 11439MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID  Type  Process name                               Usage      |
    |=============================================================================|
    |    0      1976    G   /usr/lib/xorg/Xorg                              15MiB |
    +-----------------------------------------------------------------------------+

Caffe version: Default from OpenPose.

OpenCV version: 2.4.9.1, installed with apt-get install libopencv-dev (Ubuntu)

Generation mode (only for Ubuntu): Makefile + Makefile.config (default, Ubuntu)

Compiler (gcc --version in Ubuntu): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609

gineshidalgo99 commented 7 years ago

"I noted in the following readme that multiple GPUs are for training only. Is that still the case?" --> That note is from a 3rd-party library (Caffe), and it refers to that library, not to OpenPose.

OpenPose does use multiple GPUs, and with 2 GPUs the speed should be approximately twice that with 1.

Are you using a remote server and trying to send the image over the Internet? This is normally the main cause of OpenPose slowdown in cases where using >1 GPUs does not affect speed.

appleweed commented 7 years ago

Yes, I'm using a remote server (a Google Compute Engine instance running Ubuntu), but I'm timing just the process itself by running openpose.bin on the command line, so not including the round-trip request/response to the server. I haven't tried 3 GPUs yet, but I'm definitely not getting 2x speed from 2 GPUs.

gineshidalgo99 commented 7 years ago

Please check the time and speed-up between 1 and 2 GPUs when also using the --no_display flag.

appleweed commented 7 years ago

I have been using --no_display. But here are 3 runs again: first without --num_gpu, and then with 1 and 2 specified:

Num GPUs not specified: ./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0

Starting pose estimation demo. Auto-detecting GPUs... Detected 2 GPU(s), using them all. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.138351 seconds.

1 GPU specified ./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0 --num_gpu 1

Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.469052 seconds.

2 GPUs specified ./build/examples/openpose/openpose.bin --image_dir images --write_keypoint /var/www/html/images -write_keypoint_format xml --keypoint_scale 3 --no_display --render_pose 0 --num_gpu 2

Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.180299 seconds.

So, there is about a 0.3-second speed increase with 2 GPUs, sometimes a little more or less.

Note again there is only one image in the --image_dir. I've tried scaling it down to 640x480 (an 88 KB file) to see if that helps, but that doesn't seem to have a noticeable effect.

Thanks for your help, btw!

appleweed commented 7 years ago

As an aside, I did a little searching on GPU performance on Google Compute instances and there seems to be some talk that speed is noticeably slower. (See link below.) I want to compare to a local desktop next week and possibly an AWS instance. (Having a cloud option is desirable for a number of reasons.)

https://stackoverflow.com/questions/44804982/google-compute-engine-tesla-k80-has-additional-htod-an-dtoh-ops-and-a-way-lower

gineshidalgo99 commented 7 years ago

But OpenPose tries to use all the GPU computation (since Caffe does), so I would say it is the server that is somehow limiting the computation. I do not use remote servers, but let me know if you find a solution so I can add it to the docs. Thanks!

appleweed commented 7 years ago

Will do! I should know more this week.

appleweed commented 7 years ago

Update: I'm now running openpose.bin on an AWS EC2 instance with 8 NVIDIA Tesla K80s. I do not see a performance gain from running more than one GPU, but overall it is running much faster on AWS than on GCP:

Single image. 1 GPU. No image rendering or display. ./build/examples/openpose/openpose.bin --image_dir images --write_images render --write_keypoint render -write_keypoint_format xml --keypoint_scale 3 --no_display --num_gpu 1

Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 2.823287 seconds.

2 GPUs ./build/examples/openpose/openpose.bin --image_dir images --write_keypoint render --write_keypoint_format xml --no_display --render_pose 0 --num_gpu 2

Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 3.188718 seconds.

3 GPUs ./build/examples/openpose/openpose.bin --image_dir images --write_keypoint render --write_keypoint_format xml --no_display --render_pose 0 --num_gpu 3

Starting pose estimation demo. Starting thread(s) Real-time pose estimation demo successfully finished. Total time: 4.052522 seconds.

Is there a significant startup time relative to the number of GPUs? Maybe this is offsetting overall performance? Or maybe a better test would be to leverage the core library rather than openpose.bin?

At any rate, for my current purposes, < 3 secs is not bad. Cheers!

gineshidalgo99 commented 7 years ago

There is a startup time per GPU, so more GPUs means a longer initial startup.

But after startup, n GPUs should give an n-times speed-up (at least up to 4 GPUs; I've never tried more than that, but I assume the speed-up stays linear up to 6-7 GPUs).
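A toy timing model illustrates why a single-image run can even get slower with more GPUs (the constants below are illustrative assumptions, not measurements from this thread):

```python
# Toy model: total run time = per-GPU startup cost + inference work
# split evenly across GPUs. The constants are illustrative assumptions.
def total_time(n_gpus, startup_per_gpu=1.5, work_seconds=2.0):
    return startup_per_gpu * n_gpus + work_seconds / n_gpus

# For a single image, startup grows linearly with the GPU count while
# the (tiny) workload shrinks, so more GPUs can mean a longer total run:
for n in (1, 2, 3):
    print(n, "GPU(s):", round(total_time(n), 2), "s")
# 1 GPU(s): 3.5 s / 2 GPU(s): 4.0 s / 3 GPU(s): 5.17 s
```

With a long video or a large image folder, the work term dominates and the near-linear speed-up reappears.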

liaowang0125 commented 7 years ago

If I detect poses on a video or an image directory, how do I get the running time per frame? Another question: can I choose to show only certain keypoints on the body or hand? For example, I don't need to see keypoints like knee, ankle, or hip. I find --part_to_show can only choose a heatmap.

guiyuliu commented 7 years ago

Hello, did you use a video to test the speed? I use one 1080 GPU and the speed is about 8 fps; will it be accelerated when I use 2 GPUs?

liaowang0125 commented 7 years ago

@guiyuliu I can only get the total time with --no_display; how do I get the FPS?

gineshidalgo99 commented 7 years ago

@guiyuliu Actually you are right, it is not possible to get the FPS without visualization unless you change the source code.

You can still do this:

- Run a 10-image folder
- Run a 110-image folder
- 1 / FPS = time / image = (time_110-images - time_10-images) / 100

(This removes the time of opening and closing the GPUs. Repeat the operation 2-3 times; if the times change a lot between runs, your server is definitely running other things in the background, and that is the bottleneck.)
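That measurement recipe can be written as a small helper (a sketch of the formula above; the function name and example timings are hypothetical):

```python
def estimate_fps(time_small, time_large, n_small=10, n_large=110):
    """Estimate steady-state FPS from two timed folder runs.

    Subtracting the small-folder time cancels the fixed cost of
    opening and closing the GPUs, leaving only the per-image time.
    """
    seconds_per_image = (time_large - time_small) / (n_large - n_small)
    return 1.0 / seconds_per_image

# Hypothetical timings: 10 images in 4.0 s, 110 images in 14.0 s
# -> (14.0 - 4.0) / 100 = 0.1 s/image -> 10 FPS
print(estimate_fps(4.0, 14.0))
```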

guiyuliu commented 7 years ago

@gineshidalgo99 You mean that if I want to get 2x the speed, besides using 2 GPUs, I should also change the source code? Does the source code not support multiple GPUs? BTW, I'm dealing with long videos, not images, so I don't need to consider the time of opening and closing GPUs. Looking forward to your answer.

gineshidalgo99 commented 7 years ago

@guiyuliu No! I explained it wrongly. I meant that there is no code in the library to check the FPS without visualizing it. As an alternative, you can use the small math formulas I typed before.

The code includes the GPU parallelization, so n GPUs should be around n times faster (at least up to 4-5 GPUs). I found that this is not the case on some Amazon servers, maybe because Amazon somehow limits the maximum GPU/CPU computation.

gineshidalgo99 commented 7 years ago

Given that it's an old topic and the problem was most probably the server, not OpenPose, I'll close this. Feel free to post.

appleweed commented 7 years ago

Yes, apologies for not replying recently! I was able to take the asynchronous example and create a background process that monitors a folder for new images and processes them into an output folder. New images are cleared out once processed.

This allows the initial load to occur once and the rest is strictly OpenPose processing, which indeed is fast and seems to take advantage of multiple GPUs. I've tested this on an AWS EC2 instance and the results are good. Thank you for your help!

gineshidalgo99 commented 7 years ago

Smart idea, thanks for the feedback!

makeitraina commented 6 years ago

@appleweed any chance you could share the background process wrapper that you've written? I'm trying to prototype a similar wrapper.

appleweed commented 6 years ago

@makeitraina Here ya go: https://github.com/appleweed/OpenPose-Background-Process

The README describes everything. Let me know if you have any questions.

moncio commented 4 years ago

@appleweed is it possible to do that with Python? My question is: would the performance be the same?

aditya15081990 commented 4 years ago

I am trying to run OpenPose body keypoint detection through the Python API using its built version. The code runs without any error, but the FPS achieved with the Python code is low compared to the FPS achieved with the OpenPose demo executable. I am running both on the same video, the same OS (Windows 10), and the same GPU hardware (GTX 1080 Ti, 11 GB memory). FPS using the demo executable is ~24 FPS; using the Python code it is ~17 FPS. Is this expected behaviour?

moncio commented 4 years ago

Please @gineshidalgo99 , reopen this thread and look at the comment of @aditya15081990 (https://github.com/CMU-Perceptual-Computing-Lab/openpose/issues/213#issuecomment-607717388). I have the same issue. Thank you!

frankier commented 4 years ago

@moncio Python will be slower if you use asynchronous mode and feed in one image at a time, because it serialises the whole read / OpenPose pipeline / output-keypoints process for each image. Depending on your use case, you might be able to use synchronous input, which should be faster: either Synchronous or AsynchronousOut. I made a patch to enable AsynchronousOut here: https://github.com/CMU-Perceptual-Computing-Lab/openpose/pull/1593

moncio commented 4 years ago

Hello @frankier, first of all thank you so much for your reply and explanation. I cloned your repo; do you have a code demo using this Synchronous mode? I'd need it for [https://github.com/frankier/openpose/blob/python-api-async-out/examples/tutorial_api_python/05_keypoints_from_images_multi_gpu.py], which is my use case. I want to compare the current solution by the OpenPose authors with yours.

Thank you!

frankier commented 4 years ago

Pass ThreadManagerMode.AsynchronousOut to WrapperPython and then pass the image_dir flag rather than passing in the images manually via datums. You could start from this https://github.com/frankier/skelshop/blob/7f289605994ab6a10e41caf80030428cf6eebd6e/skeldump/openpose.py#L64 and then add the multi-GPU flag. It should improve the FPS to match openpose.bin, but I haven't tested it with multiple GPUs, so please let me know either way; and if my PR helps, please let the OpenPose authors know in the PR discussion!

moncio commented 4 years ago

Sorry but, when I launch my script, it returns the following:

Error: Not available for this ThreadManagerMode.

Coming from: line 50, in start op_wrapper.waitAndEmplace(op.VectorDatum([datum]))

This is my script:

import time

import cv2
import pyopenpose as op


class PoseEstimator:  # illustrative class name; the post showed only the methods

    def __init__(self, source):
        self.source = source

        self.op_params = self.set_openpose_params()
        self.num_gpus = self.op_params["num_gpu"] \
            if "num_gpu" in self.op_params and \
               self.op_params["num_gpu"] != -1 \
            else op.get_gpu_number()

    def start(self):
        start = time.time()

        images_path = op.get_images_on_directory(self.source)

        # Starting OpenPose
        op_wrapper = op.WrapperPython(op.ThreadManagerMode.AsynchronousOut)
        op_wrapper.configure(self.op_params)
        op_wrapper.start()

        # Process and display images
        for image_base_id in range(0, len(images_path), self.num_gpus):
            images = []

            # Read and push images into the OpenPose wrapper
            for gpuId in range(0, self.num_gpus):
                image_id = image_base_id + gpuId

                if image_id < len(images_path):
                    frame = cv2.imread(images_path[image_id])
                    datum = op.Datum()
                    images.append(frame)
                    datum.cvInputData = images[-1]
                    op_wrapper.waitAndEmplace(op.VectorDatum([datum]))

            # Retrieve processed results from the OpenPose wrapper
            for gpuId in range(0, self.num_gpus):
                image_id = image_base_id + gpuId

                if image_id < len(images_path):
                    datums = op.VectorDatum()
                    op_wrapper.waitAndPop(datums)
                    datum = datums[0]

                    print("Extracted pose keypoints of frame number: " + str(datum.id))

        print("***********************************************************************")
        print("Total time processing: {0:.2f}".format(time.time() - start))

    def set_openpose_params(self):
        params = dict()
        params["model_folder"] = "models_path"
        params["model_pose"] = "BODY_25"
        params["number_people_max"] = 4
        params["render_threshold"] = 0.5
        params["face"] = True
        params["image_dir"] = self.source
        return params

So, is this okay?

frankier commented 4 years ago

In this mode, OpenPose manages taking images from the dir and assigning them to GPUs itself. So you don't need to place them in its queue. i.e. delete:

    # Read and push images into the OpenPose wrapper
    for gpuId in range(0, self.num_gpus):
        image_id = image_base_id + gpuId

        if image_id < len(images_path):
            frame = cv2.imread(images_path[image_id])
            datum = op.Datum()
            images.append(frame)
            datum.cvInputData = images[-1]
            op_wrapper.waitAndEmplace(op.VectorDatum([datum]))

moncio commented 4 years ago

Yes, removing this part of the code, I observe the FPS improves (~2 fps higher than the normal version) and the total processing time decreases (more or less the same time as the demo). Thank you so much; I think this is a great advance for the community, and congrats on this amazing work. I really recommend the authors (@gineshidalgo99) take a look at this thread and approve the pull request as a new feature of the library.

moncio commented 4 years ago

One minimal question @frankier: how do you load the input image into the current Datum if you don't emplace the object into the vector?

frankier commented 4 years ago

OpenPose should be loading it itself using the image_dir parameter.

moncio commented 4 years ago

Yes, but what about the case where you want to process a video or a stream? I know we have the video and ip_camera params for those. But what if you need to do some simple preprocessing on the input image of the video or stream? How do you handle that?

frankier commented 4 years ago

Just add entries to the params dict in the same format as the flags you would pass to openpose.bin.
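For instance (a hypothetical illustration; the flag names below are the same ones used with openpose.bin earlier in this thread):

```python
# CLI:  ./openpose.bin --video input.mp4 --num_gpu 2 --render_pose 0
# The same flags become dict entries passed to WrapperPython.configure():
params = dict()
params["video"] = "input.mp4"   # process a video instead of an image_dir
params["num_gpu"] = 2
params["render_pose"] = 0

# op_wrapper = op.WrapperPython(op.ThreadManagerMode.AsynchronousOut)
# op_wrapper.configure(params)
# op_wrapper.start()
```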

frankier commented 4 years ago

Oh right. In that case you're out of luck, as far as I understand. You can of course output a directory of images beforehand and then pass that in, but otherwise there's always the possibility of a bottleneck at the input.