NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Why is it so slow? #54

Open jgyy4775 opened 4 years ago

jgyy4775 commented 4 years ago

I tried this, but it was too slow. Is it not detecting objects in real time? My computer has the same performance as yours. Did I do something wrong? (And I used a Kinect.)

TontonTremblay commented 4 years ago

You can expect about 10-12 frames a second, so no, it is not real time! There are ways to make it faster, but it won't get to real time.

jgyy4775 commented 4 years ago

Not real time?? How was the video uploaded to YouTube made? Did you use a pre-recorded video as input to the model and output the results? Not real time?...

TontonTremblay commented 4 years ago

You are correct, it was pre-processed!

huckl3b3rry87 commented 4 years ago

@mintar || @TontonTremblay || @sbirchfield So, the package can process 10-12 frames a second to produce 10-12 pose estimates a second, i.e., ~10 Hz? That seems decent. However, it is running much slower than this for me: using only the soup can, I am getting about 1.5 Hz, and the bottleneck appears to be the PnP algorithm:

[INFO] [1579205878.868495]: network time elapsed = 0.0372869968414 
[INFO] [1579205879.379344]: PnP algorithm time elapsed = 0.54815196991 

Q1: Is this the expected bottleneck? Or any thoughts as to why PnP is choking so badly for me?

Q2: On another front, there is a buffering issue for me. Looking at this video, it seems that my DOPE is running much, much slower, with about a 10 second lag.

To address this, I was trying to only process the latest information, so I set queue_size=1 in the callback that processes each image. However, it seems that there is still a queue: the lag is still around 10 seconds, and it is processing a bunch of intermediate frames, which I would think should be discarded. Any ideas why this does not drop old frames?


huckl3b3rry87 commented 4 years ago

Ok, so I fixed the buffering issue (Q2) by adjusting the buff_size:

 rospy.Subscriber(rospy.get_param('~topic_camera'), ImageSensor_msg, self.callback, queue_size = 1, buff_size = 52428800)
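(If I understand the rospy internals correctly, buff_size is the number of bytes the subscriber reads from its socket at a time; when it is smaller than a few incoming images, old frames pile up in the OS socket buffer where queue_size=1 never sees them, so the callback keeps draining stale data. Raising it to 50 MB lets the subscriber skip ahead to the newest frame.)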

So, I just have Q1 left

TontonTremblay commented 4 years ago

I just watched your video and you're definitely getting lower performance than mine. Please check this video I just made (I am running on a new object): https://drive.google.com/file/d/1vO9pd737WsWD1gAQUGW8Eu8Xf-H6LZox/view?usp=sharing

I am running my setup on a 2080 Ti. Could you share the snippet of code you use to check the performance? PnP should be very quick; I am not sure why you are getting that sort of behaviour.

Thank you for playing around with DOPE. :P

huckl3b3rry87 commented 4 years ago

Sorry, that was not my video; I copied it from another issue. Mine is much, much worse. I am actually back to seeing ~4 second delays now that I increased the downscale_image parameter back to 500. So my Q2 remains.

Wow!! That looks great! Thank you for uploading that for me!!

This is the code where I check performance:

class ObjectDetector(object):
    '''This class contains methods for object detection'''

    @staticmethod
    def detect_object_in_image(net_model, pnp_solver, in_img, config):
        '''Detect objects in an image using a specific trained network model'''
        to = rospy.get_time()

        if in_img is None:
            return []

        # Run network inference
        image_tensor = transform(in_img)
        image_torch = Variable(image_tensor).cuda().unsqueeze(0)
        out, seg = net_model(image_torch)
        vertex2 = out[-1][0]
        aff = seg[-1][0]
        rospy.loginfo("network time elapsed = %s ", rospy.get_time() - to)

        # Find objects from network output
        detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config)
        # note: this is also measured from `to`, so it includes the network time above
        rospy.loginfo("PnP algorithm time elapsed = %s ", rospy.get_time() - to)

        return detected_objects

Of course, thank you for developing it in such an elegant and user-friendly way! I am excited to see that DOPE has a lot of potential once I figure out my troubles!

huckl3b3rry87 commented 4 years ago

Also, setting the queue_size for the publisher does not seem to change anything

TontonTremblay commented 4 years ago

Your second log has a lot of things going on in there: this is where the keypoints are found and then where PnP is computed. Could you print vertex2.shape and image_torch.shape? I am not sure what is going on on your end.

huckl3b3rry87 commented 4 years ago

Thank you for the quick reply!

Loading DOPE model '/root/visuotactile_pose_estimation_ws/src/pose_estimator/weights/soup_18.pth'...
    Model loaded in 8.31219816208 seconds.
Loading DOPE model '/root/visuotactile_pose_estimation_ws/src/pose_estimator/weights/mustard_15.pth'...
    Model loaded in 3.87688589096 seconds.
Running DOPE...
Listening to camera topic: '/kinect2_1/hd/image_color_rect')
Listening to camera info topic: '/dope/webcam/camera_info')
Ctrl-C to stop
[INFO] [1579291163.855070]: DELAY = 0 
[INFO] [1579291163.929532]: network time elapsed = 0.0613830089569 
[INFO] [1579291164.912217]: PnP algorithm time elapsed = 1.04412794113 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291164.962395]: network time elapsed = 0.0487589836121 
[INFO] [1579291165.950094]: PnP algorithm time elapsed = 1.03644800186 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291165.969721]: DELAY = 2 
[INFO] [1579291166.023808]: network time elapsed = 0.0455369949341 
[INFO] [1579291167.015221]: PnP algorithm time elapsed = 1.03698301315 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291167.063486]: network time elapsed = 0.0469698905945 
[INFO] [1579291168.055643]: PnP algorithm time elapsed = 1.03913998604 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291168.073561]: DELAY = 5 
[INFO] [1579291168.126777]: network time elapsed = 0.0451209545135 
[INFO] [1579291169.129803]: PnP algorithm time elapsed = 1.04817199707 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))

where DELAY is the difference between the image timestamp and the current time, which I compute in a modified version of image_callback called callback, where I hardcode the camera parameters:

    def callback(self, image_msg):
        # for debugging
        now = rospy.get_rostime()
        rospy.loginfo("DELAY = %s ", now.secs - image_msg.header.stamp.secs )

which is related to Q1
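(One caveat with the DELAY computation above: subtracting only the integer secs fields truncates the lag to whole seconds. A finer-grained sketch of the same measurement:)

        # Time - Time gives a rospy Duration; to_sec() keeps sub-second resolution
        delay = (now - image_msg.header.stamp).to_sec()
        rospy.loginfo("DELAY = %.3f s", delay)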

huckl3b3rry87 commented 4 years ago

As you suggested, the PnP algorithm is not the bottleneck; it is the find_objects function:

[INFO] [1579291824.851107]: find_object_poses time elapsed = 0.989943027496 
[INFO] [1579291824.852893]: solve_pnp time elapsed = 0.00126791000366 

where the time is calculated as:

    @staticmethod
    def find_object_poses(vertex2, aff, pnp_solver, config):
        '''Detect objects given network output'''
        to = rospy.get_time()

        # Detect objects from belief maps and affinities
        objects, all_peaks = ObjectDetector.find_objects(vertex2, aff, config)
        rospy.loginfo("find_object_poses time elapsed = %s ", rospy.get_time() - to)

        detected_objects = []
        obj_name = pnp_solver.object_name

        to = rospy.get_time()
        for obj in objects:
            # Run PnP
            points = obj[1] + [(obj[0][0]*8, obj[0][1]*8)]
            cuboid2d = np.copy(points)
            location, quaternion, projected_points = pnp_solver.solve_pnp(points)
            # note: `to` is not reset per object, so this accumulates across detections
            rospy.loginfo("solve_pnp time elapsed = %s ", rospy.get_time() - to)

            # Save results
            detected_objects.append({
                'name': obj_name,
                'location': location,
                'quaternion': quaternion,
                'cuboid2d': cuboid2d,
                'projected_points': projected_points,
                'score': obj[-1],
            })

        return detected_objects

TontonTremblay commented 4 years ago

Can you try with an image size of 400? Have you tried running the provided weights? I think the problem comes from the fact that your weights are not well trained, so in one image there are a lot of points that the algorithm finds, and this causes the heuristic to take a while. Can you visualize the belief maps? I am not sure I provided code for that, so here: can you pass the vertex2 map to this function?

import cv2
import torch
from torch.autograd import Variable

def ShowBelief(tensor, name='tmp'):
    # note: `m` is an upscaling function defined elsewhere
    # (see the follow-up below about commenting it out)
    for j in range(tensor.size()[0]):

        belief = tensor[j].clone()
        belief -= float(torch.min(belief)[0].data.cpu().numpy())
        belief /= float(torch.max(belief)[0].data.cpu().numpy())

        belief = m(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data
        belief = torch.clamp(belief, 0, 1)
        belief = torch.cat([belief.unsqueeze(0), belief.unsqueeze(0), belief.unsqueeze(0)]).unsqueeze(0)

        temp = Variable(belief.clone())
        cv2.imwrite('belief_{}_v{}.png'.format(str(name), str(j)),
                    temp.data.squeeze().cpu().numpy().transpose(1, 2, 0) * 255)

This will create a bunch of PNGs that should look like this:

[image: example belief maps]

huckl3b3rry87 commented 4 years ago

> Have you tried running the provided weights?

> Can you try with image size of 400?

Thanks!

TontonTremblay commented 4 years ago

Comment out the m function; it is an upscale function.
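(As a sketch, that just means skipping the upscale step:)

    # belief = m(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data  # upscale disabled
    belief = belief.data

The clamp/cat/imwrite lines below it work on the un-upscaled map as-is.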

huckl3b3rry87 commented 4 years ago

[images: belief maps]


However, when I scale down according to the width, it is still slow. I think maybe the edges of the image should be cropped, since they have that fisheye-type view.

[example image]

TontonTremblay commented 4 years ago

The image has to be downscaled so that the smallest side is around 400; I don't remember where this happens. In ('image_torch.shape = ', (1, 3, 500, 888)), the 500 should be 400. I am not sure why you are getting this slowness; I am running out of ideas. The belief maps look very good to me, so the algorithm should be on par with what I have.

TontonTremblay commented 4 years ago

Also, in the image you shared there is a lot of fisheye effect. Is that normal for a Kinect?

huckl3b3rry87 commented 4 years ago

Yes, I think the fisheye effect is normal on wide-angle cameras like a GoPro.

Thanks for your help, I will try to downscale the image

huckl3b3rry87 commented 4 years ago

Yes, the downscale_height parameter downscales the image:

('image_torch.shape = ', (1, 3, 400, 711))

which again helps Q2, but makes the estimates for the can inaccurate

However, the algorithm is still slow:

[INFO] [1579302527.075795]: find_object_poses time elapsed = 0.67188000679 

TontonTremblay commented 4 years ago

I think you will have to go down the rabbit hole at this point and find where it is taking the longest. Which GPU are you running it on?

huckl3b3rry87 commented 4 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.82                 Driver Version: 396.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P1000        Off  | 00000000:02:00.0  On |                  N/A |
| 79%   82C    P0    N/A /  N/A |   3425MiB /  4031MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 00000000:03:00.0 Off |                  N/A |
| 30%   38C    P8    13W / 250W |     12MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

But it seems to be the Python code that is choking:

[image: profiling info]

TontonTremblay commented 4 years ago

Yeah, once you start looking into the details... I am sorry about this; I wish I could help more.

huckl3b3rry87 commented 4 years ago

No worries! You have helped me a bunch today and I greatly appreciate it!

I am going to try to profile the find_objects function now. Do you mind checking how long find_objects takes for you?
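(For reference, a minimal profiling sketch, reusing the names from the snippets above and wrapping the existing call inside find_object_poses:)

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    objects, all_peaks = ObjectDetector.find_objects(vertex2, aff, config)
    profiler.disable()
    # print the 15 most expensive calls by cumulative time
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)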

huckl3b3rry87 commented 4 years ago

I have narrowed down the bottleneck further: it seems that the first time the cpu() function is run, it takes significant time to cache the function

[INFO] [1579306446.700360]: tl = 0.920029163361 
[INFO] [1579306446.702060]: tl = 8.70227813721e-05 
[INFO] [1579306446.703345]: tl = 5.79357147217e-05 
[INFO] [1579306446.704589]: tl = 6.8187713623e-05 
[INFO] [1579306446.705921]: tl = 6.41345977783e-05 
[INFO] [1579306446.707309]: tl = 7.48634338379e-05 
[INFO] [1579306446.708583]: tl = 5.29289245605e-05 
[INFO] [1579306446.709839]: tl = 5.19752502441e-05 
[INFO] [1579306446.711160]: tl = 6.89029693604e-05 

where the profiling code is:

        tl = rospy.get_time()
        map_ori = belief.cpu().data.numpy()
        rospy.loginfo("tl = %s ", rospy.get_time() - tl)

This issue is discussed here. After throwing in

torch.cuda.synchronize() # wait for mm to finish

above the for loop, the first call is no longer stuck waiting on the GPU to finish calculating:

[INFO] [1579307137.226423]: tl = 0.000190019607544 
[INFO] [1579307137.228368]: tl = 8.58306884766e-05 
[INFO] [1579307137.229756]: tl = 6.41345977783e-05 
[INFO] [1579307137.231103]: tl = 5.29289245605e-05 
[INFO] [1579307137.232483]: tl = 5.41210174561e-05 
[INFO] [1579307137.233843]: tl = 5.41210174561e-05 
[INFO] [1579307137.235158]: tl = 5.3882598877e-05 
[INFO] [1579307137.236510]: tl = 8.32080841064e-05 
[INFO] [1579307137.238504]: tl = 0.000121116638184 

So, even though the CNN is not done calculating, Python continues to run the code until the point where it needs to use the result. Thus, I think the bottleneck is the CNN.

Does this make sense?

huckl3b3rry87 commented 4 years ago

This can be seen if the following code is used:

    @staticmethod
    def detect_object_in_image(net_model, pnp_solver, in_img, config):
        '''Detect objects in an image using a specific trained network model'''
        to = rospy.get_time()

        if in_img is None:
            return []

        # Run network inference
        image_tensor = transform(in_img)
        image_torch = Variable(image_tensor).cuda().unsqueeze(0)
        out, seg = net_model(image_torch)
        vertex2 = out[-1][0]
        aff = seg[-1][0]
        torch.cuda.synchronize() # wait for mm to finish
        rospy.loginfo("network time elapsed = %s ", rospy.get_time() - to)

        # Find objects from network output
        detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config)

        return detected_objects

which produces

[INFO] [1579308425.065107]: network time elapsed = 1.07668304443 
[INFO] [1579308425.074873]: find_objects time elapsed = 0.00894594192505 
[INFO] [1579308425.076703]: solve_pnp time elapsed = 0.00133299827576

huckl3b3rry87 commented 4 years ago

It seems that the code was only running on my GPU 0, which has a run time of

[INFO] [1579309331.635676]: network time elapsed = 0.941685914993 

after adding

import os
os.environ['CUDA_VISIBLE_DEVICES']='1'

to detector.py

I can cut the time in half

[INFO] [1579309151.256915]: network time elapsed = 0.521842956543 

This also mitigates Q2 but does NOT resolve it; I am still seeing up to a 1-2 second difference between the current time and the images that I am processing.

huckl3b3rry87 commented 4 years ago

Not sure why, but when I put

os.environ['CUDA_VISIBLE_DEVICES']='0,1'

it gives me the slow time; I would assume it would use both GPUs to attain a faster speed.
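(My best guess at why, with a sketch: with both devices visible, PyTorch still places tensors on cuda:0 by default, which here is the slower Quadro P1000, and a single forward pass never spreads across GPUs on its own.)

    import torch

    # With CUDA_VISIBLE_DEVICES='0,1', plain .cuda() calls land on cuda:0
    # (the Quadro P1000 in this setup). Select the faster device explicitly:
    torch.cuda.set_device(1)   # subsequent .cuda() calls use GPU 1

    # Using both GPUs would need something like torch.nn.DataParallel,
    # which splits a batch across devices (no help at batch size 1).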

huckl3b3rry87 commented 4 years ago

I think I got it: if I crop down my image

        img = img[0:1080, 700:1220]

both issues are largely mitigated and the network runs at about 4 Hz:

[INFO] [1579311237.426187]: network time elapsed = 0.23420214653 
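(Putting the pieces together, a sketch of the preprocessing that got me here; the crop bounds are specific to my camera mounting:)

    import cv2

    def preprocess(img, shortest_side=400):
        # Crop away the distorted borders so the CNN sees fewer pixels;
        # inference time grows roughly with height * width.
        img = img[0:1080, 700:1220]
        # Downscale so the smallest side is ~400, as suggested above.
        scale = shortest_side / float(min(img.shape[:2]))
        if scale < 1.0:
            img = cv2.resize(img, None, fx=scale, fy=scale)
        return img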

mintar commented 4 years ago

I think you've got it figured out, but I'd still like to provide a few clarifying comments. Here is what I think is going on:

This leaves you with the following options:

By the way, since the Kinect2 has such a large field of view, the effect you are seeing is exaggerated (since a large field of view means that objects appear smaller in the image).

And since you mentioned it earlier: The camera_info.yaml is only used if you use the simplified dope_webcam node as a camera driver. DOPE itself correctly uses the proper camera_info topic instead.