NVlabs / Deep_Object_Pose

Deep Object Pose Estimation (DOPE) – ROS inference (CoRL 2018)

Why is it so slow? #54

Open jgyy4775 opened 4 years ago

jgyy4775 commented 4 years ago

I tried this, but it was too slow. Is it not detecting objects in real time? My computer has the same performance as yours. Did I do something wrong? (And I used a Kinect.)

TontonTremblay commented 4 years ago

You can expect about 10-12 frames a second, so no, it is not real time! There are ways to make it faster, but it won't get to real time.

jgyy4775 commented 4 years ago

Not real time?? How was the video uploaded to YouTube made? Did you use a pre-recorded video as input to the model and output the results? Not real time?...

TontonTremblay commented 4 years ago

You are correct, it was pre-processed!

huckl3b3rry87 commented 4 years ago

@mintar || @TontonTremblay || @sbirchfield So, the package can process 10-12 frames a second to produce 10-12 pose estimates a second, i.e., ~10 Hz? That seems decent. However, it is running much slower than this for me: using only the soup can, I am getting about 1.5 Hz, and the bottleneck appears to be the PnP algorithm:

[INFO] [1579205878.868495]: network time elapsed = 0.0372869968414 
[INFO] [1579205879.379344]: PnP algorithm time elapsed = 0.54815196991 

Q1: Is this the expected bottleneck? Or any thoughts as to why PnP is choking so badly for me?

Q2: On another front, there is a buffering issue for me. Looking at this video, it seems that my DOPE is running much, much slower, with about a 10 second lag.

To address this, I was trying to only process the latest information, so I set queue_size=1 in the callback that processes each image. However, it seems that there is still a queue: the lag is still around 10 seconds, and it is processing a bunch of intermediate frames, which I would think should be discarded. Any ideas why this does not drop old frames?


huckl3b3rry87 commented 4 years ago

Ok, so I fixed the buffering issue (Q2) by adjusting the buff_size:

 rospy.Subscriber(rospy.get_param('~topic_camera'), ImageSensor_msg, self.callback, queue_size = 1, buff_size = 52428800)
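(If I understand the rospy internals correctly, buff_size is the number of bytes the subscriber reads from its socket at a time; when it is smaller than a few incoming images, old frames pile up in the OS socket buffer where queue_size=1 never sees them, so the callback keeps draining stale data. Raising it to 50 MB lets the subscriber skip ahead to the newest frame.)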

So, I just have Q1 left

TontonTremblay commented 4 years ago

I just watched your video and you're definitely getting lower performance than mine. Please check this video I just made (I am running on a new object): https://drive.google.com/file/d/1vO9pd737WsWD1gAQUGW8Eu8Xf-H6LZox/view?usp=sharing

I am running my setup on a 2080 Ti. Could you share the snippet of code you use to check the performance? PnP should be very quick; I am not sure why you are getting that sort of behaviour.

Thank you for playing around with DOPE. :P

huckl3b3rry87 commented 4 years ago

Sorry, that was not my video; I copied it from another issue. Mine is much, much worse. I am actually back to seeing ~4 second delays now that I increased the downscale_image parameter back to 500. So my Q2 remains.

Wow!! That looks great! Thank you for uploading that for me!!

This is the code where I check performance:

class ObjectDetector(object):
    '''This class contains methods for object detection'''

    @staticmethod
    def detect_object_in_image(net_model, pnp_solver, in_img, config):
        '''Detect objects in an image using a specific trained network model'''
        to = rospy.get_time()

        if in_img is None:
            return []

        # Run network inference
        image_tensor = transform(in_img)
        image_torch = Variable(image_tensor).cuda().unsqueeze(0)
        out, seg = net_model(image_torch)
        vertex2 = out[-1][0]
        aff = seg[-1][0]
        rospy.loginfo("network time elapsed = %s ", rospy.get_time() - to)

        # Find objects from network output
        detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config)
        # note: this is also measured from `to`, so it includes the network time above
        rospy.loginfo("PnP algorithm time elapsed = %s ", rospy.get_time() - to)

        return detected_objects

Of course, thank you for developing it in such an elegant and user-friendly way! I am excited to see that DOPE has a lot of potential once I figure out my troubles!

huckl3b3rry87 commented 4 years ago

Also, setting the queue_size for the publisher does not seem to change anything

TontonTremblay commented 4 years ago

Your second log has a lot of things going on in there: this is where the keypoints are found and then where PnP is computed. Could you print vertex2.shape and image_torch.shape? I am not sure what is going on on your end.

huckl3b3rry87 commented 4 years ago

Thank you for the quick reply!

Loading DOPE model '/root/visuotactile_pose_estimation_ws/src/pose_estimator/weights/soup_18.pth'...
    Model loaded in 8.31219816208 seconds.
Loading DOPE model '/root/visuotactile_pose_estimation_ws/src/pose_estimator/weights/mustard_15.pth'...
    Model loaded in 3.87688589096 seconds.
Running DOPE...
Listening to camera topic: '/kinect2_1/hd/image_color_rect')
Listening to camera info topic: '/dope/webcam/camera_info')
Ctrl-C to stop
[INFO] [1579291163.855070]: DELAY = 0 
[INFO] [1579291163.929532]: network time elapsed = 0.0613830089569 
[INFO] [1579291164.912217]: PnP algorithm time elapsed = 1.04412794113 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291164.962395]: network time elapsed = 0.0487589836121 
[INFO] [1579291165.950094]: PnP algorithm time elapsed = 1.03644800186 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291165.969721]: DELAY = 2 
[INFO] [1579291166.023808]: network time elapsed = 0.0455369949341 
[INFO] [1579291167.015221]: PnP algorithm time elapsed = 1.03698301315 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291167.063486]: network time elapsed = 0.0469698905945 
[INFO] [1579291168.055643]: PnP algorithm time elapsed = 1.03913998604 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291168.073561]: DELAY = 5 
[INFO] [1579291168.126777]: network time elapsed = 0.0451209545135 
[INFO] [1579291169.129803]: PnP algorithm time elapsed = 1.04817199707 
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))

where DELAY is the difference between the image timestamp and the current time, which I compute in a modified version of image_callback called callback, where I hardcode the camera parameters:

    def callback(self, image_msg):
        # for debugging
        now = rospy.get_rostime()
        rospy.loginfo("DELAY = %s ", now.secs - image_msg.header.stamp.secs )

which is related to Q1
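(One caveat with the DELAY computation above: subtracting only the integer secs fields truncates the lag to whole seconds. A finer-grained sketch of the same measurement:)

        # Time - Time gives a rospy Duration; to_sec() keeps sub-second resolution
        delay = (now - image_msg.header.stamp).to_sec()
        rospy.loginfo("DELAY = %.3f s", delay)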

huckl3b3rry87 commented 4 years ago

As you suggested, the PnP algorithm is not the bottleneck; it is the find_objects function:

[INFO] [1579291824.851107]: find_object_poses time elapsed = 0.989943027496 
[INFO] [1579291824.852893]: solve_pnp time elapsed = 0.00126791000366 

where the time is calculated as:

    @staticmethod
    def find_object_poses(vertex2, aff, pnp_solver, config):
        '''Detect objects given network output'''
        to = rospy.get_time()

        # Detect objects from belief maps and affinities
        objects, all_peaks = ObjectDetector.find_objects(vertex2, aff, config)
        rospy.loginfo("find_object_poses time elapsed = %s ", rospy.get_time() - to)

        detected_objects = []
        obj_name = pnp_solver.object_name

        to = rospy.get_time()
        for obj in objects:
            # Run PnP
            points = obj[1] + [(obj[0][0]*8, obj[0][1]*8)]
            cuboid2d = np.copy(points)
            location, quaternion, projected_points = pnp_solver.solve_pnp(points)
            # note: `to` is not reset per object, so this accumulates across detections
            rospy.loginfo("solve_pnp time elapsed = %s ", rospy.get_time() - to)

            # Save results
            detected_objects.append({
                'name': obj_name,
                'location': location,
                'quaternion': quaternion,
                'cuboid2d': cuboid2d,
                'projected_points': projected_points,
                'score': obj[-1],
            })

        return detected_objects

TontonTremblay commented 4 years ago

Can you try with an image size of 400? Have you tried running the provided weights? I think the problem comes from the fact that your weights are not well trained, so in one image there are a lot of points that the algorithm finds, and this causes the heuristic to take a while. Can you visualize the belief maps? I am not sure I provided code for that, so here: can you pass the vertex2 map to this function?

import cv2
import torch
from torch.autograd import Variable

def ShowBelief(tensor, name='tmp'):
    # note: `m` is an upscaling function defined elsewhere
    # (see the follow-up below about commenting it out)
    for j in range(tensor.size()[0]):

        belief = tensor[j].clone()
        belief -= float(torch.min(belief)[0].data.cpu().numpy())
        belief /= float(torch.max(belief)[0].data.cpu().numpy())

        belief = m(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data
        belief = torch.clamp(belief, 0, 1)
        belief = torch.cat([belief.unsqueeze(0), belief.unsqueeze(0), belief.unsqueeze(0)]).unsqueeze(0)

        temp = Variable(belief.clone())
        cv2.imwrite('belief_{}_v{}.png'.format(str(name), str(j)),
                    temp.data.squeeze().cpu().numpy().transpose(1, 2, 0) * 255)

This will create a bunch of PNGs that should look like this:

[image: example belief maps]

huckl3b3rry87 commented 4 years ago

> Have you tried running the provided weights?

> Can you try with image size of 400?

Thanks!

TontonTremblay commented 4 years ago

Comment out the m function; it is an upscale function.
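(As a sketch, that just means skipping the upscale step:)

    # belief = m(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data  # upscale disabled
    belief = belief.data

The clamp/cat/imwrite lines below it work on the un-upscaled map as-is.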

huckl3b3rry87 commented 4 years ago

[images: belief maps]


However, when I scale down according to the width, it is still slow. I think maybe the edges of the image should be cropped, since they have that fisheye-type view.

[example image]

TontonTremblay commented 4 years ago

The image has to be downscaled so that the smallest side is around 400; I don't remember where this happens. In ('image_torch.shape = ', (1, 3, 500, 888)), the 500 should be 400. I am not sure why you are getting this slowness; I am running out of ideas. The belief maps look very good to me, so the algorithm should be on par with what I have.

TontonTremblay commented 4 years ago

Also, in the image you shared there is a lot of fisheye effect. Is that normal for a Kinect?

huckl3b3rry87 commented 4 years ago

Yes, I think the fisheye effect is normal on wide-angle cameras like a GoPro.

Thanks for your help, I will try to downscale the image

huckl3b3rry87 commented 4 years ago

Yes, the downscale_height parameter downscales the image:

('image_torch.shape = ', (1, 3, 400, 711))

which again helps Q2, but makes the estimates for the can inaccurate

However, the algorithm is still slow:

[INFO] [1579302527.075795]: find_object_poses time elapsed = 0.67188000679 

TontonTremblay commented 4 years ago

I think you will have to go down the rabbit hole at this point and find where it is taking the longest. Which GPU are you running it on?

huckl3b3rry87 commented 4 years ago

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.82                 Driver Version: 396.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P1000        Off  | 00000000:02:00.0  On |                  N/A |
| 79%   82C    P0    N/A /  N/A |   3425MiB /  4031MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TITAN   Off  | 00000000:03:00.0 Off |                  N/A |
| 30%   38C    P8    13W / 250W |     12MiB /  6083MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

But it seems to be the Python code that is choking:

[image: profiling info]

TontonTremblay commented 4 years ago

Yeah, once you start looking into the details... I am sorry about this; I wish I could help more.

huckl3b3rry87 commented 4 years ago

No worries! You have helped me a bunch today and I greatly appreciate it!

I am going to try to profile the find_objects function now. Do you mind checking how long find_objects takes for you?
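(For reference, a minimal profiling sketch, reusing the names from the snippets above and wrapping the existing call inside find_object_poses:)

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()
    objects, all_peaks = ObjectDetector.find_objects(vertex2, aff, config)
    profiler.disable()
    # print the 15 most expensive calls by cumulative time
    pstats.Stats(profiler).sort_stats('cumulative').print_stats(15)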

huckl3b3rry87 commented 4 years ago

I have narrowed down the bottleneck further: it seems that the first time the cpu() function is run, it takes significant time to cache the function

[INFO] [1579306446.700360]: tl = 0.920029163361 
[INFO] [1579306446.702060]: tl = 8.70227813721e-05 
[INFO] [1579306446.703345]: tl = 5.79357147217e-05 
[INFO] [1579306446.704589]: tl = 6.8187713623e-05 
[INFO] [1579306446.705921]: tl = 6.41345977783e-05 
[INFO] [1579306446.707309]: tl = 7.48634338379e-05 
[INFO] [1579306446.708583]: tl = 5.29289245605e-05 
[INFO] [1579306446.709839]: tl = 5.19752502441e-05 
[INFO] [1579306446.711160]: tl = 6.89029693604e-05 

where the profiling code is:

        tl = rospy.get_time()
        map_ori = belief.cpu().data.numpy()
        rospy.loginfo("tl = %s ", rospy.get_time() - tl)

This issue is discussed here. After throwing in

torch.cuda.synchronize() # wait for mm to finish

above the for loop, the first call is no longer stuck waiting on the GPU to finish calculating:

[INFO] [1579307137.226423]: tl = 0.000190019607544 
[INFO] [1579307137.228368]: tl = 8.58306884766e-05 
[INFO] [1579307137.229756]: tl = 6.41345977783e-05 
[INFO] [1579307137.231103]: tl = 5.29289245605e-05 
[INFO] [1579307137.232483]: tl = 5.41210174561e-05 
[INFO] [1579307137.233843]: tl = 5.41210174561e-05 
[INFO] [1579307137.235158]: tl = 5.3882598877e-05 
[INFO] [1579307137.236510]: tl = 8.32080841064e-05 
[INFO] [1579307137.238504]: tl = 0.000121116638184 

So, even though the CNN is not done calculating, Python continues to run the code until the point where it needs to use the result. Thus, I think the bottleneck is the CNN.

Does this make sense?

huckl3b3rry87 commented 4 years ago

This can be seen if the following code is used:

    @staticmethod
    def detect_object_in_image(net_model, pnp_solver, in_img, config):
        '''Detect objects in an image using a specific trained network model'''
        to = rospy.get_time()

        if in_img is None:
            return []

        # Run network inference
        image_tensor = transform(in_img)
        image_torch = Variable(image_tensor).cuda().unsqueeze(0)
        out, seg = net_model(image_torch)
        vertex2 = out[-1][0]
        aff = seg[-1][0]
        torch.cuda.synchronize() # wait for mm to finish
        rospy.loginfo("network time elapsed = %s ", rospy.get_time() - to)

        # Find objects from network output
        detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config)

        return detected_objects

which produces

[INFO] [1579308425.065107]: network time elapsed = 1.07668304443 
[INFO] [1579308425.074873]: find_objects time elapsed = 0.00894594192505 
[INFO] [1579308425.076703]: solve_pnp time elapsed = 0.00133299827576

huckl3b3rry87 commented 4 years ago

It seems that the code was only running on my GPU 0, which has a run time of

[INFO] [1579309331.635676]: network time elapsed = 0.941685914993 

after adding

import os
os.environ['CUDA_VISIBLE_DEVICES']='1'

to detector.py

I can cut the time in half

[INFO] [1579309151.256915]: network time elapsed = 0.521842956543 

This also mitigates Q2 but does NOT resolve it; I am still seeing up to a 1-2 second difference between the current time and the images that I am processing.

huckl3b3rry87 commented 4 years ago

Not sure why, but when I put

os.environ['CUDA_VISIBLE_DEVICES']='0,1'

it gives me the slow time; I would assume it would use both GPUs to attain a faster speed.
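(My best guess at why, with a sketch: with both devices visible, PyTorch still places tensors on cuda:0 by default, which here is the slower Quadro P1000, and a single forward pass never spreads across GPUs on its own.)

    import torch

    # With CUDA_VISIBLE_DEVICES='0,1', plain .cuda() calls land on cuda:0
    # (the Quadro P1000 in this setup). Select the faster device explicitly:
    torch.cuda.set_device(1)   # subsequent .cuda() calls use GPU 1

    # Using both GPUs would need something like torch.nn.DataParallel,
    # which splits a batch across devices (no help at batch size 1).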

huckl3b3rry87 commented 4 years ago

I think I got it: if I crop down my image

        img = img[0:1080, 700:1220]

both issues are largely mitigated and the network runs at about 4 Hz:

[INFO] [1579311237.426187]: network time elapsed = 0.23420214653 
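(Putting the pieces together, a sketch of the preprocessing that got me here; the crop bounds are specific to my camera mounting:)

    import cv2

    def preprocess(img, shortest_side=400):
        # Crop away the distorted borders so the CNN sees fewer pixels;
        # inference time grows roughly with height * width.
        img = img[0:1080, 700:1220]
        # Downscale so the smallest side is ~400, as suggested above.
        scale = shortest_side / float(min(img.shape[:2]))
        if scale < 1.0:
            img = cv2.resize(img, None, fx=scale, fy=scale)
        return img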

mintar commented 4 years ago

I think you've got it figured out, but I'd still like to provide a few clarifying comments. Here is what I think is going on:

This leaves you with the following options:

By the way, since the Kinect2 has such a large field of view, the effect you are seeing is exaggerated (since a large field of view means that objects appear smaller in the image).

And since you mentioned it earlier: The camera_info.yaml is only used if you use the simplified dope_webcam node as a camera driver. DOPE itself correctly uses the proper camera_info topic instead.