Open jgyy4775 opened 4 years ago
You can expect about 10-12 frames a second. So no it is not real time! There are ways to make it faster, but it wont get to real time.
Not real time?? How was the video uploaded to YouTube taken? Did you use pre-recorded video as input to this model and output the results? Not real-time?...
You are correct, it was pre-processed!
@mintar || @TontonTremblay || @sbirchfield So, the package can process 10-12 frames a second to produce 10-12 pose estimates a second? i.e., ~10Hz, which seems decent. However, it seems that it is much slower than this, using only the soup can I am only getting about 1.5 Hz. , is the PnP algorithm
[INFO] [1579205878.868495]: network time elapsed = 0.0372869968414
[INFO] [1579205879.379344]: PnP algorithm time elapsed = 0.54815196991
Q1: Is this the expected bottleneck? Or and thoughts as to why PnP is chocking so badly for me?
Q2: On another front there is a buffering issue for me. Looking at this video it seems that my DOPE is running much much slower, about 10 second lag.
To address this, I was trying to only process the latest information, so I set the queue_size=1
in the callback that processes each image. However, it seems that there is still a queue and the lag is still around 10 seconds, and it is processing a bunch of intermediate frames, which I would think should be discarded. Any ideas why this would not neglect old frames?
P.S.
Ok, so I fixed the buffering issue (Q2) by adjusting the buff_size
rospy.Subscriber(rospy.get_param('~topic_camera'), ImageSensor_msg, self.callback, queue_size = 1, buff_size = 52428800)
So, I just have Q1 left
I just watched your video and your definitely getting performance that are lower than mine. Please check this video I just made (I am running on a new object). https://drive.google.com/file/d/1vO9pd737WsWD1gAQUGW8Eu8Xf-H6LZox/view?usp=sharing
I am running my setting on a 2080ti. Could you share with me the snippet of code you use to check the performance. PnP should be very quick, I am not sure why you are getting that sort of behaviour.
Thank you for playing around with DOPE. :P
Sorry, that was not my video, I copied it from another issue mine is much much worse, I am actually back to seeing ~4 second delays now that I increased the downscale_image parameter back to 500. So, my Q2 remains.
Wow!! That looks great! Thank you for uploading that for me!!
This is the code where I check performance
class ObjectDetector(object):
'''This class contains methods for object detection'''
@staticmethod
def detect_object_in_image(net_model, pnp_solver, in_img, config):
'''Detect objects in a image using a specific trained network model'''
to = rospy.get_time()
if in_img is None:
return []
# Run network inference
image_tensor = transform(in_img)
image_torch = Variable(image_tensor).cuda().unsqueeze(0)
out, seg = net_model(image_torch)
vertex2 = out[-1][0]
aff = seg[-1][0]
rospy.loginfo("network time elapsed = %s ", rospy.get_time() - to)
# Find objects from network output
detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config)
rospy.loginfo("PnP algorithm time elapsed = %s ", rospy.get_time() - to)
return detected_objects
Of course, thank you for developing it in such an elegant and user friendly way! I am excited to see that dope has a lot of potential once I figure out my troubles!
Also, setting the queue_size for the publisher does not seem to change anything
Your second log has a lot of things going on in there, this is where the the keypoints are found and then where pnp is computed. Could you print vertex2.shape
and print image_torch.shape
? I am not sure what is going on on your end.
Thanks you for the quick reply!
Loading DOPE model '/root/visuotactile_pose_estimation_ws/src/pose_estimator/weights/soup_18.pth'...
Model loaded in 8.31219816208 seconds.
Loading DOPE model '/root/visuotactile_pose_estimation_ws/src/pose_estimator/weights/mustard_15.pth'...
Model loaded in 3.87688589096 seconds.
Running DOPE...
Listening to camera topic: '/kinect2_1/hd/image_color_rect')
Listening to camera info topic: '/dope/webcam/camera_info')
Ctrl-C to stop
[INFO] [1579291163.855070]: DELAY = 0
[INFO] [1579291163.929532]: network time elapsed = 0.0613830089569
[INFO] [1579291164.912217]: PnP algorithm time elapsed = 1.04412794113
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291164.962395]: network time elapsed = 0.0487589836121
[INFO] [1579291165.950094]: PnP algorithm time elapsed = 1.03644800186
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291165.969721]: DELAY = 2
[INFO] [1579291166.023808]: network time elapsed = 0.0455369949341
[INFO] [1579291167.015221]: PnP algorithm time elapsed = 1.03698301315
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291167.063486]: network time elapsed = 0.0469698905945
[INFO] [1579291168.055643]: PnP algorithm time elapsed = 1.03913998604
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
[INFO] [1579291168.073561]: DELAY = 5
[INFO] [1579291168.126777]: network time elapsed = 0.0451209545135
[INFO] [1579291169.129803]: PnP algorithm time elapsed = 1.04817199707
('vertex2.shape = ', (9, 62, 111))
('image_torch.shape = ', (1, 3, 500, 888))
where DELAY
is the difference between the image timestamp and the current time, which I compute in a modified version of image_callback
called callback
, where I hardcode the camera parameters
def callback(self, image_msg):
# for debugging
now = rospy.get_rostime()
rospy.loginfo("DELAY = %s ", now.secs - image_msg.header.stamp.secs )
which is related to Q1
as you suggested, the pnp algorithm is not the bottleneck, it is the find_objects function
[INFO] [1579291824.851107]: find_object_poses time elapsed = 0.989943027496
[INFO] [1579291824.852893]: solve_pnp time elapsed = 0.00126791000366
where time is calculated as
def find_object_poses(vertex2, aff, pnp_solver, config):
'''Detect objects given network output'''
to = rospy.get_time()
# Detect objects from belief maps and affinities
objects, all_peaks = ObjectDetector.find_objects(vertex2, aff, config)
rospy.loginfo("find_object_poses time elapsed = %s ", rospy.get_time() - to)
detected_objects = []
obj_name = pnp_solver.object_name
to = rospy.get_time()
for obj in objects:
# Run PNP
points = obj[1] + [(obj[0][0]*8, obj[0][1]*8)]
cuboid2d = np.copy(points)
location, quaternion, projected_points = pnp_solver.solve_pnp(points)
rospy.loginfo("solve_pnp time elapsed = %s ", rospy.get_time() - to)
# Save results
detected_objects.append({
'name': obj_name,
'location': location,
'quaternion': quaternion,
'cuboid2d': cuboid2d,
'projected_points': projected_points,
'score': obj[-1],
})
return detected_objects
Can you try with image size of 400? Have you tried running the provided weights? I think the problem comes from the fact that your weights are not well trained and in one image there are a lot of points that the algorithm finds, so this causes the heuristic to take a while. Can you visualize the belief maps? I am not sure I provided code that, so here can you pass the vertex2
map to this function.
def ShowBelief(tensor,name='tmp'):
for j in range(tensor.size()[0]):
belief = tensor[j].clone()
belief -= float(torch.min(belief)[0].data.cpu().numpy())
belief /= float(torch.max(belief)[0].data.cpu().numpy())
belief = m(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data
belief = torch.clamp(belief,0,1)
belief = torch.cat([belief.unsqueeze(0),belief.unsqueeze(0),belief.unsqueeze(0)]).unsqueeze(0)
temp = Variable(belief.clone())
cv2.imwrite('belief_{}_v{}.png'.format(str(name),str(j)),temp.data.squeeze().cpu().numpy().transpose(1,2,0)*255)
This will create a bunch of pngs that should look like this:
Have you tried running the provided weights?
find_object_poses
function
[INFO] [1579298280.609958]: find_object_poses time elapsed = 1.13880300522
where the output is
Can you try with image size of 400?
Does this mean set downscale_height
to 400 ? Assuming the answer is yes, when I reduce downscale_height it speeds up some
[INFO] [1579298677.286471]: find_object_poses time elapsed = 0.779462099075
but, similar to this issue, the result does not look too good
Can you visualize the belief maps?
m()
[ERROR] [1579297941.132004]: bad callback: <bound method DopeNode.callback of <__main__.DopeNode object at 0x7f3bae271450>>
Traceback (most recent call last):
File "/opt/ros/kinetic/lib/python2.7/dist-packages/rospy/topics.py", line 750, in _invoke_callback
cb(msg)
File "/root/visuotactile_pose_estimation_ws/src/pose_estimator/nodes/dope", line 265, in callback
self.config_detect
File "/root/visuotactile_pose_estimation_ws/src/pose_estimator/src/dope/inference/detector.py", line 275, in detect_object_in_image
ShowBelief(vertex2)
File "/root/visuotactile_pose_estimation_ws/src/pose_estimator/src/dope/inference/detector.py", line 44, in ShowBelief
belief = m(belief.unsqueeze(0).unsqueeze(0)).squeeze().squeeze().data
NameError: global name 'm' is not defined
Can you please provide me with m()
Thanks!
Comment out the m function, this is a upscale function.
Comments
image_width
and image_height
parameters, defined in camera_info.yaml
are not actually usedimage_width: 1080
image_height: 1920
Here i think we should check and pick the max between image_width
and image_height
. something like
if width > height:
scaling_factor = float(self.downscale_height) / width
else:
scaling_factor = float(self.downscale_height) / height
in which case, downscale_height
should probably be changed to downscale_image
or something
However, when I scale down according to the width it is still slow, I think maybe the edges of the image should be cropped since they have that fisheye type view
The image has to be downscale where the smallest length is around 400. I dont remember where this happens. ('image_torch.shape = ', (1, 3, 500, 888))
where 500 should 400. I am not sure where you are getting this slow. I am running out of ideas. The belief maps look very good to me, thus the algorithm should in par with what I have.
Also in the image you shared there is a lot of fish eye effect. Is that normal for kinect?
Yes, I think the fisheye effect is normal on such wide cameras like go pro.
Thanks for your help, I will try to downscale the image
yes, the downscale_height
parameter downscales the image
('image_torch.shape = ', (1, 3, 400, 711))
which again, helps Q2, but makes it inaccurate
however, this algorithm is still slow
[INFO] [1579302527.075795]: find_object_poses time elapsed = 0.67188000679
I think you will have to go down the rabbit hole at this moment and find where it is taking the longest. Which GPU are you running it on?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.82 Driver Version: 396.82 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P1000 Off | 00000000:02:00.0 On | N/A |
| 79% 82C P0 N/A / N/A | 3425MiB / 4031MiB | 100% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TITAN Off | 00000000:03:00.0 Off | N/A |
| 30% 38C P8 13W / 250W | 12MiB / 6083MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
but it seems to be the python code that is choking
Yeah once you start looking into the details. I am sorry about this, I wish I could help more.
No worries! You have helped me a bunch today and I greatly appreciate it!
I am going to try and profile the find_objects
function now. Do you mind checking how long find_objects
takes for you?
I have narrowed down the bottleneck further, it seems that on the first time that the cpu() function is ran it takes significant time to cache the function
[INFO] [1579306446.700360]: tl = 0.920029163361
[INFO] [1579306446.702060]: tl = 8.70227813721e-05
[INFO] [1579306446.703345]: tl = 5.79357147217e-05
[INFO] [1579306446.704589]: tl = 6.8187713623e-05
[INFO] [1579306446.705921]: tl = 6.41345977783e-05
[INFO] [1579306446.707309]: tl = 7.48634338379e-05
[INFO] [1579306446.708583]: tl = 5.29289245605e-05
[INFO] [1579306446.709839]: tl = 5.19752502441e-05
[INFO] [1579306446.711160]: tl = 6.89029693604e-05
where profile code is
tl = rospy.get_time()
map_ori = belief.cpu().data.numpy()
rospy.loginfo("tl = %s ", rospy.get_time() - tl)
this issue is discussed here. After throwing in
torch.cuda.synchronize() # wait for mm to finish
above the for loop
The first time is not stuck waiting on the GPU to finish calculating
[INFO] [1579307137.226423]: tl = 0.000190019607544
[INFO] [1579307137.228368]: tl = 8.58306884766e-05
[INFO] [1579307137.229756]: tl = 6.41345977783e-05
[INFO] [1579307137.231103]: tl = 5.29289245605e-05
[INFO] [1579307137.232483]: tl = 5.41210174561e-05
[INFO] [1579307137.233843]: tl = 5.41210174561e-05
[INFO] [1579307137.235158]: tl = 5.3882598877e-05
[INFO] [1579307137.236510]: tl = 8.32080841064e-05
[INFO] [1579307137.238504]: tl = 0.000121116638184
So, even thought the CNN is not done calculating it continues to run the code until a point where it needs to use the result. Thus, I think the bottleneck is the CNN.
Does this make sense?
This can be seen if the following code is used
def detect_object_in_image(net_model, pnp_solver, in_img, config):
'''Detect objects in a image using a specific trained network model'''
to = rospy.get_time()
if in_img is None:
return []
# Run network inference
image_tensor = transform(in_img)
image_torch = Variable(image_tensor).cuda().unsqueeze(0)
out, seg = net_model(image_torch)
vertex2 = out[-1][0]
aff = seg[-1][0]
torch.cuda.synchronize() # wait for mm to finish
rospy.loginfo("network time elapsed = %s ", rospy.get_time() - to)
# Find objects from network output
detected_objects = ObjectDetector.find_object_poses(vertex2, aff, pnp_solver, config)
return detected_objects
which produces
[INFO] [1579308425.065107]: network time elapsed = 1.07668304443
[INFO] [1579308425.074873]: find_objects time elapsed = 0.00894594192505
[INFO] [1579308425.076703]: solve_pnp time elapsed = 0.00133299827576
It seems that the code was only running on my GPU 0, which has a run time of
[INFO] [1579309331.635676]: network time elapsed = 0.941685914993
after adding
import os
os.environ['CUDA_VISIBLE_DEVICES']='1'
to detector.py
I can cut the time in half
[INFO] [1579309151.256915]: network time elapsed = 0.521842956543
this also mitigates Q2 but does NOT resolve it, I am still seeing up to a 1-2 second difference between the current time and the images that I am processing
not sure why when I put
os.environ['CUDA_VISIBLE_DEVICES']='0,1'
it gives me the slow time, I would assume it would use both GPUs to attain a faster speed
Think I got it, if I crop down my image
img = img[0:1080, 700:1220]
both issues are largely mitigated and the network runs at about 4 Hz
[INFO] [1579311237.426187]: network time elapsed = 0.23420214653
I think you've got it figured out, but I'd still like to provide a few clarifying comments. Here is what I think is going on:
This leaves you with the following options:
By the way, since the Kinect2 has such a large field of view, the effect you are seeing is exaggerated (since a large field of view means that objects appear smaller in the image).
And since you mentioned it earlier: The camera_info.yaml is only used if you use the simplified dope_webcam
node as a camera driver. DOPE itself correctly uses the proper camera_info
topic instead.
I tried this, but it was too slow. Is not detecting objects in real time? My computer has the same performance as yours. Did I do something wrong? (And I used Kinect.)