carla-simulator / carla

Open-source simulator for autonomous driving research.
http://carla.org
MIT License

8 points pixel coordinates (3D bounding box) #772

Closed mhusseinsh closed 5 years ago

mhusseinsh commented 5 years ago

Hello @marcgpuig, regarding my earlier comment: I am confused by this part of the code, so let me ask my question more precisely.

def _on_render(self):
    if self._main_image is not None:
        array = image_converter.to_rgb_array(self._main_image)
        array.setflags(write=1)
        for agent in self._measurements.non_player_agents:
            if agent.HasField('pedestrian'):
                # get the needed transformations
                # remember to explicitly make it Transform() so you can use transform_points()
                pedestrian_transform = Transform(agent.pedestrian.transform)
                bbox_transform = Transform(agent.pedestrian.bounding_box.transform)

                # get the box extent
                ext = agent.pedestrian.bounding_box.extent
                # 8 bounding box vertices relative to (0,0,0)
                bbox = np.array([
                    [  ext.x,   ext.y,   ext.z],
                    [- ext.x,   ext.y,   ext.z],
                    [  ext.x, - ext.y,   ext.z],
                    [- ext.x, - ext.y,   ext.z],
                    [  ext.x,   ext.y, - ext.z],
                    [- ext.x,   ext.y, - ext.z],
                    [  ext.x, - ext.y, - ext.z],
                    [- ext.x, - ext.y, - ext.z]
                ])

                # transform the vertices respect to the bounding box transform
                bbox = bbox_transform.transform_points(bbox)

                # the bounding box transform is respect to the pedestrian transform
                # so let's transform the points relative to it's transform
                bbox = pedestrian_transform.transform_points(bbox)

                # pedestrian's transform is relative to the world, so now,
                # bbox contains the 3D bounding box vertices relative to the world

                # Additionally, you can print these vertices to check that is working
                for vertex in bbox:
                    pos_vector = np.array([
                        [vertex[0,0]],  # [[X,
                        [vertex[0,1]],  #   Y,
                        [vertex[0,2]],  #   Z,
                        [1.0]           #   1.0]]
                    ])
                    # transform the points to camera
                    transformed_3d_pos = np.dot(inv(self._extrinsic.matrix), pos_vector)
                    # transform the points to 2D
                    pos2d = np.dot(self._intrinsic, transformed_3d_pos[:3])

                    # normalize the 2D points
                    pos2d = np.array([
                        pos2d[0] / pos2d[2],
                        pos2d[1] / pos2d[2],
                        pos2d[2]
                    ])

                    # print the points in the screen
                    if pos2d[2] > 0: # if the point is in front of the camera
                        x_2d = WINDOW_WIDTH - pos2d[0]
                        y_2d = WINDOW_HEIGHT - pos2d[1]
                        draw_rect(array, (y_2d, x_2d), 4, rand_color(agent.id))

        surface = pygame.surfarray.make_surface(array.swapaxes(0, 1))
        self._display.blit(surface, (0, 0))

        [...]

Which variable exactly holds the (x, y) coordinates of the 8 vertices of the 3D bounding box in pixel coordinates, not world coordinates? For example, for the current frame I would expect 8 x pixel values and 8 y pixel values, with respect to the origin (0, 0) at the upper-left corner of the image (the usual image origin).

Your help is appreciated; I am really confused on this point, which is why I thought of relating them to point_0 and point_1 in the draw_rect() function.

marcgpuig commented 5 years ago

@mhusseinsh This line

for vertex in bbox:

iterates over all 8 vertices; for each vertex, x_2d is the horizontal 2D position and y_2d is the vertical 2D position, measured from the upper-left corner.

mhusseinsh commented 5 years ago

@marcgpuig So I printed bbox before the iteration; this is what I get:

[[205.23460105  60.24636716  40.84817564]
 [199.86540798  60.24637574  40.84817564]
 [205.23459813  58.41365844  40.84817564]
 [199.86540505  58.41366702  40.84817564]
 [205.23460105  60.24636716  39.43307614]
 [199.86540798  60.24637574  39.43307614]
 [205.23459813  58.41365844  39.43307614]
 [199.86540505  58.41366702  39.43307614]]

These points are not in the image plane, right?

And here are x_2d and y_2d:

[[6.95273012]] [[299.90550021]] but these are floating-point numbers, so they are not pixel coordinates, or am I mistaken?

marcgpuig commented 5 years ago

@mhusseinsh The first set of points is in 3D world space. The second values are correct too: these are the exact 2D points, but since your visualizer is a screen that works with pixels, you need to cast them to integers.
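For example, something along these lines (a minimal sketch reusing the variable names from the snippet above):

x_pix = int(round(float(x_2d)))   # nearest integer pixel column
y_pix = int(round(float(y_2d)))   # nearest integer pixel row
if 0 <= x_pix < WINDOW_WIDTH and 0 <= y_pix < WINDOW_HEIGHT:
    array[y_pix, x_pix] = (255, 0, 0)   # e.g. mark that vertex in the RGB array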

mhusseinsh commented 5 years ago

@marcgpuig OK, perfect, I understand this. One last question though. In this part:

# normalize the 2D points
pos2d = np.array([
    pos2d[0] / pos2d[2],
    pos2d[1] / pos2d[2],
    pos2d[2]
])

we only keep the points which are in the positive z direction of the car, which means all objects in front of the car. In the draw_rect() function there is this part of the code:

    if point_in_canvas(point_0) and point_in_canvas(point_1):
        for i in range(size):
            for j in range(size):
                array[int(point_0[0]+i), int(point_0[1]+j)] = color

so this is filtering out the points which are not within the image size defined by WINDOW_HEIGHT and WINDOW_WIDTH?
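If I read it correctly, point_in_canvas() is just a bounds check along these lines (my paraphrase, not the actual function):

def point_in_canvas(pos):
    # pos is (row, col); keep only pixels that fall inside the rendered image
    return (0 <= pos[0] < WINDOW_HEIGHT) and (0 <= pos[1] < WINDOW_WIDTH)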

So as I understand it: first we loop over all non-player agents, then we choose the ones which are vehicles or pedestrians (depending on what we want), then we filter the ones which are in the positive z direction, and then we filter the ones which are within the camera frame?

Correct me if I am wrong

I will try to describe my idea: I want to use the bounding box coordinates and compare them with the objects present in the semantic segmentation frame. So if I take the 8 box coordinates from x_2d and y_2d as you said, there will already be objects which are not visible in the scene but for which I still have coordinates, right?

marcgpuig commented 5 years ago

@mhusseinsh

So as I understand it: first we loop over all non-player agents, then we choose the ones which are vehicles or pedestrians (depending on what we want), then we filter the ones which are in the positive z direction, and then we filter the ones which are within the camera frame?

This is correct!
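Condensed, the order of checks in the snippet above is roughly the following (a sketch only; bbox_world_vertices() and project_to_image() are hypothetical helpers standing in for the transform/projection code shown earlier):

for agent in measurements.non_player_agents:              # 1) loop over every non-player agent
    if not agent.HasField('pedestrian'):                  # 2) keep only the agent class you want
        continue
    for vertex in bbox_world_vertices(agent):             # hypothetical helper: the 8 world-space vertices
        x_2d, y_2d, depth = project_to_image(vertex)      # hypothetical helper: extrinsic + intrinsic projection
        if depth <= 0:                                     # 3) discard vertices behind the camera
            continue
        if point_in_canvas((y_2d, x_2d)):                  # 4) keep only vertices inside the frame
            draw_rect(array, (y_2d, x_2d), 4, rand_color(agent.id))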

So if I take the 8 box coordinates from x_2d and y_2d as you said, there will already be objects which are not visible in the scene but for which I still have coordinates, right?

If by 'not available in the scene' you mean that the CENTER is not in the scene, then yes. Maybe you see the front part of a car but the (center) position of that car is not in your image; you still know that the car is in the image because some vertex of the bbox is inside.

mhusseinsh commented 5 years ago

@mhusseinsh

So as I understand it: first we loop over all non-player agents, then we choose the ones which are vehicles or pedestrians (depending on what we want), then we filter the ones which are in the positive z direction, and then we filter the ones which are within the camera frame?

This is correct!

I thought about something; maybe you can give a hint on this. Instead of doing all these filters to discard or keep objects, why don't we just filter the objects which are inside the camera FOV? Is this possible or doable? I mean, we already know the camera FOV, which should be 90, and we have RotationPitch, RotationYaw, RotationRoll set to (0, 0, 0). What I don't know (so maybe you can help if you have an idea about this) is whether there is a way to only consider objects which are inside the camera FOV.

If by 'not available in the scene' you mean that the CENTER is not in the scene, then yes. Maybe you see the front part of a car but the (center) position of that car is not in your image; you still know that the car is in the image because some vertex of the bbox is inside.

Ah, so here we don't only check the center of the object: if at least one of the vertices is inside the scene, an x_2d and y_2d will be available for this object (but certainly not a complete bounding box). Is this also applicable in the case of semantic segmentation, i.e. if part of the object is visible in the scene, is it shown in the semantic segmentation image? Because, as I told you, my idea is to merge the detected bounding boxes with the objects in the semantic segmentation.

Concerning pos2d in this part, after the transformation and everything:

if pos2d[2] > 0: # if the point is in front of the camera
    x_2d = WINDOW_WIDTH - pos2d[0]
    y_2d = WINDOW_HEIGHT - pos2d[1]

what are the units of pos2d[2] (the z axis)? Are they meters? I am working on discarding the occluded objects, which is why I want to use the depth map information and compare it with the position of the object relative to the car; I think pos2d[2] is the z coordinate of each bounding box vertex.

mhusseinsh commented 5 years ago

Ah, so here we don't only check the center of the object: if at least one of the vertices is inside the scene, an x_2d and y_2d will be available for this object (but certainly not a complete bounding box). Is this also applicable in the case of semantic segmentation, i.e. if part of the object is visible in the scene, is it shown in the semantic segmentation image? Because, as I told you, my idea is to merge the detected bounding boxes with the objects in the semantic segmentation.

@marcgpuig For this part of my question, I managed to get incomplete bounding boxes when part of the object has already disappeared from the scene.

I thought about something; maybe you can give a hint on this. Instead of doing all these filters to discard or keep objects, why don't we just filter the objects which are inside the camera FOV? Is this possible or doable? I mean, we already know the camera FOV, which should be 90, and we have RotationPitch, RotationYaw, RotationRoll set to (0, 0, 0). What I don't know (so maybe you can help if you have an idea about this) is whether there is a way to only consider objects which are inside the camera FOV.

And I think this is effectively what all of these for loops and conditions that discard objects are doing, but maybe there is an easier way that uses the FOV information directly. I am not really sure, but anyway this is not important for me now.

What I am struggling with now is that I managed to get 2D bounding boxes around objects in the scene (somehow); I only have problems with occluded objects. To give an example: if there is a car driving, and there is a building behind it, and directly behind the building there is another car, then theoretically the hidden car should not get a bounding box, but it does, as the image shows; you can see the same for the pedestrians on the right.

Why am I getting this? Because I am only using the semantic segmentation info in parallel to filter out the unneeded objects. That's why I think I also need depth, but I am not sure how to use it. So my question was about pos2d[2]: does it help in any way, or what is your hint/tip for this problem?

P.S.: I think the depth information will also be very important, because I am currently depending on the semantic segmentation only, and it is very rich in labels (every single pixel is labeled), so it picks up objects which are really far away, as you can see in my example; I somehow need to filter out the far objects as well.

marcgpuig commented 5 years ago

@mhusseinsh

[[6.95273012]] [[299.90550021]] but these are floating-point numbers, so they are not pixel coordinates, or am I mistaken?

Remember these floating-point values? Take the integers that surround them; those will be pixel coordinates, right? Use these coordinates to look into the depth image and get the camera distance at those points. If those points are behind the vertex, the vertex is visible. Easy as that!
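A rough sketch of that idea (untested; depth_array is the output of image_converter.depth_to_array(), whose values are normalized by the 1000 m far plane, and vertex_depth would be the vertex's distance along the viewing direction, e.g. pos2d[2] from the snippet above):

def vertex_visible(x_2d, y_2d, vertex_depth, depth_array, margin=0.5):
    u = int(round(float(x_2d)))                   # integer pixel column
    v = int(round(float(y_2d)))                   # integer pixel row
    if not (0 <= u < WINDOW_WIDTH and 0 <= v < WINDOW_HEIGHT):
        return False                              # projected outside the image
    scene_depth = depth_array[v, u] * 1000.0      # meters to whatever the camera actually sees there
    # if the visible surface is at or behind the vertex, nothing hides it
    return scene_depth + margin >= float(vertex_depth)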

mhusseinsh commented 5 years ago

@marcgpuig Thanks Marc. The vertex you are talking about, is it pos2d[2] in that case? So what you are saying is that I have the bounding box coordinates, let's say:

[[205.23460105  60.24636716  40.84817564]
 [199.86540798  60.24637574  40.84817564]
 [205.23459813  58.41365844  40.84817564]
 [199.86540505  58.41366702  40.84817564]
 [205.23460105  60.24636716  39.43307614]
 [199.86540798  60.24637574  39.43307614]
 [205.23459813  58.41365844  39.43307614]
 [199.86540505  58.41366702  39.43307614]]

Take the first coordinate as an example: [205.23460105 60.24636716 40.84817564], where x_2d = 205.23460105, y_2d = 60.24636716, z_2d = 40.84817564.

Taking their integer parts, x_2d = 205 and y_2d = 60. What I do then is check the depth image at pixel (205, 60) and get its value. So you are saying that I compare this value to z_2d? Whether it is greater or smaller?

Because in image_converter.py there is the function depth_to_array(image), which returns the depth value of each pixel normalized to [0.0, 1.0]; I assume that this is the value I should look at, but what do I compare it to?
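(If I understand the docs correctly, that normalized value corresponds to the 1000 m far plane, so for some pixel (v, u) it can at least be brought back to meters first:)

normalized = depth_labels[v, u]          # value in [0.0, 1.0] from depth_to_array()
depth_in_meters = normalized * 1000.0    # assuming the 1000 m far plane of CARLA's depth camera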

mhusseinsh commented 5 years ago

@marcgpuig Or maybe I misunderstood. I read this comment again, and I will try to say what I understood:

From the 8 points of the 3D bounding box, I take just 4 points to get the 2D bounding box. For each of these 4 points I get its depth: say I have a point (x, y), I get the depth of this point d0, and then I check the depth of the 4 surrounding points:

d1 = depth[x+1, y]
d2 = depth[x-1, y]
d3 = depth[x, y+1]
d4 = depth[x, y-1]

If d1, d2, d3 and d4 are all smaller than d0, then this coordinate is occluded?

If this is what you mean, then here is a code example of how I did it:

def check_depth(left, right, top, bottom, depth_labels):
    # the four corner pixels of the 2D bounding box, as (row, col)
    corners = [(top, left), (top, right), (bottom, left), (bottom, right)]

    for y, x in corners:
        corner_depth = depth_labels[y, x]
        # depth of the four pixels surrounding this corner
        surrounding = [depth_labels[y - 1, x], depth_labels[y + 1, x],
                       depth_labels[y, x - 1], depth_labels[y, x + 1]]
        # if any surrounding pixel is not closer than the corner itself,
        # this corner is not covered by something in front of it
        if not all(d < corner_depth for d in surrounding):
            print('object not occluded')
            return False

    # every corner has all four neighbours closer than itself
    print('object occluded')
    return True

But unfortunately every object comes out as not occluded, so please correct me if I am wrong.

And even if this approach is correct and this is what you mean, I don't fully get it. Let's say we have a car behind a building. Technically I have its coordinates, because it passed all the previous filters: agent.HasField('vehicle') is true, pos2d[2] > 0, and it lies inside the window frame, 0 to WINDOW_WIDTH in the x direction and 0 to WINDOW_HEIGHT in the y direction.

Now, as the next step, I want to filter this point out, because it is not visible in the scene; it is occluded by the building.

I get the 8 points of the 3D bounding box of this car; let's say one of them is (100, 150). If I check the depth of this point in the depth array and of its surrounding points (101, 150), (99, 150), (100, 151), (100, 149), they should all have the same depth. This has nothing to do with the depth of the car, because what is visible in the depth image at that pixel is the building, not the car; so the depth at this pixel (which corresponds to the building) and the depth of the surrounding pixels (which also correspond to the building) will be the same. Am I right or wrong?

mhusseinsh commented 5 years ago

@marcgpuig What is the bounding box vertex? Is it the 3rd coordinate of each row of the 8x3 bounding box matrix I mentioned here? If yes, then how is it compared? That value is mostly above 1.0, which is not comparable to the value returned from the depth image by depth_to_array(image). Oh, your comment has already been deleted, so this was a reply to what you commented a while ago.

mhusseinsh commented 5 years ago

@marcgpuig Hi Marc, any idea about my issue? I am stuck on this right now, and I believe it is my final step to remove all objects which are not in the scene. Your help would be appreciated.

marcgpuig commented 5 years ago

@mhusseinsh I have no time to test this for you, sorry :( Another thing you can do is just take the 3D distance between this vertex and the camera itself, and compare it with the 3D distance between the camera and these 4 points you found earlier. If you want an example of how to get the 3D position of one pixel on the screen, you can check point_cloud_example.py. I think this approach will be easier!
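Something along these lines (just a sketch, not tested; k is the camera intrinsic built as in point_cloud_example.py, depth_labels comes from image_converter.depth_to_array(), and vertex_cam is the camera-space position of the bounding-box vertex, i.e. transformed_3d_pos from the earlier snippet):

import numpy as np

def occluded_by_depth(u, v, vertex_cam, k, depth_labels, far=1000.0, margin=0.5):
    # back-project pixel (u, v) into a 3D point in the camera frame
    ray = np.dot(np.linalg.inv(k), np.array([u, v, 1.0]))
    scene_point = ray * depth_labels[v, u] * far
    # distance camera -> surface seen at that pixel vs. distance camera -> bbox vertex
    scene_dist = np.linalg.norm(scene_point)
    vertex_dist = np.linalg.norm(np.asarray(vertex_cam, dtype=float).flatten()[:3])
    # if the visible surface is clearly closer than the vertex, something hides it
    return scene_dist + margin < vertex_dist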

mhusseinsh commented 5 years ago

@marcgpuig Is this what you mean? Something like this: if abs(self._measurements.player_measurements.transform.location.z - agent.vehicle.transform.location.z) <= 1

mhusseinsh commented 5 years ago

@marcgpuig And sorry again, what is this vertex that you are referring to?

mhusseinsh commented 5 years ago

@marcgpuig OK then, just to keep track of what you said: I already have the 3D position of each point of the bounding box (in the camera plane), which again means 4 of these 8 points of the 3D bounding box:

[[205.23460105  60.24636716  40.84817564]
 [199.86540798  60.24637574  40.84817564]
 [205.23459813  58.41365844  40.84817564]
 [199.86540505  58.41366702  40.84817564]
 [205.23460105  60.24636716  39.43307614]
 [199.86540798  60.24637574  39.43307614]
 [205.23459813  58.41365844  39.43307614]
 [199.86540505  58.41366702  39.43307614]]

so let's assume that the 4 coordinates of the BB are

[[205.23460105  60.24636716  40.84817564]
 [199.86540798  60.24637574  40.84817564]
 [205.23459813  58.41365844  40.84817564]
 [199.86540505  58.41366702  40.84817564]]

So what you said is that I take the depth image and transform it into a point cloud (3D coordinates in the camera frame, not world coordinates),

and then, for each of the 4 pixels, I compare the 3D coordinates I had in the beginning with the 3D coordinates I got from the point cloud; if they are exactly equal, then this is an object which has to be taken into consideration. Is this what you mean?

mhusseinsh commented 5 years ago

@mhusseinsh I have no time to test this for you, sorry :(

I know for sure that you are very busy, and I appreciate that, but I never asked you to test anything at all; I just need to understand it and I will do it on my own, so just point me to what you have in mind, nothing more. Concerning the first approach, I still have doubts about it, as written in my previous comments, so if you could clear up these doubts it would be perfect.

Also, concerning the second approach you mentioned, please confirm whether I understood it or not. To get the point cloud for a single pixel: I checked the function and it is quite complicated to see how to do it for a single pixel, so I assume that using depth_to_local_point_cloud() I would take the returned array and search within it, but I don't think this is a smart/efficient way. Help would be really appreciated; I have been stuck on this for more than 3 weeks now.

mhusseinsh commented 5 years ago

@marcgpuig Following your description of the second approach, I tried the following:

    # intrinsic matrix for a 90 degree FOV camera
    k = np.identity(3)
    k[0, 2] = WINDOW_WIDTH_HALF
    k[1, 2] = WINDOW_HEIGHT_HALF
    k[0, 0] = k[1, 1] = WINDOW_WIDTH / \
        (2.0 * math.tan(90.0 * math.pi / 360.0))

    far = 1000.0
    # top, left are y_min and x_min (one corner of the 4 corners of the BB);
    # the homogeneous pixel vector is (x, y, 1), i.e. (left, top, 1)
    point1 = [left, top, 1]
    p3d = np.dot(np.linalg.inv(k), point1)
    # depth_labels is the output array from depth_to_array(image)
    normalized_depth = depth_labels[top, left]
    p3d *= normalized_depth * far

By that, p3d is a 3-element array. If this is what you mean, then what should I do next?
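(My guess at the next step, just to check whether I got the idea: compare that reconstructed point with the camera-space position of the vertex, something like the following, where transformed_3d_pos is the camera-space vertex from the bounding-box code?)

import numpy as np

scene_dist = np.linalg.norm(p3d)                                             # camera -> surface seen at (top, left)
vertex_dist = np.linalg.norm(np.asarray(transformed_3d_pos).flatten()[:3])   # camera -> bbox vertex
occluded = scene_dist + 0.5 < vertex_dist                                    # 0.5 m tolerance, to be tuned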

YuqiShen commented 5 years ago

Hi @mhusseinsh, did you solve the problem of the depth filter? Could you share the code for that? Thank you very much!

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ThierryDubois commented 5 years ago

What I am struggling with now is that I managed to get 2D bounding boxes around objects in the scene (somehow); I only have problems with occluded objects.

I am trying to obtain 2D bounding boxes on 0.9.5. Do you have some code snippets/references/tips I could use to get those 2D boxes? I have not seen any update on this subject for a while. Do you suggest I use 0.8.4 instead? I will keep reading about this in the meantime. Thanks a lot for your help.

marcgpuig commented 5 years ago

Hi @ThierryDubois Please, take a look at PythonAPI/examples/client_bounding_boxes.py. Hope it helps :)

ThierryDubois commented 5 years ago

Hi @ThierryDubois Please, take a look at PythonAPI/examples/client_bounding_boxes.py. Hope it helps :)

Hi, thanks for the link! Is this intended for the latest release or the stable version?

marcgpuig commented 5 years ago

@ThierryDubois This is for the latest release. Also, I don't recommend using 0.8.x any more.

ThierryDubois commented 5 years ago

@marcgpuig Alright thanks a lot!

Edit: By the way, I had to copy/paste the find-carla-module import code from manual_control.py, because otherwise the script could not find the carla module.

marcgpuig commented 5 years ago

@ThierryDubois Oops! Thanks for the report, I'll take a look at it and will fix it ASAP. Cheers!

Vin1291 commented 3 years ago

@mhusseinsh Can you share the code for the depth filtering? I am not able to filter out the objects.