allenai / ai2thor

An open-source platform for Visual AI.
http://ai2thor.allenai.org
Apache License 2.0

Some room sizes do not match top-down view #1181

Closed Thomahawkuru closed 4 months ago

Thomahawkuru commented 5 months ago

Some context: Our aim is to create an interactive map from the top-down image, where a user can click on a reachable position within the room to move the robot there. For this we need an accurate mapping from pixels of the top-down image to room coordinates. The first step is to accurately draw the room's bounding box using metadata such as 'sceneBounds', 'center', 'cornerPoints', etc., and to determine a pixel-to-position factor from the room size and the bounding box size.

To achieve this I wrote a script that outputs the 2D top-down image, for example at 1000x1000 pixels, where the largest dimension of the room corresponds to 1000 pixels. Using the size information in the room metadata, I then determine the pixel-to-coordinate scale factor by dividing 1000 pixels by the largest room dimension. Using this factor, I draw a bounding box of the room, based on the corner points and room center metadata, as well as the reachable positions, onto the top-down image.
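To make the scale-factor idea concrete, here is a minimal sketch with made-up room dimensions. The function name `scale_and_margins` and the 8 x 5 room are purely illustrative and do not come from any house in the dataset.

```python
def scale_and_margins(size_x, size_z, pixels=1000):
    """Hypothetical helper: fit the larger room dimension to `pixels`.

    Returns the pixels-per-scene-unit factor and the padding (in pixels)
    needed on each side of the shorter axis to center the room in a
    square image.
    """
    largest = max(size_x, size_z)
    factor = pixels / largest
    margin_x = (largest - size_x) / 2 * factor
    margin_z = (largest - size_z) / 2 * factor
    return factor, (margin_x, margin_z)


# Illustrative numbers only: a room that is 8 units wide and 5 units deep.
factor, margins = scale_and_margins(8.0, 5.0)
print(factor)   # 125.0 pixels per scene unit
print(margins)  # (0.0, 187.5) pixels of padding along x and z
```

Note that this only gives a correct mapping if the image really spans exactly the reported room size, which is the assumption being questioned in this issue.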

Related Code

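import copy

import matplotlib.pyplot as plt
from ai2thor.controller import Controller
from PIL import Image

# Side length of the square top-down image, in pixels
pixels = 1000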

def get_top_down_frame(controller: Controller):
    # Setup the top-down camera
    event = controller.step(action="GetMapViewCameraProperties", raise_for_failure=True)
    pose = copy.deepcopy(event.metadata["actionReturn"])
    pose["orthographic"] = True

    # add the camera to the scene
    event = controller.step(
        action="AddThirdPartyCamera",
        **pose,
        skyboxColor="white",
        raise_for_failure=True,
    )
    top_down_frame = event.third_party_camera_frames[-1]
    return Image.fromarray(top_down_frame)

def make_square(center, size, xs, zs):
    offset = {'x': 0, 'z':0}

    # Scale so that the largest horizontal dimension spans `pixels`, and
    # pad the shorter axis so the box is centered in the square image.
    # Using else (not elif) keeps f defined when size['x'] == size['z'].
    if size['x'] >= size['z']:
        f = pixels/(size['x'])
        offset['z'] = (size['x']-size['z'])/2
    else:
        f = pixels/(size['z'])
        offset['x'] = (size['z']-size['x'])/2
    print('scale: ', f)
    print('offset: ', offset)

    # Corners of the bounding box as a closed polygon, shifted by the
    # offset and converted to pixels.
    x0, x1 = offset['x'], size['x'] + offset['x']
    z0, z1 = offset['z'], size['z'] + offset['z']
    x = [a*f for a in [x0, x1, x1, x0, x0]]
    z = [b*f for b in [z0, z0, z1, z1, z0]]

    xs_scaled = [(x-size['x']/2)*f + pixels/2 for x in xs]
    zs_scaled = [-(z-size['z']/2)*f + pixels/2 for z in zs]

    # Return the scaled positions (the original returned the unscaled xs, zs)
    return [x, z], [xs_scaled, zs_scaled], f, offset

def plot_bounding_box_on_top_down(controller: Controller):
    top_down_image = get_top_down_frame(controller)
    event = controller.step(action="GetReachablePositions")
    reachable_positions = event.metadata["actionReturn"]
    xr = [rp["x"] for rp in reachable_positions]
    zr = [rp["z"] for rp in reachable_positions]

    #get scene bounds
    size = controller.last_event.metadata['sceneBounds']['size']
    center = controller.last_event.metadata['sceneBounds']['center']
    cornerpoints = controller.last_event.metadata['sceneBounds']['cornerPoints']

    # Compute the bounding box and the scaled reachable positions in pixel coordinates
    sq, pos, f, o = make_square(center, size, xr, zr)

    fig1 = plt.figure(figsize=(8, 8))
    plt.imshow(top_down_image)
    plt.plot(sq[0], sq[1], 'b-', label='walls')
    plt.xlabel("$x$")
    plt.ylabel("$z$")
    plt.title("Bounding box on Top-Down View")
    plt.axis('equal')
    plt.legend()

    return fig1

The issue

For some of the procthor-10k houses, the bounding box and the corresponding reachable positions align perfectly with the room's top-down image. For most of the rooms, however, the bounding box does not align and seems to have different proportions than the top_down_image. It seems that the windows are sometimes included in the scene bounds and sometimes not. Sometimes the difference is small, but other times the room size is way off. To correctly calculate the pixel-to-position factor for the reachable positions on the top-down view, a correct bounding box is necessary.

Any ideas on why this metadata seems inconsistent? It would also help to have some information on how the sceneBounds metadata property is determined for each of the houses. Are window frames and door knobs included, for example? Other suggestions for achieving the mapping from image to room positions in another way are of course also welcome.

Some examples

rooms=[1327, 5127, 5364, 5878, 8413]

Thomahawkuru commented 4 months ago

Solved. Found the correct way of doing this in another issue: #124

Thanks to @Lucaweihs for the class ThorPositionTo2DFrameTranslator(object), which correctly relates pixels to coordinates for drawing on the top-down view.
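
For anyone landing here later, the general idea can be sketched as follows. This is not the code from the linked issue: the class name `TopDownPixelMapper` and its parameters are hypothetical. It assumes an orthographic camera looking straight down, a square view whose half-extent equals the camera's orthographic size, image right pointing along +x and image top along +z. The camera position and orthographic size are assumed to be available from the pose returned by GetMapViewCameraProperties, rather than derived from sceneBounds.

```python
import numpy as np


class TopDownPixelMapper:
    """Hypothetical sketch: map between world (x, z) and image (row, col).

    Assumes a top-down orthographic camera centered at (cam_x, cam_z)
    that sees a square of side 2 * orth_size in world units.
    """

    def __init__(self, frame_rows, frame_cols, cam_x, cam_z, orth_size):
        self.rows = frame_rows
        self.cols = frame_cols
        # World coordinates of the image's lower-left corner
        self.lower_left = np.array([cam_x, cam_z], dtype=float) - orth_size
        # World-space width (and height) covered by the image
        self.span = 2.0 * orth_size

    def world_to_pixel(self, x, z):
        # Normalize to [0, 1] across the visible square
        u, v = (np.array([x, z], dtype=float) - self.lower_left) / self.span
        # Image rows grow downward, so flip the z axis
        row = int(round(self.rows * (1.0 - v)))
        col = int(round(self.cols * u))
        return row, col

    def pixel_to_world(self, row, col):
        u = col / self.cols
        v = 1.0 - row / self.rows
        x = self.lower_left[0] + u * self.span
        z = self.lower_left[1] + v * self.span
        return float(x), float(z)


# Illustrative values only: camera above (5, 5) with orthographic size 5
mapper = TopDownPixelMapper(1000, 1000, cam_x=5.0, cam_z=5.0, orth_size=5.0)
center_px = mapper.world_to_pixel(5.0, 5.0)
center_world = mapper.pixel_to_world(500, 500)
```

The inverse direction (`pixel_to_world`) is what a click-to-move map would need; the clicked point would presumably still have to be snapped to the nearest reachable position before moving the robot.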