apple / ml-hypersim

Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding

Instance segmentation masks #54

Closed. DianCh closed this issue 1 year ago.

DianCh commented 1 year ago

Hi! Thank you for releasing this wonderful dataset. The intro of the README says this dataset "includes dense per-pixel semantic instance segmentations and complete camera information for every image", but later sections indicate that we need to install the low-level/high-level toolkit to render masks from meshes ourselves. Before heading straight into installing a bunch of tools, could I ask: are the rendered instance segmentations ready for use in the downloaded files?

Thank you very much!

mikeroberts3000 commented 1 year ago

Hi! Great question, and thank you for your kind words.

I don't believe our README indicates that it is necessary to run our code to obtain segmentation images. We provide our code in case someone wants to render their own custom images (e.g., higher resolutions, different camera poses, custom camera parameters, etc.), but you don't need to run any of it if you just want to work with our pre-rendered images and segmentation masks.

We describe where to find semantic segmentation images and instance segmentation images in the Working with the Hypersim Dataset section of the README (quoted below). The frame.IIII.semantic.hdf5 and frame.IIII.semantic_instance.hdf5 files corresponding to each image contain our semantic and instance segmentation masks. We also provide color visualizations of the segmentation masks in our scene_cam_XX_geometry_preview directories, but in downstream applications, we recommend using the HDF5 files because they store the integer segmentation IDs directly.

I hope that helps!

The Hypersim Dataset consists of a collection of synthetic scenes. Each scene has a name of the form ai_VVV_NNN where VVV is the volume number, and NNN is the scene number within the volume. For each scene, there are one or more camera trajectories named {cam_00, cam_01, ...}. Each camera trajectory has one or more images named {frame.0000, frame.0001, ...}. Each scene is stored in its own ZIP file according to the following data layout:

ai_VVV_NNN
├── _detail
│   ├── metadata_cameras.csv                     # list of all the camera trajectories for this scene
│   ├── metadata_node_strings.csv                # all human-readable strings in the definition of each V-Ray node
│   ├── metadata_nodes.csv                       # establishes a correspondence between the object names in an exported OBJ file, and the V-Ray node IDs that are stored in our render_entity_id images
│   ├── metadata_scene.csv                       # includes the scale factor to convert asset units into meters
│   ├── cam_XX                                   # camera trajectory information
│   │   ├── camera_keyframe_orientations.hdf5    # camera orientations
│   │   └── camera_keyframe_positions.hdf5       # camera positions (in asset coordinates)
│   ├── ...
│   └── mesh                                                                            # mesh information
│       ├── mesh_objects_si.hdf5                                                        # NYU40 semantic label for each object ID (available in our public code repository)
│       ├── mesh_objects_sii.hdf5                                                       # semantic instance ID for each object ID (available in our public code repository)
│       ├── metadata_objects.csv                                                        # object name for each object ID (available in our public code repository)
│       ├── metadata_scene_annotation_tool.log                                          # log of the time spent annotating each scene (available in our public code repository)
│       ├── metadata_semantic_instance_bounding_box_object_aligned_2d_extents.hdf5      # length (in asset units) of each dimension of the 3D bounding box for each semantic instance ID
│       ├── metadata_semantic_instance_bounding_box_object_aligned_2d_orientations.hdf5 # orientation of the 3D bounding box for each semantic instance ID
│       └── metadata_semantic_instance_bounding_box_object_aligned_2d_positions.hdf5    # position (in asset coordinates) of the 3D bounding box for each semantic instance ID
└── images
    ├── scene_cam_XX_final_hdf5                  # lossless HDR image data that requires accurate shading
    │   ├── frame.IIII.color.hdf5                # color image before any tone mapping has been applied
    │   ├── frame.IIII.diffuse_illumination.hdf5 # diffuse illumination
    │   ├── frame.IIII.diffuse_reflectance.hdf5  # diffuse reflectance (many authors refer to this modality as "albedo")
    │   ├── frame.IIII.residual.hdf5             # non-diffuse residual
    │   └── ...
    ├── scene_cam_XX_final_preview               # preview images
    │   └── ...
    ├── scene_cam_XX_geometry_hdf5               # lossless HDR image data that does not require accurate shading
    │   ├── frame.IIII.depth_meters.hdf5         # Euclidean distances (in meters) to the optical center of the camera
    │   ├── frame.IIII.position.hdf5             # world-space positions (in asset coordinates)
    │   ├── frame.IIII.normal_cam.hdf5           # surface normals in camera-space (ignores bump mapping)
    │   ├── frame.IIII.normal_world.hdf5         # surface normals in world-space (ignores bump mapping)
    │   ├── frame.IIII.normal_bump_cam.hdf5      # surface normals in camera-space (takes bump mapping into account)
    │   ├── frame.IIII.normal_bump_world.hdf5    # surface normals in world-space (takes bump mapping into account)
    │   ├── frame.IIII.render_entity_id.hdf5     # fine-grained segmentation where each V-Ray node has a unique ID
    │   ├── frame.IIII.semantic.hdf5             # NYU40 semantic labels
    │   ├── frame.IIII.semantic_instance.hdf5    # semantic instance IDs
    │   ├── frame.IIII.tex_coord.hdf5            # texture coordinates
    │   └── ...
    ├── scene_cam_XX_geometry_preview            # preview images
    │   └── ...
    └── ...
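
For readers who just want the pre-rendered masks, here is a minimal Python sketch of how the layout above could be used to locate and read the segmentation files for one frame. The helper names (`segmentation_paths`, `load_mask`) are hypothetical and not part of the Hypersim toolkit. The HDF5 key `"dataset"` is an assumption that is not stated in this thread, so please check it against your downloaded files (for example by listing the keys of the opened file). Reading the files requires the third-party `h5py` package.

```python
from pathlib import Path


def segmentation_paths(scene_dir, cam="cam_00", frame=0):
    """Build the paths to the semantic and instance masks for one frame.

    Follows the layout quoted above:
    ai_VVV_NNN/images/scene_cam_XX_geometry_hdf5/frame.IIII.<modality>.hdf5
    """
    geometry_dir = Path(scene_dir) / "images" / f"scene_{cam}_geometry_hdf5"
    stem = f"frame.{frame:04d}"  # IIII is a zero-padded 4-digit frame index
    return {
        "semantic": geometry_dir / f"{stem}.semantic.hdf5",
        "semantic_instance": geometry_dir / f"{stem}.semantic_instance.hdf5",
    }


def load_mask(path, key="dataset"):
    """Read one HDF5 image into a NumPy array.

    ASSUMPTION: the array is stored under the key "dataset"; verify this
    against your own files before relying on it.
    """
    import h5py  # third-party; imported lazily so path handling works without it

    with h5py.File(path, "r") as f:
        return f[key][:]
```

For example, `segmentation_paths("ai_001_001", "cam_00", 3)["semantic_instance"]` points at `ai_001_001/images/scene_cam_00_geometry_hdf5/frame.0003.semantic_instance.hdf5` (the scene name here is only illustrative), and passing that path to `load_mask` should give a per-pixel array of integer instance IDs.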