Ayça Takmaz1*,
Elisabetta Fedele1*,
Robert W. Sumner1,
Marc Pollefeys1,2,
Federico Tombari1,3,
Francis Engelmann1,3
1ETH Zurich,
2Microsoft,
3Google
*equal contribution
OpenMask3D is a zero-shot approach for 3D instance segmentation with open-vocabulary queries. Guided by predicted class-agnostic 3D instance masks, our model aggregates per-mask features via multi-view fusion of CLIP-based image embeddings.
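Conceptually, the per-mask feature computation reduces to averaging CLIP embeddings of a mask's 2D crops across the views in which the mask is visible. Below is a minimal numpy sketch of that fusion step; it is an illustration, not the actual implementation, and `clip_embed_crop` is a placeholder for a CLIP image encoder:

```python
import numpy as np

def aggregate_mask_features(crops_per_mask, clip_embed_crop):
    """For each 3D instance mask, average the CLIP embeddings of its
    2D crops across all views in which the mask is visible."""
    features = []
    for crops in crops_per_mask:                              # crops of one mask
        embs = np.stack([clip_embed_crop(c) for c in crops])  # (num_views, dim)
        embs /= np.linalg.norm(embs, axis=1, keepdims=True)   # unit-normalize
        features.append(embs.mean(axis=0))                    # multi-view fusion
    return np.stack(features)                                 # (num_masks, dim)
```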
Clone the repository, create a conda environment, and install the required packages as follows:
```bash
conda create --name=openmask3d python=3.8.5  # create a new virtual environment
conda activate openmask3d                    # activate it
bash install_requirements.sh                 # install requirements
pip install -e .                             # install the current repository in editable mode
```
Note: If you encounter any issues in the `bash install_requirements.sh` step, we recommend running the commands in that script one by one, in particular performing the MinkowskiEngine installation manually.
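After installation, a quick sanity check is to import the two dependencies that most often cause trouble (assuming the environment above is activated):

```bash
# verify that PyTorch and MinkowskiEngine were installed correctly
python -c "import torch, MinkowskiEngine as ME; print(torch.__version__, ME.__version__)"
```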
In this section we provide some information about how to run the pipeline on a single scene. In particular, we divide this section into four parts:
Create a folder `resources` in the main directory of the repository. Then, add to this folder the checkpoints for the mask module network and for the Segment Anything Model (SAM).
In order to run OpenMask3D you need access to the point cloud of the scene as well as to the posed RGB-D frames.
We recommend creating a folder `scene_example` inside the `resources` folder and storing the data with the following structure (here we provide a scene as an example).
```
scene_example
├── pose                      <- folder with camera poses
│   ├── 0.txt
│   ├── 1.txt
│   └── ...
├── color                     <- folder with RGB images
│   ├── 0.jpg (or .png/.jpeg)
│   ├── 1.jpg (or .png/.jpeg)
│   └── ...
├── depth                     <- folder with depth images
│   ├── 0.png (or .jpg/.jpeg)
│   ├── 1.png (or .jpg/.jpeg)
│   └── ...
├── intrinsic
│   └── intrinsic_color.txt   <- camera intrinsics
└── scene_example.ply         <- point cloud of the scene
```
Please note the following:

- The point cloud should be provided as a `.ply` file, and the points are expected to be in the z-up right-handed coordinate system.
- The camera intrinsics and camera poses should each be provided in a `.txt` file containing a 4x4 matrix (see the projection sketch after this list).
- The RGB and depth images can be in `.png`, `.jpg`, or `.jpeg` format; the format used should be specified as explained in Step 3.
- Frames should be named `{FRAME_ID}.extension`, without zero padding for the frame ID, starting from index 0.
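To illustrate these conventions, the following sketch loads a pose and the intrinsics and projects world-space points into a frame. It assumes, as in ScanNet, that the pose is a camera-to-world matrix; verify this convention for your own data.

```python
import numpy as np

# both files contain a plain-text 4x4 matrix
pose = np.loadtxt("resources/scene_example/pose/0.txt")    # camera-to-world, (4, 4)
K = np.loadtxt("resources/scene_example/intrinsic/intrinsic_color.txt")[:3, :3]

points = np.random.rand(100, 3)         # dummy z-up world-space points
world_to_cam = np.linalg.inv(pose)      # invert the camera-to-world pose
pts_cam = world_to_cam[:3, :3] @ points.T + world_to_cam[:3, 3:4]  # (3, N)
pix = K @ pts_cam
pix = pix[:2] / pix[2]                  # perspective divide -> pixel coordinates
```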
Before running OpenMask3D, make sure to fill in all the required parameters in this script. In particular, if you have followed the structure provided in Step 2, you should adapt only the following fields (an example configuration is sketched after this list):

- `SCENE_DIR`: directory of `scene_example`
- `SCENE_INTRINSIC_RESOLUTION`: resolution at which the intrinsics are computed
- `IMG_EXTENSION`: extension of the RGB images; either `.png`, `.jpg`, or `.jpeg`
- `DEPTH_EXTENSION`: extension of the depth images; either `.png`, `.jpg`, or `.jpeg`
- `DEPTH_SCALE`: factor by which the sensor depth should be divided to obtain a measurement in meters. Set it to 1000 for ScanNet depth images and to 6553.5 for Replica depth images; in general, choose this value based on the scale of your depth maps.
- `MASK_MODULE_CKPT_PATH`: path to the mask module network checkpoint
- `SAM_CKPT_PATH`: path to the Segment Anything Model (SAM) checkpoint
- `OUTPUT_FOLDER_DIRECTORY`: path to the folder in which you wish to save the outputs
- `SAVE_VISUALIZATIONS`: set to true if you wish to save visualizations of the class-agnostic masks
- `SAVE_CROPS`: set to true if you wish to save the 2D crops of the masks from which the CLIP features are extracted; this can be helpful for debugging and for qualitatively evaluating the masks
- `OPTIMIZE_GPU_USAGE`: set to true if you have memory constraints and wish to minimize the GPU memory footprint; note that this mode is slower than the default
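For illustration, a configuration matching the Step 2 layout might look as follows; all values, including the checkpoint filenames, are examples rather than defaults, so adjust them to your data:

```bash
# example values inside run_openmask3d_single_scene.sh (illustrative, not defaults)
SCENE_DIR="$(pwd)/resources/scene_example"
SCENE_INTRINSIC_RESOLUTION="[968,1296]"   # resolution at which the intrinsics were computed
IMG_EXTENSION=".jpg"
DEPTH_EXTENSION=".png"
DEPTH_SCALE=1000                          # ScanNet-style depth stored in millimeters
MASK_MODULE_CKPT_PATH="$(pwd)/resources/mask_module_checkpoint.ckpt"  # hypothetical filename
SAM_CKPT_PATH="$(pwd)/resources/sam_vit_h_4b8939.pth"
OUTPUT_FOLDER_DIRECTORY="$(pwd)/output"
SAVE_VISUALIZATIONS=true
SAVE_CROPS=false
OPTIMIZE_GPU_USAGE=false
```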
Now you can run OpenMask3D using the following command:

```bash
bash run_openmask3d_single_scene.sh
```
This script first extracts and saves the class-agnostic masks, and then computes the per-mask features. Masks and mask-features are saved into the directory specified by the user at the beginning of this script. In particular, the output has the following structure.
```
OUTPUT_FOLDER_DIRECTORY
└── date-time-experiment_name                  <- output of a specific experiment
    ├── crops                                  <- 2D crops (if SAVE_CROPS=true)
    ├── hydra_outputs                          <- hydra outputs (the config.yaml files are useful)
    ├── scene_example_masks.pt                 <- class-agnostic instance masks, dim. (num_points, num_masks), indicating the masks in which a given point is included
    └── scene_example_openmask3d_features.npy  <- per-mask features, dim. (num_masks, num_features), one feature vector per instance mask
```
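Once saved, these outputs can be used for open-vocabulary querying. Below is a minimal sketch; it assumes the features were extracted with CLIP ViT-L/14@336px, so adapt the model name to whatever produced your features.

```python
import numpy as np
import torch
import clip  # OpenAI CLIP package

masks = torch.load("scene_example_masks.pt")                 # (num_points, num_masks)
features = np.load("scene_example_openmask3d_features.npy")  # (num_masks, num_features)

# embed a free-form text query with the same CLIP model used for the mask features
model, _ = clip.load("ViT-L/14@336px", device="cpu")
with torch.no_grad():
    text = model.encode_text(clip.tokenize(["a sofa"])).float().numpy()  # (1, num_features)

# rank masks by cosine similarity to the query
feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
query = text / np.linalg.norm(text)
scores = (feats @ query.T).squeeze()          # (num_masks,)
best_mask = masks[:, scores.argmax()]         # per-point membership of the best match
```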
Note: For the ScanNet validation, we use the segments available in ScanNet and obtain more robust and less noisy masks than when running the mask predictor directly on the point cloud. Therefore, the masks obtained for a single ScanNet scene directly from the point cloud can differ from those obtained during the overall ScanNet evaluation described in the section below.
Other configuration parameters can be modified in this file. Here we provide some clarifications:

- `multi_level_expansion_ratio`: factor by which the crop dimension is enlarged when using multi-level image crops (see the sketch after this list)
- `openmask3d.frequency`: the frequency with which the input frames are processed (e.g. a frequency of 10 takes 1 image out of every 10 frames)
- `openmask3d.num_random_rounds` and `openmask3d.num_selected_points`: the number of iterations and the number of sampled points for SAM
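To make `multi_level_expansion_ratio` concrete, here is a sketch of how progressively enlarged crops can be derived from an initial mask bounding box; this illustrates the idea, and is not the repository's exact code:

```python
def multi_level_boxes(x0, y0, x1, y1, expansion_ratio=0.1, num_levels=3):
    """Return progressively enlarged crop boxes around an initial bounding box."""
    w, h = x1 - x0, y1 - y0
    boxes = []
    for level in range(num_levels):
        dx, dy = w * expansion_ratio * level, h * expansion_ratio * level
        boxes.append((x0 - dx, y0 - dy, x1 + dx, y1 + dy))  # level 0 = original box
    return boxes
```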
In this section we outline the steps to reproduce our results on the ScanNet200 validation set. In particular, we divide this section into four parts.

First, you need to download the ScanNet200 dataset as explained here.
Once you have the dataset, clone the ScanNet repository and preprocess the dataset using the following command:
```bash
cd class_agnostic_mask_computation
python -m datasets.preprocessing.scannet_preprocessing preprocess \
    --data_dir="PATH_TO_ORIGINAL_SCANNET_DATASET" \
    --save_dir="data/processed/scannet" \
    --git_repo="PATH_TO_SCANNET_GIT_REPO" \
    --scannet200=true
```
Make sure the processed data has the following structure.
```
scans                                    <- output folder
├── scene0011_00
│   ├── data
│   │   ├── intrinsic                    <- folder with the intrinsics
│   │   └── pose                         <- folder with the poses
│   ├── data_compressed
│   │   ├── color                        <- folder with the color images
│   │   └── depth                        <- folder with the depth images
│   └── scene0011_00_vh_clean_2.ply      <- point cloud/mesh of the scene
├── scene0011_01
│   ├── data
│   │   ├── intrinsic
│   │   └── pose
│   ├── data_compressed
│   │   ├── color
│   │   └── depth
│   └── scene0011_01_vh_clean_2.ply
└── ...
```
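A small sanity check over the processed layout can catch missing folders early. This is a sketch; the paths assume the structure shown above:

```python
from pathlib import Path

scans = Path("scans")
required = ["data/intrinsic", "data/pose", "data_compressed/color", "data_compressed/depth"]
for scene in sorted(scans.iterdir()):
    if not scene.is_dir():
        continue
    for sub in required:
        assert (scene / sub).is_dir(), f"missing {sub} in {scene.name}"
    assert any(scene.glob("*_vh_clean_2.ply")), f"missing point cloud in {scene.name}"
```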
Modify the paths and parameters in this script, following the instructions provided there.
Now you can compute the per-mask scene features and run the evaluation of OpenMask3D on the whole ScanNet200 dataset by using the following command:
```bash
bash run_openmask3d_scannet200_eval.sh
```
This script first extracts and saves the class-agnostic masks, and then computes the mask features associated with each extracted mask. Afterwards, the evaluation script automatically runs in order to obtain 3D closed-vocabulary semantic instance segmentation scores.
```bibtex
@inproceedings{takmaz2023openmask3d,
  title={{OpenMask3D: Open-Vocabulary 3D Instance Segmentation}},
  author={Takmaz, Ay{\c{c}}a and Fedele, Elisabetta and Sumner, Robert W. and Pollefeys, Marc and Tombari, Federico and Engelmann, Francis},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2023}
}
```