D3Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation (CoRL 2024)


Website: https://robopil.github.io/d3fields/ | Paper: https://arxiv.org/abs/2309.16118 | Colab | Doc


Yixuan Wang¹, Zhuoran Li²·³, Mingtong Zhang¹, Katherine Driggs-Campbell¹, Jiajun Wu², Li Fei-Fei², Yunzhu Li¹·²

¹University of Illinois Urbana-Champaign, ²Stanford University, ³National University of Singapore

https://github.com/WangYixuan12/d3fields/assets/32333199/a3fced3d-e827-4e7e-ad6a-e80889809fca

Try it in Colab!

In this notebook, we show how to build D3Fields and visualize the reconstructed mesh, mask fields, and descriptor fields. We also demonstrate how to track keypoints in a video.

Installation

We recommend Mambaforge over the standard Anaconda distribution for faster installation:

# create conda environment
mamba env create -f env.yaml
conda activate d3fields

# download pretrained models and data
bash scripts/download_ckpts.sh
bash scripts/download_data.sh

Visualization

python vis_repr.py # visualize the representation
python vis_tracking.py # visualize the tracking

Code Explanation

Fusion is the core class of D3Fields. Given multi-view RGB-D observations, it projects arbitrary 3D points into each view and fuses features from 2D foundation models into a unified 3D descriptor field that can be queried at those points.
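
The snippet below is a hypothetical usage sketch rather than the repo's exact API: the constructor arguments, observation keys, and method names (update, eval) are assumptions for illustration. See vis_repr.py for the real entry point.

import numpy as np

from fusion import Fusion  # core D3Fields class (module path assumed)

# Hypothetical workflow sketch; exact signatures may differ.
num_cam, H, W = 4, 480, 640
fusion = Fusion(num_cam=num_cam, device="cuda")  # arguments are assumptions

# Pack per-camera inputs matching the dataset layout described below:
# color/depth images, intrinsics, and world-to-camera extrinsics.
obs = {
    "color": np.zeros((num_cam, H, W, 3), dtype=np.uint8),
    "depth": np.zeros((num_cam, H, W), dtype=np.float32),  # meters
    "K": np.tile(np.eye(3, dtype=np.float32), (num_cam, 1, 1)),
    "pose": np.tile(np.eye(4, dtype=np.float32), (num_cam, 1, 1)),
}
fusion.update(obs)  # fuse 2D foundation-model features into 3D (method name assumed)

# Query the field at arbitrary 3D points for fused descriptors.
pts = np.random.uniform(-0.4, 0.4, size=(1024, 3)).astype(np.float32)
feats = fusion.eval(pts)  # per-point descriptor values (return format assumed)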

Customized Dataset

To run D3Fields on your own dataset, follow these steps:

  1. Prepare your dataset in the following structure (a sketch for writing the two camera files appears after this list):
    dataset_name
    ├── camera_0
    │   ├── color
    │   │   ├── 0.png
    │   │   ├── 1.png
    │   │   ├── ...
    │   ├── depth
    │   │   ├── 0.png
    │   │   ├── 1.png
    │   │   ├── ...
    │   ├── camera_extrinsics.npy
    │   ├── camera_params.npy
    ├── camera_1
    ├── ...

    camera_extrinsics.npy and camera_params.npy are defined as follows:

    camera_extrinsics.npy: (4, 4) numpy array, the camera extrinsics, which transform a point from world coordinates to camera coordinates
    camera_params.npy: (4,) numpy array, the camera parameters in the following order: fx, fy, cx, cy
  2. Prepare the PCA pickle file for the query texts. Find four images of the query texts (e.g. mug) with clean backgrounds and centered objects. Change obj_type within scripts/prepare_pca.py and run it (a rough sketch of this recipe appears after this list).
  3. Specify the workspace boundary as x_lower, x_upper, y_lower, y_upper, z_lower, z_upper.
  4. Run python vis_repr_custom.py, for example: python vis_repr_custom.py --data_path data/2023-09-15-13-21-56-171587 --pca_path pca_model/mug.pkl --query_texts mug --query_thresholds 0.3 --x_lower -0.4 --x_upper 0.4 --y_upper 0.3 --y_lower -0.4 --z_upper 0.02 --z_lower -0.2
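
For step 1, the two camera files can be written with plain NumPy. The shapes and conventions follow the definitions above; the directory path below is only a placeholder.

import os
import numpy as np

cam_dir = "dataset_name/camera_0"  # placeholder path
os.makedirs(cam_dir, exist_ok=True)

# (4, 4) world-to-camera transform: p_cam = R @ p_world + t.
extrinsics = np.eye(4, dtype=np.float32)
np.save(os.path.join(cam_dir, "camera_extrinsics.npy"), extrinsics)

# (4,) intrinsics in the order fx, fy, cx, cy (example values for a 640x480 image).
params = np.array([600.0, 600.0, 320.0, 240.0], dtype=np.float32)
np.save(os.path.join(cam_dir, "camera_params.npy"), params)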
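
For step 2, scripts/prepare_pca.py is the canonical implementation; the sketch below only illustrates the general recipe under stated assumptions (DINOv2 ViT-S/14 as the feature backbone, a 3-component scikit-learn PCA for visualization; the image paths are placeholders, and the pickle format may differ from what the repo expects).

import os
import pickle

import torch
from PIL import Image
from sklearn.decomposition import PCA
from torchvision import transforms

# Assumption: DINOv2 ViT-S/14 patch features as the descriptor backbone.
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),  # multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

feats = []
for path in ["mug_0.png", "mug_1.png", "mug_2.png", "mug_3.png"]:  # placeholder images
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        out = model.forward_features(img)["x_norm_patchtokens"]  # (1, N, C)
    feats.append(out.squeeze(0))

# Fit a 3-component PCA over all patch descriptors so features can be rendered as RGB.
pca = PCA(n_components=3).fit(torch.cat(feats).numpy())
os.makedirs("pca_model", exist_ok=True)
with open("pca_model/mug.pkl", "wb") as f:
    pickle.dump(pca, f)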


Citation

If you find this repo useful for your research, please consider citing the paper:

@article{wang2023d3fields,
    title={D$^3$Fields: Dynamic 3D Descriptor Fields for Zero-Shot Generalizable Robotic Manipulation},
    author={Wang, Yixuan and Li, Zhuoran and Zhang, Mingtong and Driggs-Campbell, Katherine and Wu, Jiajun and Fei-Fei, Li and Li, Yunzhu},
    journal={arXiv preprint arXiv:2309.16118},
    year={2023}
}