MarSaKi / VLN-BEVBert

[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for Language-guided Navigation"

BEVBert: Multimodal Map Pre-training for
Language-guided Navigation

Dong An; Yuankai Qi; Yangguang Li; Yan Huang; Liang Wang; Tieniu Tan; Jing Shao

Accepted to ICCV 2023

Paper

Abstract

Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent’s spatial understanding. Thus, we propose a new map-based pre-training paradigm that is spatial-aware for use in VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design can balance the demand of VLN for both short-term reasoning and long-term planning. Then, based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art on four VLN benchmarks (R2R, R2R-CE, RxR, REVERIE).
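
To make the "aggregate incomplete observations and remove duplicates" idea concrete, here is a toy sketch. It is not the code in this repository: the function name `aggregate_to_grid`, the 0.5 m cell size, and the plain per-cell averaging are all assumptions chosen for illustration. Observations that fall into the same grid cell are merged into one feature, so a spot seen from two views is stored only once.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]


def aggregate_to_grid(
    observations: Sequence[Tuple[float, float, Sequence[float]]],
    cell_size: float = 0.5,  # assumed cell size in metres, for illustration only
) -> Dict[Cell, List[float]]:
    """Average the feature vectors of all observations that fall in the same cell.

    Each observation is (x, y, feature), with x and y in metres relative to
    the agent. Two views of the same spot land in the same cell and are
    merged, which is how duplicates disappear in this toy version.
    """
    sums = {}  # cell -> running per-dimension feature sum
    counts = defaultdict(int)  # cell -> number of observations merged so far
    for x, y, feat in observations:
        cell = (int(x // cell_size), int(y // cell_size))
        if cell not in sums:
            sums[cell] = [0.0] * len(feat)
        sums[cell] = [s + f for s, f in zip(sums[cell], feat)]
        counts[cell] += 1
    return {cell: [s / counts[cell] for s in total] for cell, total in sums.items()}


obs = [
    (0.1, 0.2, [1.0, 0.0]),  # seen from the first view
    (0.3, 0.4, [0.0, 1.0]),  # same spot, seen again from a second view
    (1.2, 0.1, [2.0, 2.0]),  # a different spot
]
grid = aggregate_to_grid(obs)
print(grid)
```

The first two observations share a cell and are averaged, while the third keeps its own cell. The global topological map described in the abstract is not covered by this sketch.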

Method

TODOs

Setup

Installation

  1. Create a virtual environment. We develop this project with Python 3.6.

    conda env create -f environment.yaml
  2. Install the latest version of Matterport3DSimulator, including the Matterport3D RGBD datasets (needed for the grid feature preprocessing in step 6).
  3. Download the Matterport3D scene meshes. The download_mp.py script must be obtained from the Matterport3D project webpage; it is also used to download the RGBD datasets in step 2.

    # run with python 2.7
    python download_mp.py --task habitat -o data/scene_datasets/mp3d/
    # Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb
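
After extraction, it can be handy to confirm that every scene folder follows the `{scene}/{scene}.glb` layout given above. The helper below is a minimal sketch for that; `find_missing_meshes` is a hypothetical name and is not part of this repository.

```python
import os
from typing import List


def find_missing_meshes(root: str) -> List[str]:
    """Return scene directories under root that lack {scene}/{scene}.glb."""
    missing = []
    for scene in sorted(os.listdir(root)):
        scene_dir = os.path.join(root, scene)
        if not os.path.isdir(scene_dir):
            continue  # ignore stray files next to the scene folders
        if not os.path.isfile(os.path.join(scene_dir, scene + ".glb")):
            missing.append(scene)
    return missing


if __name__ == "__main__":
    root = "data/scene_datasets/mp3d"
    if os.path.isdir(root):
        print("scenes without a .glb mesh:", find_missing_meshes(root))
    else:
        print("not found:", root)
```

Run it from the repository root; an empty list means every scene folder contains its mesh.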

Follow the Habitat Installation Guide to install habitat-sim and habitat-lab. We use version v0.1.7 in our experiments. In brief:

  4. Install habitat-sim for a machine with multiple GPUs or without an attached display (i.e. a cluster):

    conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
  5. Clone habitat-lab from the GitHub repository and install it. The command below installs the core of Habitat Lab as well as habitat_baselines.

    git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
    cd habitat-lab
    python setup.py develop --all # install habitat and habitat_baselines
  6. Preprocess grid features for metric mapping (~100G).

    # for R2R, RxR, REVERIE
    python precompute_features/grid_mp3d_clip.py
    python precompute_features/grid_mp3d_imagenet.py
    python precompute_features/grid_depth.py
    python precompute_features/grid_sem.py

    # for R2R-CE pre-training
    python precompute_features/grid_habitat_clip.py
    python precompute_features/save_habitat_img.py --img_type depth
    python precompute_features/save_depth_feature.py
  7. Download the preprocessed instruction datasets and trained weights [link]. The directory structure is already organized. For R2R-CE experiments, follow ETPNav to configure the VLN-CE datasets in the bevbert_ce/data folder, and put the trained CE weights [link] in bevbert_ce/ckpt.

Good luck on your VLN journey with BEVBert!
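
Before moving on, a quick sanity check of the environment can save a failed run. The sketch below assumes the Habitat packages import as `habitat_sim`, `habitat` and `habitat_baselines`, and uses the ~100G figure from the grid feature preprocessing step as a rough disk-space threshold. The helper names are hypothetical and not part of this repository.

```python
import importlib.util
import shutil
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def free_gib(path: str = ".") -> float:
    """Return the free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2 ** 30


required = ["habitat_sim", "habitat", "habitat_baselines"]
print("missing modules:", missing_modules(required))
print("free space (GiB): %.1f" % free_gib("."))
if free_gib(".") < 100:
    print("warning: grid feature preprocessing is reported to need about 100G")
```

This only checks that the modules can be located; it does not verify that the installed versions are v0.1.7.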

Running

Pre-training. Download the precomputed image features [link] into the img_features folder.

CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/pt_r2r.bash 2333  # R2R
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/pt_rxr.bash 2333  # RxR
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/pt_rvr.bash 2333  # REVERIE

cd bevbert_ce/pretrain 
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_pt/run_r2r.bash 2333  # R2R-CE
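
The commands above do not say what the trailing `2333` argument means. Assuming it is the TCP port used to coordinate the multi-GPU processes (an assumption, so check the scripts to confirm), the sketch below tests whether that port is free before launching. `port_is_free` is a hypothetical helper.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # something else already holds the port
    return True


# 2333 is the value used in the commands above; its meaning is assumed here.
print("port 2333 free:", port_is_free(2333))
```

If the port is taken, for example by another training job on the same machine, passing a different number in place of `2333` is the likely remedy under the same assumption.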

Fine-tuning and testing. The trained weights are the ones downloaded in step 7 of the installation (preprocessed instruction datasets and trained weights).

CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/ft_r2r.bash 2333  # R2R
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/ft_rxr.bash 2333  # RxR
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/ft_rvr.bash 2333  # REVERIE

cd bevbert_ce
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_r2r/main.bash [train/eval/infer] 2333  # R2R-CE
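
When launching R2R-CE runs from a Python driver, the shell line above can be assembled programmatically. This is a minimal sketch, not something shipped with the repository: `build_r2r_ce_launch` is a hypothetical helper, and the only facts it relies on are the script path, the three modes and the `2333` argument shown above.

```python
import os
from typing import Dict, List, Sequence, Tuple

VALID_MODES = ("train", "eval", "infer")


def build_r2r_ce_launch(
    mode: str, gpus: Sequence[int], port: int = 2333
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) mirroring the R2R-CE command line shown above."""
    if mode not in VALID_MODES:
        raise ValueError("mode must be one of %s, got %r" % (VALID_MODES, mode))
    env = dict(os.environ)  # copy, so the caller's environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    argv = ["bash", "run_r2r/main.bash", mode, str(port)]
    return argv, env


argv, env = build_r2r_ce_launch("eval", [0, 1, 2, 3])
print(" ".join(argv), "| GPUs:", env["CUDA_VISIBLE_DEVICES"])
# To actually launch (from the repository root):
#   import subprocess
#   subprocess.run(argv, env=env, cwd="bevbert_ce", check=True)
```

Validating the mode up front catches a typo before any GPU memory is allocated.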

Contact Information

Acknowledgements

Our implementation is partially inspired by DUET, S-MapNet and ETPNav.

We thank the authors for open-sourcing their great work!

Citation

If you find this repository useful, please consider citing our paper:

@inproceedings{an2023bevbert,
  title={BEVBert: Multimodal Map Pre-training for Language-guided Navigation},
  author={An, Dong and Qi, Yuankai and Li, Yangguang and Huang, Yan and Wang, Liang and Tan, Tieniu and Shao, Jing},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}