Large-scale pre-training has shown promising results on the vision-and-language navigation (VLN) task. However, most existing pre-training methods employ discrete panoramas to learn visual-textual associations. This requires the model to implicitly correlate incomplete, duplicate observations within the panoramas, which may impair an agent’s spatial understanding. Thus, we propose a new map-based pre-training paradigm that is spatial-aware for use in VLN. Concretely, we build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. This hybrid design balances the demands of VLN for both short-term reasoning and long-term planning. Then, based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal. Extensive experiments demonstrate the effectiveness of the map-based pre-training route for VLN, and the proposed method achieves state-of-the-art results on four VLN benchmarks (R2R, R2R-CE, RxR, REVERIE).
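To make the idea of "aggregating incomplete observations and removing duplicates" more concrete, here is a toy sketch. It is not the BEVBert implementation: the function name `aggregate_to_grid`, the 2D point format, and the mean-pooling rule are all assumptions chosen for illustration. Observations that fall into the same grid cell, for example the same spot seen from two viewpoints, are merged into a single averaged feature.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Cell = Tuple[int, int]
Observation = Tuple[float, float, Sequence[float]]  # (x, y, feature vector)


def aggregate_to_grid(observations: Sequence[Observation],
                      cell_size: float) -> Dict[Cell, List[float]]:
    """Average the features of all observations that land in the same grid cell.

    Hypothetical helper for illustration only; the real method may pool
    features differently.
    """
    buckets = defaultdict(list)
    for x, y, feat in observations:
        # Discretize metric coordinates into integer cell indices.
        cell = (int(x // cell_size), int(y // cell_size))
        buckets[cell].append(feat)

    grid = {}
    for cell, feats in buckets.items():
        dim = len(feats[0])
        # Mean-pool so duplicate views of one location collapse into one entry.
        grid[cell] = [sum(f[i] for f in feats) / len(feats) for i in range(dim)]
    return grid


obs = [
    (0.1, 0.2, [1.0, 0.0]),  # a spot seen from one viewpoint
    (0.3, 0.4, [3.0, 2.0]),  # the same spot seen again from another viewpoint
    (1.6, 0.2, [5.0, 5.0]),  # a different location
]
grid = aggregate_to_grid(obs, cell_size=0.5)
print(grid)
```

The three observations end up in two cells, which is the deduplication effect the abstract describes, in a much simplified form.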
Create a virtual environment. We develop this project with Python 3.6.
conda env create -f environment.yaml
download_mp.py must be obtained from the Matterport3D project webpage. download_mp.py is also used for downloading RGBD datasets in step 2.
# run with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
# Extract to: ./data/scene_datasets/mp3d/{scene}/{scene}.glb
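After extraction, a quick check of the directory layout can catch incomplete downloads early. The sketch below assumes only the `{scene}/{scene}.glb` layout stated above; the helper name `find_missing_meshes` is made up for this example.

```python
from pathlib import Path
from typing import List


def find_missing_meshes(root: str) -> List[str]:
    """Return the scene directories under `root` that lack {scene}/{scene}.glb."""
    missing = []
    for scene_dir in sorted(Path(root).iterdir()):
        if not scene_dir.is_dir():
            continue  # ignore stray files next to the scene folders
        if not (scene_dir / (scene_dir.name + ".glb")).is_file():
            missing.append(scene_dir.name)
    return missing


mp3d_root = Path("data/scene_datasets/mp3d")
if mp3d_root.is_dir():
    print("Scenes without a .glb mesh:", find_missing_meshes(str(mp3d_root)))
else:
    print("Directory not found:", mp3d_root)
```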
Follow the Habitat Installation Guide to install habitat-sim and habitat-lab. We use version v0.1.7 in our experiments. In brief:
Install habitat-sim for a machine with multiple GPUs or without an attached display (i.e. a cluster):
conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
Clone habitat-lab from the GitHub repository and install it. The command below installs the core of Habitat Lab as well as habitat_baselines.
git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
cd habitat-lab
python setup.py develop --all # install habitat and habitat_baselines
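Once both packages are installed, it may help to confirm that they can be found from the active conda environment. The following is a minimal sketch using only the standard library; `check_importable` is a hypothetical helper. It only checks that the modules can be located and does not verify that the installed version is v0.1.7.

```python
import importlib.util
from typing import Dict, Iterable


def check_importable(modules: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Module names installed by habitat-sim and habitat-lab.
status = check_importable(["habitat_sim", "habitat", "habitat_baselines"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```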
Grid feature preprocessing for metric mapping (~100G).
# for R2R, RxR, REVERIE
python precompute_features/grid_mp3d_clip.py
python precompute_features/grid_mp3d_imagenet.py
python precompute_features/grid_depth.py
python precompute_features/grid_sem.py
# for R2R-CE pre-training
python precompute_features/grid_habitat_clip.py
python precompute_features/save_habitat_img.py --img_type depth
python precompute_features/save_depth_feature.py
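Because the preprocessing is large, it can be convenient to run the scripts from a small driver that stops at the first failure. The sketch below uses the script paths listed above; the names `build_commands` and `run_all` and the dry-run flag are assumptions for illustration. Running the scripts in the listed order is also an assumption, since the instructions above do not state whether the order matters.

```python
import subprocess
import sys
from typing import List, Sequence

# Script invocations copied from the commands above.
DISCRETE_SCRIPTS = [  # for R2R, RxR, REVERIE
    ["precompute_features/grid_mp3d_clip.py"],
    ["precompute_features/grid_mp3d_imagenet.py"],
    ["precompute_features/grid_depth.py"],
    ["precompute_features/grid_sem.py"],
]
CE_SCRIPTS = [  # for R2R-CE pre-training
    ["precompute_features/grid_habitat_clip.py"],
    ["precompute_features/save_habitat_img.py", "--img_type", "depth"],
    ["precompute_features/save_depth_feature.py"],
]


def build_commands(script_args: Sequence[Sequence[str]],
                   python: str = sys.executable) -> List[List[str]]:
    """Prefix each script invocation with the chosen Python interpreter."""
    return [[python] + list(args) for args in script_args]


def run_all(commands: Sequence[Sequence[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError, stopping at the first failure.
            subprocess.run(list(cmd), check=True)


# Dry run by default: only prints what would be executed.
run_all(build_commands(DISCRETE_SCRIPTS), dry_run=True)
```

Set `dry_run=False` and run from the repository root to actually execute the scripts.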
bevbert_ce/data folder, and put the trained CE weights [link] in bevbert_ce/ckpt.
Good luck on your VLN journey with BEVBert!
Pre-training. Download the precomputed image features [link] into the img_features folder.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/pt_r2r.bash 2333 # R2R
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/pt_rxr.bash 2333 # RxR
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/pt_rvr.bash 2333 # REVERIE
cd bevbert_ce/pretrain
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_pt/run_r2r.bash 2333 # R2R-CE
Fine-tuning and testing. The trained weights can be found in step 7.
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/ft_r2r.bash 2333 # R2R
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/ft_rxr.bash 2333 # RxR
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/ft_rvr.bash 2333 # REVERIE
cd bevbert_ce
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_r2r/main.bash [train/eval/infer] 2333 # R2R-CE
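The commands above all follow the same pattern: a `CUDA_VISIBLE_DEVICES` prefix, a bash script, and trailing arguments. A minimal sketch of launching them from Python is shown below. The helper `make_launch` is hypothetical, and the trailing argument (2333 in the commands above) is passed through unchanged because its meaning is defined by the scripts. Whether the scripts adapt to a different number of GPUs is not stated here, so the example keeps the four GPUs used above.

```python
import os
import subprocess
from typing import Dict, List, Sequence, Tuple

DRY_RUN = True  # set to False to actually launch the script


def make_launch(script: str, extra_args: Sequence[object],
                gpus: Sequence[int]) -> Tuple[List[str], Dict[str, str]]:
    """Build the command and environment for one training script."""
    env = dict(os.environ)
    # Same effect as the CUDA_VISIBLE_DEVICES=... prefix in the shell commands.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)
    cmd = ["bash", script] + [str(a) for a in extra_args]
    return cmd, env


cmd, env = make_launch("scripts/ft_r2r.bash", [2333], gpus=[0, 1, 2, 3])
print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(cmd))
if not DRY_RUN:
    subprocess.run(cmd, env=env, check=True)
```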
Our implementation is partially inspired by DUET, S-MapNet and ETPNav.
We thank the authors for open-sourcing their great work!
If you find this repository useful, please consider citing our paper:
@article{an2023bevbert,
title={BEVBert: Multimodal Map Pre-training for Language-guided Navigation},
author={An, Dong and Qi, Yuankai and Li, Yangguang and Huang, Yan and Wang, Liang and Tan, Tieniu and Shao, Jing},
journal={Proceedings of the IEEE/CVF International Conference on Computer Vision},
year={2023}
}