🔥 Winner of the RxR-Habitat Challenge in CVPR 2022. [Challenge Report] [Challenge Certificate]
This work tackles a practical yet challenging VLN setting: vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability to perform obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along the traversed path, without prior environmental experience. This allows the agent to decompose the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav uses a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then executed by an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck on obstacles. Experimental results demonstrate the effectiveness of the proposed method: ETPNav yields more than 10% and 20% improvements over the prior state of the art on the R2R-CE and RxR-CE datasets, respectively.
Leaderboard:
Follow the Habitat Installation Guide to install habitat-lab and habitat-sim. We use version v0.1.7 in our experiments, the same as VLN-CE; please refer to the VLN-CE page for more details. In brief:
Create a virtual environment. We developed this project with Python 3.6.
conda env create -f environment.yaml
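Then activate the environment. Note that the environment name is set by the name: field in environment.yaml; vlnce below is only an assumed placeholder, so check the file for the actual name.
conda activate vlnce  # assumed name -- see the `name:` field in environment.yaml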
Install habitat-sim for a machine with multiple GPUs or without an attached display (i.e. a cluster):
conda install -c aihabitat -c conda-forge habitat-sim=0.1.7 headless
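As an optional sanity check, the simulator should now import without errors:
python -c "import habitat_sim"  # should exit silently if the install succeeded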
Clone this repository and install all requirements for habitat-lab, VLN-CE and our experiments. Note that we pin gym==0.21.0 because its latest version is not compatible with habitat-lab v0.1.7.
git clone git@github.com:MarSaKi/ETPNav.git
cd ETPNav
python -m pip install -r requirements.txt
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
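Optionally, verify that the CUDA build of PyTorch is active and can see your GPUs:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect: 1.9.1+cu111 True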
Clone a stable habitat-lab version from the GitHub repository and install it. The command below installs the core of Habitat Lab as well as habitat_baselines.
git clone --branch v0.1.7 git@github.com:facebookresearch/habitat-lab.git
cd habitat-lab
python setup.py develop --all # install habitat and habitat_baselines
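Optionally, confirm both packages are importable before moving on:
python -c "import habitat, habitat_baselines"  # should exit silently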
Instructions copied from VLN-CE:
Matterport3D (MP3D) scene reconstructions are used. The official Matterport3D download script (download_mp.py) can be accessed by following the instructions on their project webpage. The scene data can then be downloaded:
# requires running with python 2.7
python download_mp.py --task habitat -o data/scene_datasets/mp3d/
Extract such that it has the form scene_datasets/mp3d/{scene}/{scene}.glb. There should be 90 scenes. Place the scene_datasets folder in data/.
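A quick way to confirm the extraction (there should be 90 scene folders):
ls data/scene_datasets/mp3d | wc -l  # expect: 90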
Waypoint Predictor: data/wp_pred/check_cwp_bestdist*
Processed data, pre-trained weights, and fine-tuned weights [link].
unzip etp_ckpt.zip  # file/folder structure has already been organized
Overall, files and folders are organized as follows:
ETPNav
├── data
│   ├── datasets
│   ├── logs
│   ├── scene_datasets
│   └── wp_pred
└── pretrained
    └── ETP
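To double-check the layout after unzipping, list the two top-level data folders:
ls data pretrained  # expect: datasets logs scene_datasets wp_pred under data/, and ETP under pretrained/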
Pre-training
Download the pretraining datasets [link] (the same ones used in DUET) and the precomputed features [link], then unzip them in the pretrain_src folder.
CUDA_VISIBLE_DEVICES=0,1 bash pretrain_src/run_pt/run_r2r.bash 2333
Finetuning and Evaluation
Use main.bash for training, evaluation, and inference with a single GPU or with multiple GPUs on a single node. Simply adjust the arguments of the bash scripts:
# for R2R-CE
CUDA_VISIBLE_DEVICES=0,1 bash run_r2r/main.bash train 2333 # training
CUDA_VISIBLE_DEVICES=0,1 bash run_r2r/main.bash eval 2333 # evaluation
CUDA_VISIBLE_DEVICES=0,1 bash run_r2r/main.bash inter 2333 # inference
# for RxR-CE
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_rxr/main.bash train 2333 # training
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_rxr/main.bash eval 2333 # evaluation
CUDA_VISIBLE_DEVICES=0,1,2,3 bash run_rxr/main.bash inter 2333 # inference
Our implementation is partially inspired by CWP, Sim2Sim and DUET. Thanks for their great work!
If you find this repository useful, please consider citing our paper:
@article{an2024etpnav,
  title={ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments},
  author={An, Dong and Wang, Hanqing and Wang, Wenguan and Wang, Zun and Huang, Yan and He, Keji and Wang, Liang},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024}
}