
AVLMaps


Audio Visual Language Maps for Robot Navigation

Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard

International Symposium on Experimental Robotics (ISER), 2023 | Project page: https://avlmaps.github.io/

We present AVLMaps (Audio Visual Language Maps), an open-vocabulary 3D map representation for storing cross-modal information from audio, visual, and language cues. When combined with large language models, AVLMaps consumes multimodal prompts from audio, vision, and language to solve zero-shot spatial goal navigation, effectively leveraging complementary information sources to disambiguate goals.


Quick Start

Try AVLMaps creation and landmark indexing in the Open In Colab notebook.

Setup Environment

To begin on your own machine, clone this repository locally:

git clone https://github.com/avlmaps/AVLMaps.git

Install requirements:

$ conda create -n avlmaps python=3.8 -y  # or use virtualenv
$ conda activate avlmaps
$ conda install jupyter -y
$ cd AVLMaps
$ bash install.bash

Download Checkpoints

You can download the AudioCLIP and LSeg checkpoints with the following command:

bash download_checkpoints.bash

Generate Dataset

To build AVLMaps for simulated environments, we manually collected RGB-D videos across 10 scenes in the Habitat simulator using the Matterport3D dataset. We provide a script and pose metadata to generate the RGB-D videos. We also collected 20 sequences of RGB videos with poses for each scene and inserted audio clips from the ESC-50 dataset to create audio videos. Please follow the next few steps to generate the dataset.

Download ESC50 dataset

Download the source ESC-50 audio dataset with the following commands. For more information, please check the official repo: https://github.com/karolpiczak/ESC-50.

wget https://github.com/karoldvl/ESC-50/archive/master.zip -P ~/
unzip ~/master.zip -d <target_dir>

The extracted ESC-50 dataset is under the directory <target_dir>/ESC-50-master. You need to modify the paths in config/data_paths/default.yaml:
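
The exact keys in config/data_paths/default.yaml may differ between repository versions; the snippet below is only a sketch, and the key name esc50_audio_dir is illustrative rather than the file's actual key:

# config/data_paths/default.yaml (sketch; key name is illustrative)
esc50_audio_dir: <target_dir>/ESC-50-master  # root of the extracted ESC-50 dataset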

Download Matterport3D dataset

Please check the Dataset Download page, sign the Terms of Use, and send it to the responsible person to request the Matterport3D mesh for use in the Habitat simulator. The reply email will include a Python script for downloading the data. Copy the script into a file ~/download_mp.py, then run the following to download the data:

cd ~
# download the data at the current directory
python2 download_mp.py -o . --task habitat
# unzip the data
unzip v1/tasks/mp3d_habitat.zip
# the data_dir is mp3d_habitat/mp3d

Modify the paths in config/data_paths/default.yaml:
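
As above, the key name below is illustrative rather than the repository's exact one; the point is that config/data_paths/default.yaml must point at the extracted Matterport3D scene directory (mp3d_habitat/mp3d):

# config/data_paths/default.yaml (sketch; key name is illustrative)
habitat_scene_dir: ~/mp3d_habitat/mp3d  # Matterport3D meshes for the Habitat simulator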

Download and Generate AVLMaps Dataset

Configure config/generate_dataset.yaml:

* Change the value for `defaults/data_paths` in `config/generate_dataset.yaml` to `default`.
* Change `avlmaps_data_dir` to where you want to download the dataset.
* Change `data_cfg.resolution.w` and `data_cfg.resolution.h` to adjust the resolution of the generated RGB, depth, and semantic images.
* Change `rgb`, `depth`, and `semantic` to `true` to generate the corresponding data, or to `false` to skip it.
* Change `camera_height` to set the height of the camera relative to the robot base.
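
The option names above are taken from the configuration description; a sketch of config/generate_dataset.yaml reflecting them could look as follows (the exact nesting and default values may differ in the actual file):

# config/generate_dataset.yaml (sketch; nesting and values are illustrative)
defaults:
  - data_paths: default        # use config/data_paths/default.yaml

avlmaps_data_dir: /path/to/avlmaps_data_dir  # where the dataset is downloaded and generated
camera_height: 1.5                           # camera height relative to the robot base (illustrative value)
data_cfg:
  resolution:
    w: 1080                                  # width of generated images (illustrative value)
    h: 720                                   # height of generated images (illustrative value)
  rgb: true                                  # generate RGB images
  depth: true                                # generate depth images
  semantic: true                             # generate semantic images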

Run the following command to download and generate the dataset. The generated dataset takes around 150 GB of disk space.

# run the following from the root (<REPO_ROOT>) of this repository
python dataset/generate_dataset.py
After the data generation, the directory structure will look like the following:

# the structure of the avlmaps_data_dir will look like this
avlmaps_data_dir
├── 5LpN3gDmAk7_1
│   ├── poses.txt
│   ├── audio_video
│   │   ├── 000000
│   │   │   ├── meta.txt
│   │   │   ├── poses.txt
│   │   │   ├── output.mp4
│   │   │   ├── output_level_1.wav
│   │   │   ├── output_level_2.wav
│   │   │   ├── output_level_3.wav
│   │   │   ├── output_with_audio_level_1.mp4
│   │   │   ├── output_with_audio_level_2.mp4
│   │   │   ├── output_with_audio_level_3.mp4
│   │   │   ├── range_and_audio_meta_level_1.txt
│   │   │   ├── range_and_audio_meta_level_2.txt
│   │   │   ├── range_and_audio_meta_level_3.txt
│   │   │   ├── rgb
│   │   │   │   ├── 000000.png
│   │   │   │   ├── ...
│   │   ├── 000001
│   │   ├── ...
│   ├── depth
│   │   ├── 000000.npy
│   │   ├── ...
│   ├── rgb
│   │   ├── 000000.png
│   │   ├── ...
│   ├── semantic
│   │   ├── 000000.npy
│   │   ├── ...
├── gTV8FGcVJC9_1
│   ├── ...
├── jh4fc5c5qoQ_1
│   ├── ...
...
  

The details of the data structure are explained in the dataset README.

Create an AVLMap with the Generated Dataset

Configure the Created AVLMap

Index an AVLMap

Configure the Indexing

Citation

If you find the dataset or code useful, please cite:

@inproceedings{huang23avlmaps,
  title={Audio Visual Language Maps for Robot Navigation},
  author={Chenguang Huang and Oier Mees and Andy Zeng and Wolfram Burgard},
  booktitle={Proceedings of the International Symposium on Experimental Robotics (ISER)},
  year={2023},
  address={Chiang Mai, Thailand}
}

License

MIT License

Acknowledgement

We extend our heartfelt gratitude to the authors of the projects listed below for generously sharing their code with the public, thus greatly facilitating our research on AVLMaps:

Your contribution is invaluable to our work, and we deeply appreciate your commitment to advancing the field.