elicassion / 3DTRL

Code for NeurIPS 2022 paper "Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space"
https://elicassion.github.io/3dtrl/3dtrl.html
MIT License

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

by Jinghuan Shang, Srijan Das and Michael S. Ryoo at NeurIPS 2022

We present 3DTRL, a plug-and-play layer for Transformers that uses estimated 3D camera transformations to recover tokens in 3D space and thereby learns viewpoint-agnostic representations. Check our paper and project page for more details.

Quick links: [Usage] [Dataset] [Image Classification] [Action Recognition] [Video Alignment]

With 3DTRL, we can align videos across multiple viewpoints, even between ego-centric (first-person) and third-person view videos.

[Animation: third-person view vs. first-person view alignment, comparing GT, Ours, and DeiT+TCN]

Multi-view Video Alignment Results

3DTRL recovers pseudo-depth from images, producing semantically meaningful results.

[Figure: pseudo-depth estimates]

Overview of 3DTRL

[Figure: overview of the 3DTRL layer]
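To make the overview concrete, here is a minimal PyTorch sketch of the idea: estimate a pseudo-depth per token and a camera pose per image, back-project the tokens into a shared 3D space, and add a learned embedding of the recovered 3D positions back onto the tokens. Everything below is illustrative only: the class name, the small-angle rotation approximation, and the tensor shapes are assumptions for readability, not the repository's implementation (see backbone/ and model/ for the real layers).

# Illustrative sketch of a 3DTRL-style layer. Names are hypothetical and do
# NOT match the code in backbone/ or model/.
import torch
import torch.nn as nn

class Pseudo3DTokenLayer(nn.Module):
    """Estimate per-token pseudo-depth and a per-image camera pose,
    back-project tokens into a shared 3D space, and re-encode their
    3D positions as an additive embedding (the rough 3DTRL recipe)."""

    def __init__(self, dim, grid_size):
        super().__init__()
        # 2D token-center coordinates on the image plane, in [-1, 1].
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, grid_size),
            torch.linspace(-1, 1, grid_size),
            indexing="ij",
        )
        self.register_buffer("uv", torch.stack([xs, ys], dim=-1).view(-1, 2))
        self.depth_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))
        # Camera estimator: 3 rotation + 3 translation parameters from the CLS token.
        self.camera_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 6))
        self.pos3d_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):                      # tokens: (B, 1 + N, dim), CLS first
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        depth = self.depth_head(patch_tok)          # (B, N, 1) pseudo-depth per token
        cam = self.camera_head(cls_tok.squeeze(1))  # (B, 6) rotation + translation
        rot, trans = cam[:, :3], cam[:, 3:]
        # Back-project: lift each token center to camera coordinates using its
        # pseudo-depth, then move it into a shared "world" frame with the
        # estimated camera (small-angle rotation approximation for brevity).
        cam_xyz = torch.cat([self.uv.unsqueeze(0) * depth, depth], dim=-1)   # (B, N, 3)
        world_xyz = cam_xyz + torch.cross(rot.unsqueeze(1).expand_as(cam_xyz), cam_xyz, dim=-1)
        world_xyz = world_xyz + trans.unsqueeze(1)
        # Re-encode the recovered 3D positions and add them back to the tokens.
        patch_tok = patch_tok + self.pos3d_mlp(world_xyz)
        return torch.cat([cls_tok, patch_tok], dim=1)

# Plug-and-play: the layer would sit between two blocks of an existing ViT,
# so the remaining blocks operate on 3D-aware tokens.
if __name__ == "__main__":
    tokens = torch.randn(2, 1 + 14 * 14, 384)       # e.g. DeiT-S: 14x14 patches, dim 384
    layer = Pseudo3DTokenLayer(dim=384, grid_size=14)
    print(layer(tokens).shape)                      # torch.Size([2, 197, 384])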

Usage

Directory Structure

├── _doc                            # images, GIFs, etc. for this README
├── action_recognition              # everything for action recognition (works standalone)
│   ├── configs                     # config files for TimeSformer and TimeSformer+3DTRL
│   ├── timesformer
│   │   ├── datasets                # data pipeline for action recognition
│   │   └── models                  # definitions of TimeSformer and TimeSformer+3DTRL
│   └── script.sh                   # launch script for action recognition
├── backbone                        # modules used by 3DTRL (depth and camera estimators)
├── model                           # Transformer models with 3DTRL plugged in (ViT, Swin, TnT)
├── data_pipeline                   # dataset classes for video alignment
├── i1k_configs                     # configuration files for ImageNet-1K training
├── 3dtrl_env.yml                   # conda env for image classification and video alignment
├── i1k.sh                          # launch script for ImageNet-1K jobs
├── imagenet_train.py               # entry point for ImageNet-1K training
├── imagenet_val.py                 # entry point for ImageNet-1K evaluation
├── multiview_video_alignment.py    # entry point for video alignment
└── utils.py                        # utility functions

Image Classification

Environment:

conda env create -f 3dtrl_env.yml

Run:

conda activate 3dtrl
bash i1k.sh num_gpu your_imagenet_dir

Credit: We build our code for image classification on top of timm.

Video Alignment

FTPV Dataset

We release the First-Third Person View (FTPV) dataset (including the MC, Panda, Lift, and Can sets used in our paper) on Google Drive. Download and unzip it. Please consider citing our paper if you use these datasets. Note: the drive also includes the Pouring dataset introduced in the TCN paper. I had a hard time finding a valid source for it during my research, so I am re-sharing it here for convenience. Please cite TCN if you use Pouring.

Environment:

conda env create -f 3dtrl_env.yml

Run:

conda activate 3dtrl
python multiview_video_alignment.py --data dataset_name [--model vit_3dtrl] [--train_videos num_video_used]
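To clarify what the alignment task measures, here is a small sketch that aligns two synchronized views by nearest-neighbor search over frame embeddings and scores the result with a normalized temporal error. The embeddings below are random stand-ins for the trained encoder's outputs, and the function names and metric are assumptions for illustration, not necessarily the exact evaluation used by multiview_video_alignment.py.

# Illustrative sketch of frame-level multi-view alignment.
import torch

def align_by_nearest_neighbor(emb_a, emb_b):
    """For every frame embedding in view A, find the closest frame in view B.
    emb_a: (Ta, D), emb_b: (Tb, D). Returns indices into view B, shape (Ta,)."""
    dists = torch.cdist(emb_a, emb_b)        # (Ta, Tb) pairwise L2 distances
    return dists.argmin(dim=1)

def alignment_error(pred_idx, num_frames_b):
    """Mean absolute temporal error against the ground-truth correspondence,
    assuming the two views are synchronized (frame t in A matches frame t in B),
    normalized by video length."""
    gt = torch.arange(pred_idx.numel())
    return (pred_idx - gt).abs().float().mean() / num_frames_b

if __name__ == "__main__":
    Ta, Tb, D = 60, 60, 384
    emb_first_person = torch.randn(Ta, D)    # stand-in for encoder outputs
    emb_third_person = torch.randn(Tb, D)
    pred = align_by_nearest_neighbor(emb_first_person, emb_third_person)
    print(alignment_error(pred, Tb))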

Action Recognition

Environment: we follow TimeSformer to set up the virtual environment. Then:

cd action_recognition
bash script.sh your_config_file data_location log_location

Cite 3DTRL

If you find our research useful, please consider citing:

@inproceedings{
    3dtrl,
    title={Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space},
    author={Jinghuan Shang and Srijan Das and Michael S Ryoo},
    booktitle={Advances in Neural Information Processing Systems},
    year={2022},
}