
[ECCV 2024] Official implementation of the paper "TAPTR: Tracking Any Point with Transformers as Detection"
https://taptr.github.io

TAPTR: Tracking Any Point with TRansformers as Detection

By Hongyang Li, Hao Zhang, Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, and Lei Zhang :email:.

Project Page | Demo-PointTrajectory | Demo-TrackAnyArea

Arxiv-TAPTR | Arxiv-TAPTRv2 | BibTeX

(Demo video: video editing driven by point trajectories)

:fire: News

[2024/8/2] We add a memory-efficient mode for inference to support more users; please refer to this issue for more details.

[2024/7/24] TAPTRv2 is released, check out our paper for more details!

[2024/7/16] Training, evaluation, and demo code has been released (not yet fully cleaned). We hope it is helpful to you.

[2024/7/9] TAPTR is accepted by ECCV 2024.

[2024/3/15] We release our paper.

:spiral_calendar: TODO List

:scroll: Abstract

In this paper, we propose a simple and strong framework for Tracking Any Point with TRansformer (TAPTR). Based on the observation that point tracking bears a great resemblance to object detection and tracking, we borrow designs from DETR-like algorithms to address the task of TAP. In the proposed framework, each tracking point is represented as a DETR query, which consists of a positional part and a content part. As in DETR, each query (its position and content feature) is naturally updated layer by layer. Its visibility is predicted by its updated content feature. Queries belonging to the same tracking point can exchange information through temporal self-attention. As all such operations are well-designed in DETR-like algorithms, the model is conceptually very simple. We also adopt some useful designs such as cost volume from optical flow models and develop simple designs to mitigate the feature drifting issue. Our framework demonstrates state-of-the-art performance on various TAP datasets with faster inference speed.
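
To make the query formulation concrete, below is a minimal, hypothetical PyTorch sketch (not the repository's code) of point queries with a positional part and a content part, a temporal self-attention step that lets queries of the same tracking point exchange information across frames, and visibility prediction from the updated content features; all tensor shapes and module names are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: each tracking point in each frame is a DETR-style query
# with a positional part (x, y) and a content feature.
T, N, C = 8, 4, 256                     # frames in a window, tracking points, feature dim
positions = torch.rand(T, N, 2)         # normalized (x, y) of each point in each frame
contents = torch.randn(T, N, C)         # content feature of each point in each frame

# Temporal self-attention: queries belonging to the same tracking point (one column)
# attend to that point's queries in the other frames of the window.
temporal_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
x = contents.permute(1, 0, 2)           # (N, T, C): one temporal sequence per point
x, _ = temporal_attn(x, x, x)
contents = x.permute(1, 0, 2)           # back to (T, N, C)

# Layer-wise update: refine positions and predict visibility from the updated contents
# (single linear heads stand in for the decoder's prediction heads).
delta_xy = nn.Linear(C, 2)(contents)
visibility = nn.Linear(C, 1)(contents).sigmoid()
positions = (positions + delta_xy).clamp(0, 1)
print(positions.shape, visibility.shape)  # torch.Size([8, 4, 2]) torch.Size([8, 4, 1])
```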

:hammer_and_wrench: Method

Comparison with previous methods

Inspired by detection transformer (DETR), we find that point tracking bears a great resemblance to object detection and tracking. In particular, tracking points can be essentially regarded as queries, which have been extensively studied in DETR-like algorithms. The well-studied DETR-like framework makes our TAPTR conceptually simple yet performance-wise strong.

Overview

The video preparation and query preparation parts provide the multi-scale feature maps, point queries, and cost volumes for the point decoder. The point decoder takes these elements as input and processes all frames in parallel. Its outputs are sent to the window post-processing module, which updates the states of the tracking points to which the point queries belong.
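
The flow above can be summarized by the following hypothetical, high-level sketch; every function here is an illustrative placeholder for the corresponding stage, not the repository's actual API.

```python
from typing import Any, Dict, List

# Hypothetical placeholders for the stages described above; not the repository's API.
def prepare_video(frames: List[Any]) -> Dict[str, Any]:
    """Video preparation: would extract multi-scale feature maps for the window."""
    return {"feature_maps": frames}

def prepare_queries(feats: Dict[str, Any], states: Dict[str, Any]) -> Dict[str, Any]:
    """Query preparation: would build point queries and cost volumes from the features."""
    return {"queries": states["points"], "cost_volumes": None}

def point_decoder(feats: Dict[str, Any], qs: Dict[str, Any]) -> Dict[str, Any]:
    """Point decoder: would process all frames of the window in parallel."""
    return {"updated_queries": qs["queries"]}

def window_postprocess(outputs: Dict[str, Any], states: Dict[str, Any]) -> Dict[str, Any]:
    """Window post-processing: would write decoder outputs back into per-point states."""
    states["points"] = outputs["updated_queries"]
    return states

def run_window(frames: List[Any], states: Dict[str, Any]) -> Dict[str, Any]:
    feats = prepare_video(frames)
    qs = prepare_queries(feats, states)
    return window_postprocess(point_decoder(feats, qs), states)

print(run_window(frames=[0, 1, 2], states={"points": [(0.5, 0.5)]}))
```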

Performance

We evaluate TAPTR on the TAP-Vid benchmark to show its superiority. As shown in the table, TAPTR shows significant improvements over previous SoTA methods across the majority of metrics while maintaining an advantage in inference speed. To compare the tracking speed of different methods fairly, we report Points Per Second (PPS), the average number of points that a tracker can track across the entire video per second, measured on the DAVIS dataset in "First" mode.
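
For reference, a small hypothetical illustration of the PPS metric as defined above (the function below is for illustration only, not part of the benchmark code):

```python
def points_per_second(num_points_tracked: int, wall_clock_seconds: float) -> float:
    """Average number of points tracked across the entire video per second of inference."""
    return num_points_tracked / wall_clock_seconds

# e.g. tracking 256 points through one video in 2.0 s of inference time
print(points_per_second(256, 2.0))  # 128.0
```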

Extensive Ablations

As shown in the tables, we provide extensive ablation studies to verify the effectiveness of each key component in TAPTR, providing a reference for future work.

:gear: Usage

We develop and test our method under python=3.8.18, pytorch=1.13.0+cu117, cuda=11.7. Other versions may also work.

Prepare datasets

Construct the datasets as in CoTracker and place them at the following paths:

kubric data (for training): ./datas/kubric_movif/
tapvid data (for evaluation): 
    ./datas/tapvid/tapvid_davis
    ./datas/tapvid/tapvid_kinetics
    ./datas/tapvid/tapvid_rgb_stacking
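
As an optional sanity check (not part of the repository), the following snippet verifies that the directories above exist:

```python
from pathlib import Path

# Expected dataset locations as listed above.
expected = [
    "./datas/kubric_movif",                 # Kubric training data
    "./datas/tapvid/tapvid_davis",          # TAP-Vid DAVIS
    "./datas/tapvid/tapvid_kinetics",       # TAP-Vid Kinetics
    "./datas/tapvid/tapvid_rgb_stacking",   # TAP-Vid RGB-Stacking
]
for d in expected:
    print(f"{d}: {'found' if Path(d).is_dir() else 'MISSING'}")
```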

Installation

git clone https://github.com/IDEA-Research/TAPTR.git
cd TAPTR
pip install -r requirements.txt
cd models/dino/ops
python setup.py install # This compilation requires nvcc; please make sure CUDA is installed correctly. CUDA 11.7 is tested.
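
After compilation, a quick sanity check can confirm the CUDA op imports correctly. This assumes the extension is packaged as MultiScaleDeformableAttention, as in Deformable-DETR, from which these ops derive; adjust the module name if your build differs.

```python
# Optional sanity check (assumption: the compiled extension is named
# "MultiScaleDeformableAttention", as in Deformable-DETR).
import torch
import MultiScaleDeformableAttention as MSDA  # raises ImportError if compilation failed

print("Deformable attention op imported:", MSDA.__name__)
print("CUDA available:", torch.cuda.is_available())
```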

Eval our trained models

Download our provided checkpoint, and put it at "logs/TAPTR/taptr.pth"

# Select the dataset you want to evaluate in evaluate.sh manually. 
bash evaluate.sh

Demos

# Point Trajectory Demo
CUDA_VISIBLE_DEVICES=0 python demo_inter.py -c config/TAPTR.py --path_ckpt logs/TAPTR/taptr.pth
# Video Editing Demo
CUDA_VISIBLE_DEVICES=0 python demo_inter_video_editing.py -c config/TAPTR.py --path_ckpt logs/TAPTR/taptr.pth

Train the models

bash dist_train.sh

:herb: Acknowledgments

We would like to thank TAP-Vid and Co-Tracker for publicly releasing their code and data.

:black_nib: Citation

@inproceedings{li2024taptr,
  title={TAPTR: Tracking Any Point with Transformers as Detection},
  author={Hongyang Li and Hao Zhang and Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Lei Zhang},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}
@article{li2024taptrv2,
  title={TAPTRv2: Attention-based Position Update Improves Tracking Any Point},
  author={Hongyang Li and Hao Zhang and Shilong Liu and Zhaoyang Zeng and Feng Li and Tianhe Ren and Bohan Li and Lei Zhang},
  journal={arXiv preprint arXiv:2407.16291},
  year={2024}
}