This repository is the official implementation of CountFormer, a concise 3D multi-view counting (MVC) framework designed for real-world deployment.
Framework of CountFormer. The Image Encoder extracts multi-view and multi-level (MVML) features from the multi-view images of the scene. The Image-Level Camera Embedding Module fuses the camera intrinsics and extrinsics with the MVML features. The Cross-View Attention Module then transforms the image-level features into scene-level volume representations. Besides these main components, a 2D Density Predictor estimates the image-space density, 3D Density Predictors regress the 3D scene-level density, and a simple feature pyramid network fuses the multi-scale voxel features.
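To make the camera-embedding step concrete, here is a minimal NumPy sketch of one plausible way to fuse camera parameters with a single view's feature map. It is an illustration under stated assumptions, not the module used in this repository: the function name `camera_embedding`, the flattened 21-dimensional camera vector, the linear projection (`weight`, `bias`), and the additive fusion are all hypothetical choices made for this example.

```python
import numpy as np


def camera_embedding(features, intrinsic, extrinsic, weight, bias):
    """Hypothetical fusion of camera parameters with one view's features.

    features:  (C, H, W) feature map of a single view
    intrinsic: (3, 3) camera intrinsic matrix
    extrinsic: (3, 4) camera extrinsic matrix [R | t]
    weight:    (C, 21) projection matrix (assumed, stands in for a learned layer)
    bias:      (C,) projection bias (assumed)
    """
    # Flatten both camera matrices into one 21-dimensional vector (9 + 12).
    cam_vec = np.concatenate([intrinsic.reshape(-1), extrinsic.reshape(-1)])
    # Project the camera vector to the channel dimension of the features.
    embed = weight @ cam_vec + bias  # shape (C,)
    # Broadcast the per-channel embedding over the spatial dimensions and add it.
    return features + embed[:, None, None]
```

Additive fusion is used here only because it is the simplest option; concatenation or channel-wise scaling would fit the same description equally well.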
2024.07.08 The code of CountFormer is released on GitHub for research purposes.
2024.07.01 CountFormer has been accepted by ECCV 2024, a top-tier conference.
After preparation, the directory structure should look like this:
CountFormer
├── data
│ ├── cross_view
│ ├── citystreet
│ ├── ....
├── projects
│ ├── configs
│ ├── dataset
│ ├── modules
│ ├── registry
│ ├── ....
├── tools
├── README.md
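A quick way to confirm the layout before training is to check that the expected entries exist. The helper below is a small sketch, not part of the repository: the name `missing_entries` is hypothetical, and the paths assume that `cross_view` and `citystreet` sit under `data` and that `configs`, `dataset`, `modules`, and `registry` sit under `projects`, as the directory tree above suggests. Entries elided with `....` are not checked.

```python
from pathlib import Path

# Paths copied from the directory tree above; elided "...." entries are skipped.
EXPECTED = [
    "data/cross_view",
    "data/citystreet",
    "projects/configs",
    "projects/dataset",
    "projects/modules",
    "projects/registry",
    "tools",
    "README.md",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries(".")` from the repository root returns an empty list when every listed entry is in place; otherwise it returns the paths that still need to be prepared.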
sh tools/do_train.sh
Note that training CountFormer takes 3 days on 8x A100 GPUs (80GB).
If you find CountFormer useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
@inproceedings{mo2024countformer,
title={CountFormer: Multi-View Crowd Counting Transformer},
author={Mo, Hong and Zhang, Xiong and Tan, Jianchao and Yang, Cheng and Gu, Qiong and Hang, Bo and Ren, Wenqi},
booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2024},
organization={Springer},
}