Generative Region-Language Pretraining for Open-Ended Object Detection

Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai

Monash University, ByteDance Inc.

CVPR 2024

⭐ If GenerateU is helpful to your projects, please help star this repo. Thanks! 🤗

Highlight

GenerateU has been accepted to CVPR 2024.
We introduce generative open-ended object detection, a more general and practical setting in which categorical information is not explicitly defined. This setting is especially meaningful when users lack precise knowledge of object categories during inference.
Our GenerateU achieves comparable results to the open-vocabulary object detection method GLIP, even though the category names are not seen by GenerateU during inference.

Results

Zero-shot domain transfer to LVIS

[Figure: zero-shot domain transfer results on LVIS]

Visualizations

👨🏻‍🎨 Pseudo-label Examples

[Figure: pseudo-label examples]

🎨 Zero-shot LVIS

[Figure: zero-shot LVIS visualizations]

Overview

[Figure: overall structure of GenerateU]

Dependencies and Installation

Clone Repo

git clone https://github.com/clin1223/GenerateU.git

Create Conda Environment and Install Dependencies

# create new anaconda env
conda create -n GenerateU python=3.8 -y
conda activate GenerateU

# install python dependencies
pip3 install -e . --user
pip3 install -r requirements.txt 

# compile Deformable DETR
cd projects/DDETRS/ddetrs/models/deformable_detr/ops
bash make.sh

CUDA >= 11.3
PyTorch >= 1.10.0
Torchvision >= 0.11.1
Other required packages in requirements.txt
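The minimum versions listed above can be checked before compiling the Deformable DETR ops. The snippet below is a minimal sketch and not part of the repository: `parse_version` and `meets_minimum` are hypothetical helpers that compare dotted version strings and ignore local build suffixes such as `+cu113`.

```python
import importlib
import re


def parse_version(version):
    """Turn a string like '1.10.0+cu113' into (1, 10, 0)."""
    core = version.split("+", 1)[0]  # drop local build tags
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if `installed` is at least `minimum`."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.12' compares correctly against '0.11.1'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimum versions copied from the requirements listed above.
MINIMUMS = {"torch": "1.10.0", "torchvision": "0.11.1"}

if __name__ == "__main__":
    for name, minimum in MINIMUMS.items():
        try:
            module = importlib.import_module(name)
        except ImportError:
            print(f"{name}: not installed (need >= {minimum})")
            continue
        status = "ok" if meets_minimum(module.__version__, minimum) else "too old"
        print(f"{name} {module.__version__}: {status} (need >= {minimum})")
```

The CUDA requirement is not covered here; it depends on the toolkit installed on the machine as well as on the PyTorch build.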

Get Started

Prepare pretrained models

Download our pretrained models from here to the weights folder. For training, prepare the Swin-Tiny and Swin-Large backbone weights following the instructions in tools/convert-pretrained-swin-model-to-d2.py.

The directory structure will be arranged as:

weights
   |- vg_swinT.pth
   |- vg_swinL.pth
   |- vg_grit5m_swinT.pth
   |- vg_grit5m_swinL.pth
   |- swin_tiny_patch4_window7_224.pkl
   |- swin_large_patch4_window12_384_22k.pkl
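Before launching a run, it can help to confirm that the weights folder matches the listing above. The following is a minimal sketch, not a script shipped with the repository: `missing_weights` is a hypothetical helper, and the file names are copied from the listing. A given config may need only a subset of these files.

```python
from pathlib import Path

# File names copied from the directory listing above.
EXPECTED_WEIGHTS = [
    "vg_swinT.pth",
    "vg_swinL.pth",
    "vg_grit5m_swinT.pth",
    "vg_grit5m_swinL.pth",
    "swin_tiny_patch4_window7_224.pkl",
    "swin_large_patch4_window12_384_22k.pkl",
]


def missing_weights(weights_dir="weights"):
    """Return the expected weight files that are absent from `weights_dir`."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_WEIGHTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    if missing:
        print("Missing weight files:")
        for name in missing:
            print(f"  weights/{name}")
    else:
        print("All expected weight files are present.")
```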

Dataset preparation

VG Dataset

Download images from the VG official website
Download our pre-processed annotations: train_from_objects.json

LVIS Dataset

Download validation images from the COCO official website
Download the same validation annotations as GLIP: lvis_v1_minival.json
Download LVIS category text embedding for mapping
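This README does not spell out how the category text embedding is used for mapping. One plausible reading, stated here as an assumption, is that a generated free-form object name is embedded and matched to its nearest LVIS category by cosine similarity. The sketch below illustrates that idea with toy 3-dimensional vectors; `nearest_category` and the example names are hypothetical, and real embeddings would come from lvis_v1_clip_a+cname_ViT-H.npy.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_category(query_embedding, category_embeddings, category_names):
    """Return the (name, score) of the category closest to the query."""
    scores = [cosine_similarity(query_embedding, emb) for emb in category_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return category_names[best], scores[best]


# Toy data for illustration only; these are not real CLIP embeddings.
names = ["cat", "dog", "car"]
embeddings = [[1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.0, 0.0, 1.0]]

label, score = nearest_category([0.9, 0.2, 0.0], embeddings, names)
print(label, round(score, 3))
```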

(Optional) GrIT-20M Dataset

Download images from the GrIT-20M official website
Run Evaluation on the GrIT images to generate pseudo-labels.

The dataset structure should look like:

datasets
   |- vg
   |   |- images/
   |   |- train_from_objects.json
   |- lvis
       |- val2017/
       |- lvis_v1_minival.json
       |- lvis_v1_clip_a+cname_ViT-H.npy
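The layout above can be scaffolded ahead of the downloads. This is a minimal sketch under the assumption that everything lives in a `datasets` folder at the repository root: `prepare_layout` is a hypothetical helper that creates the image directories and reports which annotation and embedding files still need to be downloaded.

```python
from pathlib import Path

# Relative paths copied from the dataset structure above.
EXPECTED_DIRS = ["vg/images", "lvis/val2017"]
EXPECTED_FILES = [
    "vg/train_from_objects.json",
    "lvis/lvis_v1_minival.json",
    "lvis/lvis_v1_clip_a+cname_ViT-H.npy",
]


def prepare_layout(root="datasets"):
    """Create the expected directories and return the files still missing."""
    root = Path(root)
    for rel in EXPECTED_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in prepare_layout():
        print(f"still to download: datasets/{rel}")
```

The helper only creates empty directories; the images themselves must still be placed in `vg/images` and `lvis/val2017`.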

Training

By default, we train GenerateU using 16 A100 GPUs. You can also train on a single node, but this might prevent you from reproducing the results presented in the paper.

Single-Node Training

When pretraining on VG, a single node is enough. On a single node with 8 GPUs, run

python3 launch.py --nn 1 --uni 1 \
--config-file projects/DDETRS/configs/vg_swinT.yaml OUTPUT_DIR outputs/${EXP_NAME}

Multiple-Node Training

# On node 0, run
python3 launch.py --nn 2 --port <PORT> --worker_rank 0 --master_address <MASTER_ADDRESS> \
--uni 1 --config-file /path/to/config/name.yaml  OUTPUT_DIR outputs/${EXP_NAME}
# On node 1, run
python3 launch.py --nn 2 --port <PORT> --worker_rank 1 --master_address <MASTER_ADDRESS> \
--uni 1 --config-file /path/to/config/name.yaml OUTPUT_DIR outputs/${EXP_NAME}

<MASTER_ADDRESS> should be the IP address of node 0. <PORT> must be the same on all nodes. If <PORT> is not specified, the program will generate a random number as <PORT>.
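Since the per-node commands differ only in --worker_rank, they can be generated rather than typed by hand. The sketch below is a hypothetical helper, not part of the repository; it only assembles the flags shown in the commands above. The address `10.0.0.1`, port `29500`, and experiment name `my_exp` are placeholder values.

```python
import shlex


def build_launch_cmd(num_nodes, worker_rank, config_file, exp_name,
                     master_address=None, port=None):
    """Assemble the launch.py command line for one node as a list of arguments."""
    cmd = ["python3", "launch.py", "--nn", str(num_nodes)]
    if port is not None:
        cmd += ["--port", str(port)]
    if num_nodes > 1:
        if master_address is None:
            raise ValueError("master_address is required for multi-node runs")
        cmd += ["--worker_rank", str(worker_rank),
                "--master_address", master_address]
    cmd += ["--uni", "1", "--config-file", config_file,
            "OUTPUT_DIR", f"outputs/{exp_name}"]
    return cmd


if __name__ == "__main__":
    # Placeholder values; replace with the real address, port, and config.
    for rank in range(2):
        cmd = build_launch_cmd(2, rank, "/path/to/config/name.yaml", "my_exp",
                               master_address="10.0.0.1", port=29500)
        print(f"# On node {rank}, run")
        print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.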

Evaluation

To evaluate a trained or pretrained model, run

python3 launch.py --nn 1 --eval-only --uni 1 --config-file /path/to/config/name.yaml  \
OUTPUT_DIR outputs/${EXP_NAME}  MODEL.WEIGHTS /path/to/weight.pth

Citation

If you find our repo useful for your research, please consider citing our paper:

   @inproceedings{lin2024generateu,
      title={Generative Region-Language Pretraining for Open-Ended Object Detection},
      author={Lin, Chuang and Jiang, Yi and Qu, Lizhen and Yuan, Zehuan and Cai, Jianfei},
      booktitle={Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2024}
   }

Contact

If you have any questions, please feel free to reach out to me at chuang.lin@monash.edu.

Acknowledgement

This code is based on UNINEXT. Some code is borrowed from FlanT5. Thanks for their awesome work.

Special thanks to Bin Yan and Junfeng Wu for their valuable contributions.
