CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection,
Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
NeurIPS 2023 (https://arxiv.org/abs/2310.16667)
Project page (https://codet-ovd.github.io)

Features

Train an open-vocabulary detector with web-scale image-text pairs
Align regions and words by co-occurrence instead of region-text similarity
State-of-the-art performance on open-vocabulary LVIS
Deployed with modern visual foundation models
Integrated with Roboflow to automatically label images for training a small, fine-tuned model

Installation

Setup environment

conda create --name codet python=3.8 -y && conda activate codet
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
git clone https://github.com/CVMI-Lab/CoDet.git

Install Apex and xFormers (you can skip this step if you do not use the EVA-02 backbone)

pip install ninja
pip install -v -U git+https://github.com/facebookresearch/xformers.git@7e05e2caaaf8060c1c6baadc2b04db02d5458a94
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./ && cd ..

Install detectron2 and other dependencies

cd CoDet/third_party/detectron2
pip install -e .
cd ../..
pip install -r requirements.txt
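
After installation, a quick check that the packages are importable can save a failed training launch. The snippet below is a minimal sketch, not part of the repo: the helper names `installed` and `report` are hypothetical, and it only tests whether each module can be located, without importing it or checking versions.

```python
import importlib.util

# Import names of the packages installed in the steps above.
# xformers and apex are only needed for the EVA-02 backbone.
REQUIRED = ("torch", "torchvision", "detectron2")
OPTIONAL = ("xformers", "apex")


def installed(module_name):
    """Return True if a top-level module can be found on the current path."""
    return importlib.util.find_spec(module_name) is not None


def report():
    """Print one line per package and return the list of missing required ones."""
    missing = []
    for name in REQUIRED + OPTIONAL:
        found = installed(name)
        kind = "required" if name in REQUIRED else "optional"
        print(f"{name:12s} {kind:9s} {'ok' if found else 'MISSING'}")
        if not found and name in REQUIRED:
            missing.append(name)
    return missing


if __name__ == "__main__":
    report()
```

Run it inside the activated codet environment. A missing optional package only matters if you plan to use the EVA-02 backbone.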

Prepare Datasets

We use LVIS and Conceptual Captions (CC3M) for OV-LVIS experiments, COCO for OV-COCO experiments, and Objects365 for cross-dataset evaluation. Before processing, download the datasets you need from their official websites and place or symlink them under CoDet/datasets/. CoDet/datasets/metadata/ holds the preprocessed metadata and is included in the repo. Please refer to DATA.md for more details.

$CoDet/datasets/
    metadata/
    lvis/
    coco/
    cc3m/
    objects365/
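
To confirm the layout above before training, something like the following can list which directories are still absent. It is an illustrative sketch: the function name `missing_dataset_dirs` is hypothetical, and it checks only the top-level folders, not their contents (see DATA.md for those). Since only the datasets you selected are needed, a reported absence is not necessarily an error.

```python
from pathlib import Path

# Directory names taken from the layout above.
EXPECTED_DIRS = ("metadata", "lvis", "coco", "cc3m", "objects365")


def missing_dataset_dirs(datasets_root):
    """Return the expected sub-directories that are absent under datasets_root.

    Symlinked directories count as present, because Path.is_dir() follows links.
    """
    root = Path(datasets_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the CoDet repository root.
    missing = missing_dataset_dirs("datasets")
    if missing:
        print("Not found under datasets/:", ", ".join(missing))
    else:
        print("All expected dataset directories found.")
```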

Model Zoo

OV-COCO

Backbone	Box AP50	Box AP50_novel	Config	Model
ResNet50	46.8	30.6	CoDet_OVCOCO_R50_1x.yaml	ckpt

OV-LVIS

Backbone	Mask mAP	Mask mAP_novel	Config	Model
ResNet50	31.3	23.7	CoDet_OVLVIS_R5021k_4x_ft4x.yaml	ckpt
Swin-B	39.2	29.4	CoDet_OVLVIS_SwinB_4x_ft4x.yaml	ckpt
EVA02-L	44.7	37.0	CoDet_OVLVIS_EVA_4x.yaml	ckpt

Inference

To test with custom images/videos, run

python demo.py --config-file [config_file] --input [your_image_file] --output [output_file_path] --vocabulary lvis --opts MODEL.WEIGHTS [model_weights]

Or you can customize the test vocabulary, e.g.,

python demo.py --config-file [config_file] --input [your_image_file] --output [output_file_path] --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffee --confidence-threshold 0.3 --opts MODEL.WEIGHTS [model_weights]
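
When the class list comes from a script rather than being typed by hand, the custom-vocabulary command can be assembled programmatically. The sketch below uses only the flags shown above. The helper `build_demo_command` is hypothetical, and the image, output, and checkpoint paths are placeholders, as is the assumption that the config file sits directly under configs/.

```python
import shlex


def build_demo_command(config_file, image, output, weights, classes, threshold=0.3):
    """Assemble the demo.py invocation for a custom vocabulary as an argument list."""
    # Strip whitespace and drop empty names so the comma-joined list stays clean.
    names = [c.strip() for c in classes if c.strip()]
    if not names:
        raise ValueError("at least one class name is required")
    return [
        "python", "demo.py",
        "--config-file", config_file,
        "--input", image,
        "--output", output,
        "--vocabulary", "custom",
        "--custom_vocabulary", ",".join(names),
        "--confidence-threshold", str(threshold),
        "--opts", "MODEL.WEIGHTS", weights,
    ]


cmd = build_demo_command(
    "configs/CoDet_OVLVIS_R5021k_4x_ft4x.yaml",  # config name from the Model Zoo; location assumed
    "desk.jpg",          # placeholder input image
    "out.jpg",           # placeholder output path
    "models/codet.pth",  # placeholder checkpoint path
    ["headphone", "webcam", " paper ", "coffee"],
)
# shlex.join quotes arguments where needed, so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The argument list can also be passed straight to `subprocess.run(cmd, check=True)` from the repository root, which avoids shell quoting issues for class names that contain spaces.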

To evaluate a pre-trained model, run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config --eval-only MODEL.WEIGHTS /path/to/ckpt

To evaluate a pre-trained model on Objects365 (cross-dataset evaluation), run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config --eval-only MODEL.WEIGHTS /path/to/ckpt DATASETS.TEST "('objects365_v2_val',)" MODEL.RESET_CLS_TESTS True MODEL.TEST_CLASSIFIERS "('datasets/metadata/o365_clip_a+cnamefix.npy',)" MODEL.TEST_NUM_CLASSES "(365,)" MODEL.MASK_ON False

Training

Training configurations used by the paper are listed in CoDet/configs. Most config files require pre-trained model weights for initialization (indicated by MODEL.WEIGHTS in the config file). Please train or download the corresponding pre-trained models and place them under CoDet/models/ before training.

Name	Model
resnet50_miil_21k.pkl	ResNet50-21K pretrain from MIIL
swin_base_patch4_window7_224_22k.pkl	SwinB-21K pretrain from Swin-Transformer
eva02_L_pt_m38m_p14to16.pt	EVA02-L mixed 38M pretrain from EVA
BoxSup_OVCOCO_CLIP_R50_1x.pth	ResNet50 COCO base class pretrain from Detic
BoxSup-C2_Lbase_CLIP_R5021k_640b64_4x.pth	ResNet50 LVIS base class pretrain from Detic
BoxSup-C2_Lbase_CLIP_SwinB_896b32_4x.pth	SwinB LVIS base class pretrain from Detic

To train on a single node, run

python train_net.py --num-gpus $GPU_NUM --config-file /path/to/config

Note: By default, we use 8 V100 GPUs for training with ResNet50 or SwinB, and 16 A100 GPUs for training with EVA02-L. If you train with a different number of GPUs, remember to rescale the learning rate accordingly.
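
One common way to rescale is the linear scaling rule: keep the per-GPU batch size fixed and scale the learning rate in proportion to the total batch size. The sketch below assumes that rule, which this README does not prescribe. The function name `rescale_for_gpus` is hypothetical, and the base learning rate and batch size in the example are placeholders, so read the real values from your config file.

```python
def rescale_for_gpus(base_lr, base_batch, base_gpus, new_gpus):
    """Apply the linear scaling rule for a different GPU count.

    The per-GPU batch size is kept fixed, so the total batch size and the
    learning rate both scale by new_gpus / base_gpus.
    """
    if base_gpus <= 0 or new_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    if base_batch % base_gpus != 0:
        raise ValueError("base_batch must be divisible by base_gpus")
    per_gpu_batch = base_batch // base_gpus
    new_batch = per_gpu_batch * new_gpus
    new_lr = base_lr * new_batch / base_batch
    return new_lr, new_batch


# Placeholder values: a config tuned for 8 GPUs, now run on 4.
lr, batch = rescale_for_gpus(base_lr=2e-4, base_batch=64, base_gpus=8, new_gpus=4)
print(f"SOLVER.BASE_LR {lr:g} SOLVER.IMS_PER_BATCH {batch}")
```

SOLVER.BASE_LR and SOLVER.IMS_PER_BATCH are the standard detectron2 config keys for the learning rate and total batch size. If the configs here follow that convention, the printed pair can be appended to the train_net.py command as overrides, in the same way MODEL.WEIGHTS is passed in the evaluation commands.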

Citation

If you find this repo useful for your research, please consider citing our paper:

@inproceedings{ma2023codet,
  title={CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection},
  author={Ma, Chuofan and Jiang, Yi and Wen, Xin and Yuan, Zehuan and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2023}
}

Acknowledgment

CoDet is built upon the awesome works Detic and EVA.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
