anishmadan23 / foundational_fsod

This repository contains the implementation for the paper "Revisiting Few Shot Object Detection with Vision-Language Models"
https://arxiv.org/abs/2312.14494
Apache License 2.0

Revisiting Few Shot Object Detection with Vision-Language Models

NOTE: Switch to the MQDet branch to run the MQDet experiments.

IMPORTANT: Use the test_set.json file when evaluating performance.

Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan

:star: Foundational FSOD Challenge

We are releasing a Foundational FSOD challenge as part of the Workshop on Visual Perception and Learning in an Open World at CVPR 2024. We are accepting submissions till 7th June 2024!

Abstract

Few-shot object detection (FSOD) benchmarks have advanced techniques for detecting new categories with limited annotations. Existing benchmarks repurpose well-established datasets like COCO by partitioning categories into base and novel classes for pre-training and fine-tuning respectively. However, these benchmarks do not reflect how FSOD is deployed in practice. Rather than only pre-training on a small number of base categories, we argue that it is more practical to fine-tune a foundation model (e.g., a vision-language model (VLM) pre-trained on web-scale data) for a target domain. Surprisingly, we find that zero-shot inference from VLMs like GroundingDINO significantly outperforms the state-of-the-art (48.3 vs. 33.1 AP) on COCO. However, such zero-shot models can still be misaligned to target concepts of interest. For example, trailers on the web may be different from trailers in the context of autonomous vehicles. In this work, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on K-shots per target class. Further, we note that current FSOD benchmarks are actually federated datasets containing exhaustive annotations for each category on a subset of the data. We leverage this insight to propose simple strategies for fine-tuning VLMs with federated losses. We demonstrate the effectiveness of our approach on LVIS and nuImages, improving over prior work by 5.9 AP.
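To make the idea of a federated loss more concrete, here is a minimal, generic sketch in plain Python. It is not the loss implemented in this repository: it only illustrates the general pattern of computing a per-class binary cross-entropy over a subset of classes, so that classes without trustworthy annotations for an image contribute no gradient signal. The function names and the `evaluated_classes` argument are hypothetical, and how that subset is chosen is left open here.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy for one logit and a 0/1 target."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def federated_bce_loss(logits, targets, evaluated_classes):
    """Sum BCE over the evaluated classes only, averaged over boxes.

    logits, targets: lists of per-box lists, one entry per class.
    evaluated_classes: class indices that contribute to the loss
        (hypothetical input; every other class is ignored).
    """
    evaluated = set(evaluated_classes)
    total = 0.0
    for box_logits, box_targets in zip(logits, targets):
        for class_idx, (logit, target) in enumerate(zip(box_logits, box_targets)):
            if class_idx in evaluated:
                total += bce_with_logits(logit, target)
    return total / max(len(logits), 1)


if __name__ == "__main__":
    # One box, two classes. Class 1 has a confident score but no annotation.
    logits = [[0.0, 50.0]]
    targets = [[0.0, 0.0]]
    print(federated_bce_loss(logits, targets, evaluated_classes=[0]))
    print(federated_bce_loss(logits, targets, evaluated_classes=[0, 1]))
```

In the demo, restricting the loss to class 0 avoids penalizing the confident class-1 score, which could be a correct detection of an object that was simply never annotated.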

Installation

See installation instructions.

Data

See datasets/README.md

Models

Create a models/ directory in the repository root and place the downloaded pre-trained model in it.

Training

python train_net.py --num-gpus 1 --config-file <config_path>  --pred_all_class  OUTPUT_DIR_PREFIX <root_output_dir>

Config Details

  1. Convert the training-set predictions to annotations:

python tools/convert_preds_to_ann.py --pred_path_train <path_trainset_eval_pth_file> --dataset_name nuimages_fsod_train_seed_0_shots_10 --conf_thresh 0.2

  2. Set ROI_BOX_HEAD.ALL_ANN_FILE to the generated predictions.
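As a rough illustration of what turning predictions into annotations with a confidence threshold can look like, the sketch below filters COCO-style detection dictionaries by score and re-emits the survivors as COCO-style annotation entries. This is not the code in tools/convert_preds_to_ann.py: the function name, the input format, and the choice to keep scores equal to the threshold are all assumptions made for the example. The actual script reads a .pth file whose layout is not described here.

```python
def predictions_to_annotations(predictions, conf_thresh=0.2, start_id=1):
    """Keep detections with score >= conf_thresh and convert them to annotations.

    predictions: list of dicts with "image_id", "category_id",
        "bbox" ([x, y, width, height]) and "score" (assumed format).
    """
    annotations = []
    next_id = start_id
    for pred in predictions:
        # Assumption: detections exactly at the threshold are kept.
        if pred["score"] < conf_thresh:
            continue
        x, y, w, h = pred["bbox"]
        annotations.append({
            "id": next_id,
            "image_id": pred["image_id"],
            "category_id": pred["category_id"],
            "bbox": [x, y, w, h],
            "area": w * h,
            "iscrowd": 0,
        })
        next_id += 1
    return annotations


if __name__ == "__main__":
    # Made-up detections, for illustration only.
    demo = [
        {"image_id": 7, "category_id": 3, "bbox": [10, 20, 30, 40], "score": 0.55},
        {"image_id": 7, "category_id": 5, "bbox": [0, 0, 5, 5], "score": 0.05},
    ]
    print(predictions_to_annotations(demo))
```

A lower threshold keeps more pseudo-annotations at the cost of more false positives. The value 0.2 here simply mirrors the --conf_thresh value in the command above.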

Inference

python train_net.py --num-gpus 8 --config-file <config_path>  --pred_all_class --eval-only  MODEL.WEIGHTS <model_path> OUTPUT_DIR_PREFIX <root_output_dir>
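When running many configurations, it can help to assemble the training and inference commands programmatically. The helper below is a small sketch that uses only the flags shown in the commands above. The function name `build_command` and the example paths are hypothetical, and the placeholders still need to be replaced with real config, weight, and output paths.

```python
import shlex


def build_command(config_path, output_dir_prefix, num_gpus=1, weights=None):
    """Return the argument list for train_net.py.

    If weights is given, the command runs in evaluation-only mode,
    mirroring the inference command in this README.
    """
    cmd = [
        "python", "train_net.py",
        "--num-gpus", str(num_gpus),
        "--config-file", config_path,
        "--pred_all_class",
    ]
    if weights is not None:
        cmd += ["--eval-only", "MODEL.WEIGHTS", weights]
    cmd += ["OUTPUT_DIR_PREFIX", output_dir_prefix]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute your own.
    train_cmd = build_command("<config_path>", "<root_output_dir>")
    eval_cmd = build_command("<config_path>", "<root_output_dir>",
                             num_gpus=8, weights="<model_path>")
    print(shlex.join(train_cmd))
    print(shlex.join(eval_cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues for paths containing spaces.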

Acknowledgment

We thank the authors of the following repositories for their open-source implementations which were used in building the current codebase:

  1. Detic: Detecting Twenty-thousand Classes using Image-level Supervision
  2. Detectron2

Citation

If you find our paper and code repository useful, please cite us:

@article{madan2023revisiting,
  title={Revisiting Few-Shot Object Detection with Vision-Language Models},
  author={Madan, Anish and Peri, Neehar and Kong, Shu and Ramanan, Deva},
  journal={arXiv preprint arXiv:2312.14494},
  year={2023}
}