FoundationVision / GLEE

[CVPR2024 Highlight] GLEE: General Object Foundation Model for Images and Videos at Scale
https://glee-vision.github.io/
MIT License
foundation-model interactive-segmentation object-detection open-vocabulary-detection open-vocabulary-segmentation open-vocabulary-video-segmentation open-world referring-expression-comprehension referring-expression-segmentation referring-video-object-segmentation segment-anything tracking video-instance-segmentation video-object-segmentation zero-shot-object-detection

GLEE: General Object Foundation Model for Images and Videos at Scale

Junfeng Wu*, Yi Jiang*, Qihao Liu, Zehuan Yuan, Xiang Bai, and Song Bai

* Equal Contribution, Correspondence

[Project Page] [Paper] [HuggingFace Demo] [Video Demo]



Highlight:

We will release the following contents for GLEE!

Getting started

  1. Installation: Please refer to INSTALL.md for more details.
  2. Data preparation: Please refer to DATA.md for more details.
  3. Training: Please refer to TRAIN.md for more details.
  4. Testing: Please refer to TEST.md for more details.
  5. Model zoo: Please refer to MODEL_ZOO.md for more details.

Run the demo APP

Try our online demo app on [HuggingFace Demo] or use it locally:

# install the dependencies first (see INSTALL.md)
git clone https://github.com/FoundationVision/GLEE
cd GLEE
# the demo runs on either CPU or GPU
python app.py

Introduction

GLEE has been trained on over ten million images from 16 datasets, fully harnessing both existing annotated data and cost-effective automatically labeled data to construct a diverse training set. This extensive training regime endows GLEE with formidable generalization capabilities.

[Figure: data_demo]

GLEE consists of an image encoder, a text encoder, a visual prompter, and an object decoder, as illustrated in the pipeline figure below. The text encoder processes arbitrary descriptions related to the task, including 1) object category lists, 2) object names in any form, 3) captions about objects, and 4) referring expressions. The visual prompter encodes user inputs given during interactive segmentation, such as 1) points, 2) bounding boxes, and 3) scribbles, into visual representations of the target objects. These text and visual representations are then fed into a detector, which extracts objects from images according to the textual and visual input.

[Figure: pipeline]
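
To make the data flow above concrete, here is a minimal sketch of how the two prompt families could be typed and routed to the four components. It is not GLEE's actual API: the class names, the `kind` strings, the function `plan_forward_pass`, and the rule that at least one prompt must be supplied are all hypothetical and chosen only for illustration. See the repository code for the real interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical labels for the prompt forms listed in the paragraph above.
TEXT_KINDS = {"category_list", "object_names", "caption", "referring_expression"}
VISUAL_KINDS = {"point", "box", "scribble"}


@dataclass(frozen=True)
class TextPrompt:
    """A task description handled by the text encoder (illustrative only)."""
    kind: str
    phrases: Tuple[str, ...]

    def __post_init__(self) -> None:
        if self.kind not in TEXT_KINDS:
            raise ValueError(f"unknown text prompt kind: {self.kind!r}")


@dataclass(frozen=True)
class VisualPrompt:
    """A user input handled by the visual prompter (illustrative only)."""
    kind: str
    points: Tuple[Tuple[float, float], ...]  # (x, y) pixel coordinates

    def __post_init__(self) -> None:
        if self.kind not in VISUAL_KINDS:
            raise ValueError(f"unknown visual prompt kind: {self.kind!r}")
        # Assumed encoding: a point is one (x, y) pair, a box is two corners,
        # and a scribble is a polyline with at least two points.
        if self.kind == "point" and len(self.points) != 1:
            raise ValueError("a point prompt needs exactly one (x, y) pair")
        if self.kind == "box" and len(self.points) != 2:
            raise ValueError("a box prompt needs exactly two corner points")
        if self.kind == "scribble" and len(self.points) < 2:
            raise ValueError("a scribble prompt needs at least two points")


def plan_forward_pass(
    text: Optional[TextPrompt] = None,
    visual: Optional[VisualPrompt] = None,
) -> List[str]:
    """Return, in order, the components a request would pass through."""
    if text is None and visual is None:
        # Assumption of this sketch: some prompt must say what to look for.
        raise ValueError("at least one prompt is required")
    steps = ["image_encoder"]
    if text is not None:
        steps.append("text_encoder")
    if visual is not None:
        steps.append("visual_prompter")
    steps.append("object_decoder")
    return steps


if __name__ == "__main__":
    categories = TextPrompt("category_list", ("person", "dog", "bicycle"))
    click = VisualPrompt("point", ((120.0, 84.5),))
    print(plan_forward_pass(text=categories))
    print(plan_forward_pass(visual=click))
```

In this sketch a category-list request passes through the image encoder, the text encoder, and the object decoder, while an interactive click replaces the text encoder with the visual prompter. The object decoder is the shared final stage in both cases.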

Based on the above designs, GLEE seamlessly unifies a wide range of object perception tasks in images and videos, including object detection, instance segmentation, grounding, multi-target tracking (MOT), video instance segmentation (VIS), video object segmentation (VOS), and interactive segmentation and tracking. It also supports open-world and large-vocabulary detection and segmentation tasks for both images and videos.

Results

Image-level tasks

[Figure: imagetask]

[Figure: odinw]

Video-level tasks

[Figure: videotask]

[Figure: visvosrvos]

Citing GLEE

@misc{wu2023GLEE,
  author = {Junfeng Wu and Yi Jiang and Qihao Liu and Zehuan Yuan and Xiang Bai and Song Bai},
  title = {General Object Foundation Model for Images and Videos at Scale},
  year = {2023},
  eprint = {2312.09158},
  archivePrefix = {arXiv}
}

Acknowledgments