UX-Decoder / DINOv

[CVPR 2024] Official implementation of the paper "Visual In-Context Prompting"

Visual In-Context Prompting

:grapes: [Read our arXiv Paper]   :apple: [Try our Demo]

In this work, we introduce DINOv, a Visual In-Context Prompting framework for referring and generic segmentation tasks.

For visualization and demos, we also recommend trying T-Rex (demo link), another visual prompting tool from our team with properties similar to those of DINOv.

teaser

:hammer_and_wrench: Installation

pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
git clone https://github.com/UX-Decoder/DINOv
cd DINOv
python -m pip install -r requirements.txt
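
Once the steps above have finished, it can be useful to confirm that the key packages resolve before launching anything. The following is a minimal sketch and not part of the repository; `check_packages` is a hypothetical helper, and the names `torch`, `torchvision`, `detectron2`, and `panopticapi` are the import names we assume the installed packages expose.

```python
import importlib.util


def check_packages(names):
    """Return {module_name: bool}, True when the top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    required = ["torch", "torchvision", "detectron2", "panopticapi"]
    for name, found in check_packages(required).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

This only checks that the modules can be located; it does not verify versions or CUDA availability.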

:point_right: Launch a demo for visual in-context prompting

python demo_openset.py --ckpt /path/to/swinL/ckpt

Openset segmentation

generic_seg_vis

Panoptic segmentation

panoptic_vis

👉: Related projects:

:unicorn: Getting Started

:mosque: Data preparation

We jointly train on COCO and SA-1B data. Please refer to "prepare SA-1B data" and "prepare COCO data".

For evaluation, you need to prepare

:volcano: Model Zoo

The currently released checkpoints are trained with SA-1B and COCO data.

| Name | Training Dataset | Backbone | PQ (COCO) | PQ (ADE) | Download |
|---|---|---|---|---|---|
| DINOv (config) | SA-1B, COCO | SwinT | 49.0 | 19.4 | model |
| DINOv (config) | SA-1B, COCO | SwinL | 57.7 | 23.2 | model |
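
For readers unfamiliar with the PQ columns: panoptic quality is commonly defined as the sum of IoUs over matched segment pairs divided by |TP| + 0.5 |FP| + 0.5 |FN|. The sketch below computes that quantity from already-matched segments. The function name and its inputs are illustrative assumptions; it is not the evaluation code used to produce the numbers above, and it returns a fraction while the table reports percentages.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum(IoU of matched pairs) / (TP + 0.5 * FP + 0.5 * FN).

    matched_ious: IoU values of predicted/ground-truth pairs counted as
    true positives (by convention, pairs with IoU > 0.5).
    num_fp: number of unmatched predicted segments.
    num_fn: number of unmatched ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # no segments at all; treated as 0 in this sketch
    return sum(matched_ious) / denom
```

For example, two matches with IoUs 0.8 and 0.6, one false positive, and one false negative give 1.4 / 3, roughly 0.467.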

:sunflower: Evaluation

We do detection evaluation on COCO val2017. $n is the number of GPUs you use.

Process visual prompt embeddings for inference. We compute all the instance prompt embeddings of the validation set (you can also use the training set, but processing takes much longer) and store them. We then run inference by randomly selecting some visual prompts as in-context examples.
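
The random selection step can be pictured with a small sketch. This is an illustration under stated assumptions, not the repository's code: the bank layout (a dict from category name to a list of stored embeddings), the function name `sample_in_context_prompts`, and the toy values are all hypothetical.

```python
import random


def sample_in_context_prompts(bank, category, k, seed=None):
    """Randomly pick up to k stored prompt embeddings for one category.

    bank: dict mapping a category name to a list of embeddings
          (plain lists of floats here; a hypothetical storage layout).
    """
    rng = random.Random(seed)
    candidates = bank.get(category, [])
    k = min(k, len(candidates))  # never ask for more than is stored
    return rng.sample(candidates, k)


# Toy bank standing in for embeddings precomputed on the validation set.
bank = {
    "dog": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "cat": [[0.9, 0.8]],
}
prompts = sample_in_context_prompts(bank, "dog", k=2, seed=0)
print(len(prompts))
```

Fixing `seed` makes a run repeatable, which helps when comparing results across different numbers of in-context examples.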

Evaluate Open-set detection and segmentation

The first step is to compute and store the results on DAVIS2017. We implement a naive memory-aware approach with our in-context visual prompting.

python train_net.py --eval_track_prev --eval_only --resume --num-gpus 8 --config-file configs/dinov_sam_coco_train.yaml DAVIS.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights MODEL.DECODER.NMS_THRESHOLD=0.9 MODEL.DECODER.MAX_MEMORY_SIZE=9 OUTPUT_DIR=/path/to/outputs
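
The README does not spell out the memory scheme, so the following is only one plausible reading of a memory-aware approach: keep prompt embeddings from at most a fixed number of frames (compare `MODEL.DECODER.MAX_MEMORY_SIZE=9`) and use them as in-context examples for the next frame. The class name, the choice to always retain the first annotated frame, and the oldest-first eviction order are assumptions made for illustration, not the repository's implementation.

```python
from collections import deque


class PromptMemory:
    """Bounded memory: the first (annotated) frame plus the most recent frames."""

    def __init__(self, max_size):
        if max_size < 1:
            raise ValueError("max_size must be at least 1")
        self.first = None
        # Remaining slots hold the most recent frames; the oldest is evicted first.
        self.recent = deque(maxlen=max_size - 1)

    def add(self, embedding):
        """Store the embedding from a newly processed frame."""
        if self.first is None:
            self.first = embedding
        else:
            self.recent.append(embedding)

    def prompts(self):
        """Return the embeddings currently available as in-context prompts."""
        if self.first is None:
            return []
        return [self.first, *self.recent]
```

With `max_size=3`, adding frames f0 through f4 leaves f0, f3, and f4 in memory: the annotated frame is never evicted, and the two most recent frames track the object's current appearance.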

The second step is to evaluate the semi-supervised results.

python evaluation_method.py --task semi-supervised --results_path /path/to/results --davis_path /path/to/davis/data

We recommend a total batch size of 64 for training, which provides enough positive and negative samples for contrastive learning.

For SwinT backbone

python train_net.py --resume --num-gpus 8 --config-file configs/dinov_sam_coco_train.yaml SAM.TRAIN.BATCH_SIZE_TOTAL=8 COCO.TRAIN.BATCH_SIZE_TOTAL=8

For SwinL backbone

python train_net.py --resume --num-gpus 8 --config-file configs/dinov_sam_coco_swinl_train.yaml SAM.TRAIN.BATCH_SIZE_TOTAL=8 COCO.TRAIN.BATCH_SIZE_TOTAL=8
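
The two training commands differ only in the config file. As a convenience sketch (a hypothetical helper, not a script in the repository), the command line can be assembled from the flags and config keys shown above. The batch-size value is left as a parameter, since the text recommends a total batch size of 64 while the example commands pass 8; how the recommendation maps onto the two per-dataset keys is not stated, so no mapping is assumed here.

```python
import shlex

# Config files taken from the commands above, keyed by a short backbone name.
CONFIGS = {
    "swint": "configs/dinov_sam_coco_train.yaml",
    "swinl": "configs/dinov_sam_coco_swinl_train.yaml",
}


def build_train_command(backbone, num_gpus=8, batch_size_total=8):
    """Assemble the train_net.py invocation as a list of arguments."""
    return [
        "python", "train_net.py", "--resume",
        "--num-gpus", str(num_gpus),
        "--config-file", CONFIGS[backbone],
        f"SAM.TRAIN.BATCH_SIZE_TOTAL={batch_size_total}",
        f"COCO.TRAIN.BATCH_SIZE_TOTAL={batch_size_total}",
    ]


# Prints the SwinL command in copy-pasteable form.
print(shlex.join(build_train_command("swinl")))
```

The argument list can also be passed directly to `subprocess.run` from the repository root instead of being printed.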

Model framework

framework query_formulation

Results

Open-set detection and segmentation

image

Video object segmentation

image

:black_nib: Citation

If you find our work helpful for your research, please consider citing the following BibTeX entry.


@article{li2023visual,
  title={Visual In-Context Prompting},
  author={Li, Feng and Jiang, Qing and Zhang, Hao and Ren, Tianhe and Liu, Shilong and Zou, Xueyan and Xu, Huaizhe and Li, Hongyang and Li, Chunyuan and Yang, Jianwei and others},
  journal={arXiv preprint arXiv:2311.13601},
  year={2023}
}