:grapes: [Read our arXiv Paper] :apple: [Try our Demo]
In this work, we introduce DINOv, a Visual In-Context Prompting framework for referring and generic segmentation tasks.
For visualization and demos, we also recommend trying the T-Rex demo link, another visual prompting tool from our team with properties similar to DINOv.
pip3 install torch==1.13.1 torchvision==0.14.1 --extra-index-url https://download.pytorch.org/whl/cu113
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
pip install git+https://github.com/cocodataset/panopticapi.git
git clone https://github.com/UX-Decoder/DINOv
cd DINOv
python -m pip install -r requirements.txt
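Once the packages above are installed, a quick sanity check (a minimal sketch using only the dependencies installed above) confirms that PyTorch, Detectron2, and panopticapi import correctly and that CUDA is visible:

```python
# sanity_check.py -- verify the core dependencies are importable
import torch
import torchvision
import detectron2
import panopticapi  # noqa: F401  (importing is enough to confirm installation)

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("detectron2:", detectron2.__version__)
print("CUDA available:", torch.cuda.is_available())
```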
python demo_openset.py --ckpt /path/to/swinL/ckpt
👉: Related projects:
We jointly train on COCO and SA-1B data. Please refer to prepare SA-1B data and prepare coco data.
For evaluation, you need to prepare ADE20K for open-set segmentation evaluation and DAVIS 2017 for video object segmentation.
The currently released checkpoints are trained with SA-1B and COCO data.
Name | Config | Training Dataset | Backbone | PQ (COCO) | PQ (ADE) | download |
---|---|---|---|---|---|---|
DINOv | config | SA-1B, COCO | SwinT | 49.0 | 19.4 | model |
DINOv | config | SA-1B, COCO | SwinL | 57.7 | 23.2 | model |
We do detection evaluation on COCO val2017.
`$n` is the number of GPUs you use.
Process visual prompt embeddings for inference. We compute all the instance prompt embeddings of the validation set (you can also use the training set, but the processing time is much longer) and store them. Then we run inference by randomly selecting some visual prompts as in-context examples.
python train_net.py --eval_only --resume --eval_get_content_features --num-gpus 8 --config-file /path/to/configs COCO.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights OUTPUT_DIR=/path/to/outputs
python train_net.py --eval_only --resume --eval_visual_openset --num-gpus 8 --config-file /path/to/configs COCO.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights MODEL.DECODER.INFERENCE_EXAMPLE=16 OUTPUT_DIR=/path/to/outputs
- `configs/dinov_sam_coco_train.yaml` for SwinT and `configs/dinov_sam_coco_swinl_train.yaml` for SwinL.
- Use `configs/dinov_sam_ade_eval.yaml` for ADE evaluation and adjust the batch size of the ADE evaluation to the correct number.
- `OUTPUT_DIR` is the directory to store the visual prompt embeddings.
- `INFERENCE_EXAMPLE` is the number of in-context examples used to represent a category. Default is set to 16.
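As a rough sketch of this two-step procedure (the helper functions and tensor layout below are illustrative, not the repo's actual API): the stored instance prompt embeddings are grouped by category, a handful of them are sampled as in-context examples, and their average acts as a category prototype that query embeddings are matched against.

```python
import torch
import torch.nn.functional as F

def build_category_prototypes(prompt_embeddings, labels, num_examples=16, generator=None):
    """Sample up to `num_examples` stored instance prompt embeddings per category
    and average them into one prototype per category.

    prompt_embeddings: (N, D) tensor of stored visual prompt embeddings
    labels:            (N,)  tensor of category ids, one per embedding
    """
    prototypes = {}
    for cat in labels.unique().tolist():
        idx = (labels == cat).nonzero(as_tuple=True)[0]
        # randomly pick in-context examples for this category
        perm = torch.randperm(idx.numel(), generator=generator)[:num_examples]
        prototypes[cat] = prompt_embeddings[idx[perm]].mean(dim=0)
    return prototypes

def classify_queries(query_embeddings, prototypes):
    """Assign each query embedding to the category with the most similar prototype."""
    cats = sorted(prototypes)
    proto = F.normalize(torch.stack([prototypes[c] for c in cats]), dim=-1)  # (C, D)
    queries = F.normalize(query_embeddings, dim=-1)                          # (Q, D)
    sim = queries @ proto.t()                                                # (Q, C)
    return [cats[i] for i in sim.argmax(dim=1).tolist()]
```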
We evaluate under the DAVIS 2017 Semi-supervised setting; please refer to davis2017-evaluation for more details.
The first step is to compute and store the results on DAVIS 2017. We implement a naive memory-aware approach with our in-context visual prompting.
python train_net.py --eval_track_prev --eval_only --resume --num-gpus 8 --config-file configs/dinov_sam_coco_train.yaml DAVIS.TEST.BATCH_SIZE_TOTAL=8 MODEL.WEIGHTS=/path/to/weights MODEL.DECODER.NMS_THRESHOLD=0.9 MODEL.DECODER.MAX_MEMORY_SIZE=9 OUTPUT_DIR=/path/to/outputs
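For intuition, a naive memory-aware scheme can be sketched as follows (illustrative only; `model.segment` and the object bookkeeping below are hypothetical, and the released code is the reference): prompt embeddings from previous frames are kept in a bounded memory (cf. `MODEL.DECODER.MAX_MEMORY_SIZE`) and reused as visual prompts for the current frame.

```python
from collections import deque

class PromptMemory:
    """Bounded per-object memory of prompt embeddings across video frames."""

    def __init__(self, max_size=9):
        self.max_size = max_size
        self.memory = {}  # object id -> deque of past prompt embeddings

    def update(self, object_id, embedding):
        self.memory.setdefault(object_id, deque(maxlen=self.max_size)).append(embedding)

    def prompts(self, object_id):
        # all remembered embeddings for this object, newest last
        return list(self.memory.get(object_id, []))

# Usage sketch: segment each frame with the remembered prompts,
# then push the newly predicted embedding back into memory.
# memory = PromptMemory(max_size=9)
# for frame in video:
#     for obj_id in objects:
#         mask, emb = model.segment(frame, visual_prompts=memory.prompts(obj_id))  # hypothetical API
#         memory.update(obj_id, emb)
```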
The second step is to evaluate the semi-supervised results.
python evaluation_method.py --task semi-supervised --results_path /path/to/results --davis_path /path/to/davis/data
We currently release the code for training on SA-1B and COCO. It can also support Objects365 and other datasets with minimal modifications.
`$n` is the number of GPUs you use.
Before running the training code, you need to specify the paths to your SA-1B training data.
export DETECTRON2_DATASETS=/path/to/dataset # path to coco, ade
export SAM_DATASET=/path/to/sam_dataset # path to sa-1b data
export SAM_DATASET_START=$start
export SAM_DATASET_END=$end
We convert the SA-1B data into 100 tsv files. `start` (int, 0-99) is the start index of your SA-1B data and `end` (int, 0-99) is the end index.
You can refer to the Semantic-SAM json registration for SAM as a reference for the data preparation.
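For illustration, consuming the shard range from these environment variables might look like the sketch below; the `sa1b_XX.tsv` naming and the inclusive end index are assumptions, so follow the Semantic-SAM registration above for the actual file layout.

```python
import os

# read the shard range from the environment variables set above
sam_root = os.environ["SAM_DATASET"]
start = int(os.environ.get("SAM_DATASET_START", 0))
end = int(os.environ.get("SAM_DATASET_END", 99))

# pick the tsv shards in [start, end]; the file name pattern is hypothetical
shards = [os.path.join(sam_root, f"sa1b_{i:02d}.tsv") for i in range(start, end + 1)]
print(f"training on {len(shards)} SA-1B shards")
```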
We recommend a total batch size of 64 for training, which provides enough positive and negative samples for contrastive learning.
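The batch-size dependence comes from contrastive learning: each prompt embedding is trained to match its own content embedding while treating the rest of the batch as negatives, so a larger batch supplies more negatives. A generic InfoNCE-style loss (a sketch only, not necessarily the exact loss used in this repo) makes that explicit:

```python
import torch
import torch.nn.functional as F

def info_nce(prompt_emb, content_emb, temperature=0.07):
    """Generic InfoNCE loss: matched prompt/content pairs are positives,
    every other pair in the batch is a negative."""
    prompt_emb = F.normalize(prompt_emb, dim=-1)          # (B, D)
    content_emb = F.normalize(content_emb, dim=-1)        # (B, D)
    logits = prompt_emb @ content_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(prompt_emb.size(0), device=prompt_emb.device)
    return F.cross_entropy(logits, targets)
```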
For SwinT backbone
python train_net.py --resume --num-gpus 8 --config-file configs/dinov_sam_coco_train.yaml SAM.TRAIN.BATCH_SIZE_TOTAL=8 COCO.TRAIN.BATCH_SIZE_TOTAL=8
For SwinL backbone
python train_net.py --resume --num-gpus 8 --config-file configs/dinov_sam_coco_swinl_train.yaml SAM.TRAIN.BATCH_SIZE_TOTAL=8 COCO.TRAIN.BATCH_SIZE_TOTAL=8
Add `MODEL.DECODER.COCO_TRACK=True` to enable this task, which can improve the referring segmentation performance on DAVIS.
If you find our work helpful for your research, please consider citing the following BibTeX entry.
@article{li2023visual,
title={Visual In-Context Prompting},
author={Li, Feng and Jiang, Qing and Zhang, Hao and Ren, Tianhe and Liu, Shilong and Zou, Xueyan and Xu, Huaizhe and Li, Hongyang and Li, Chunyuan and Yang, Jianwei and others},
journal={arXiv preprint arXiv:2311.13601},
year={2023}
}