URL

https://arxiv.org/abs/2311.13601
Affiliations
- Feng Li, N/A
- Qing Jiang, N/A
- Hao Zhang, N/A
- Tianhe Ren, N/A
- Shilong Liu, N/A
- Xueyan Zou, N/A
- Huaizhe Xu, N/A
- Hongyang Li, N/A
- Chunyuan Li, N/A
- Jianwei Yang, N/A
- Lei Zhang, N/A
- Jianfeng Gao, N/A
Abstract
- In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities, but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object, falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper, we introduce a universal visual in-context prompting framework for both tasks. In particular, we build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect, yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B, our model achieves $57.7$ PQ on COCO and $23.2$ PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv.
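The PQ figures in the abstract refer to panoptic quality. As a reminder of what the metric measures, here is a minimal sketch assuming the standard definition: a predicted segment and a ground-truth segment match as a true positive when their IoU exceeds 0.5, and PQ is the sum of matched IoUs divided by |TP| + 0.5|FP| + 0.5|FN|. The function name and arguments are hypothetical and are not taken from the DINOv codebase. Benchmark numbers are normally averaged over classes and reported on a 0 to 100 scale, which this sketch does not do.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """Panoptic quality for a single class (illustrative helper, not the paper's code).

    tp_ious: IoU values of matched (prediction, ground truth) pairs, each > 0.5
    num_fp:  number of predicted segments left unmatched
    num_fn:  number of ground-truth segments left unmatched
    """
    # A pair only counts as a true positive when its IoU is strictly above 0.5.
    if any(iou <= 0.5 or iou > 1.0 for iou in tp_ious):
        raise ValueError("matched IoUs must lie in (0.5, 1.0]")

    denominator = len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn
    if denominator == 0:
        # No segments at all for this class; return 0.0 by convention in this sketch.
        return 0.0
    return sum(tp_ious) / denominator


if __name__ == "__main__":
    # Two matched segments, one spurious prediction, one missed ground-truth segment.
    print(panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1))
```

Unmatched predictions and missed segments each add half a count to the denominator, so PQ drops both when segments are poorly localized and when they are spurious or missing.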
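Translation (by gpt-3.5-turbo)
In-context prompting in large language models (LLMs) has become a common technique for improving zero-shot capabilities, but the idea has been little explored in the vision domain. Existing visual prompting methods focus on segmenting the most relevant object and cannot handle many generic vision tasks such as open-set segmentation and detection. This paper introduces a universal visual in-context prompting framework for both tasks. Specifically, it builds on an encoder-decoder architecture and develops a versatile prompt encoder that supports a variety of prompts such as strokes, boxes, and points. It is further extended to accept an arbitrary number of reference image segments as context. Extensive exploration shows that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities, delivers performance competitive with closed-set in-domain datasets, and shows promising results on many open-set segmentation datasets. With joint training on COCO and SA-1B, the model achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code is available at https://github.com/UX-Decoder/DINOv.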
Summary (by gpt-3.5-turbo)
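This work proposes a universal visual in-context prompting framework for the vision domain. It uses an encoder-decoder architecture and develops a prompt encoder that supports a variety of prompts. It is further extended to accept an arbitrary number of reference image segments as context. Experimental results show that the proposed method elicits extraordinary referring and generic segmentation capabilities and achieves competitive performance.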

Visual In-Context Prompting, Feng Li+, N/A, arXiv'23 #1160
