URL

https://arxiv.org/abs/2311.06242
Affiliations
- Bin Xiao, N/A
- Haiping Wu, N/A
- Weijian Xu, N/A
- Xiyang Dai, N/A
- Houdong Hu, N/A
- Yumao Lu, N/A
- Michael Zeng, N/A
- Ce Liu, N/A
- Lu Yuan, N/A
Abstract
- We introduce Florence-2, a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. While existing large vision models excel in transfer learning, they struggle to perform a diversity of tasks with simple instructions, a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms, whether it be captioning, object detection, grounding or segmentation. This multi-task learning setup demands large-scale, high-quality annotated data. To this end, we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images, using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.
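To make "text prompt in, text out" concrete, here is a minimal sketch of how a detection result could be serialised as a text target for a sequence-to-sequence model. It is an illustration only, not the authors' implementation: the `<loc_k>` token format, the 1000-bin quantisation, the prompt wording, and every function name below are assumptions that the abstract does not specify.

```python
from typing import Dict, List, Tuple

# Assumed quantisation resolution; the abstract does not state one.
NUM_BINS = 1000

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    # Clamp so that value == size still lands in the last bin.
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialise one labelled box as the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width),
        quantize(y0, height),
        quantize(x1, width),
        quantize(y1, height),
    ]
    # Hypothetical token format, chosen for illustration.
    return label + "".join(f"<loc_{b}>" for b in bins)


def build_example(
    task_prompt: str,
    detections: List[Tuple[str, Box]],
    width: int,
    height: int,
) -> Dict[str, str]:
    """Pair a task instruction with a text target, as a seq2seq training example."""
    target = "".join(box_to_text(lbl, b, width, height) for lbl, b in detections)
    return {"input_text": task_prompt, "target_text": target}


example = build_example(
    "Locate the objects in the image.",  # hypothetical prompt wording
    [("cat", (0, 0, 320, 240))],
    width=640,
    height=480,
)
print(example["target_text"])  # cat<loc_0><loc_0><loc_500><loc_500>
```

Because captions, boxes, and regions can all be written as token sequences in this way, a single encoder-decoder model can in principle be trained on every task with the same text-generation objective, with only the prompt changing between tasks.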
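Translation (by gpt-3.5-turbo)
We introduce Florence-2, a new vision foundation model. It has a unified, prompt-based representation for handling a variety of computer vision and vision-language tasks. Existing large vision models excel at transfer learning but struggle to perform diverse tasks from simple instructions, because doing so requires handling the complexity of varied spatial hierarchies and semantic granularities. Florence-2 is designed to take a text prompt as the task instruction and to generate the desired result in text form, for tasks such as captioning, object detection, grounding, and segmentation. This multi-task learning setup requires large-scale, high-quality annotated data. To that end, we co-developed FLD-5B, which provides 5.4 billion comprehensive visual annotations on 126 million images, produced with an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluation on numerous tasks shows Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities.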
Summary (by gpt-3.5-turbo)
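Florence-2 is a vision foundation model with a unified, prompt-based representation for handling a variety of vision tasks. It takes a text prompt, performs tasks such as captioning, object detection, grounding, and segmentation, and generates the result in text form. A large-scale annotated dataset, FLD-5B, was also developed. Florence-2 adopts a sequence-to-sequence structure to perform versatile and comprehensive vision tasks, and is a strong model with unprecedented zero-shot and fine-tuning capabilities.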

Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks, Bin Xiao+, N/A, arXiv'23 #1127
