URL

https://arxiv.org/abs/2312.17172
Affiliations
- Jiasen Lu, N/A
- Christopher Clark, N/A
- Sangho Lee, N/A
- Zichen Zhang, N/A
- Savya Khosla, N/A
- Ryan Marten, N/A
- Derek Hoiem, N/A
- Aniruddha Kembhavi, N/A
Abstract
- We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc., into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
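Translation (by gpt-3.5-turbo)
Unified-IO 2 is the first autoregressive multimodal model that can understand and generate images, text, audio, and actions. To unify the different modalities, inputs and outputs such as images, text, audio, actions, and bounding boxes are tokenized, placed in a shared semantic space, and processed by a single encoder-decoder transformer model. Because training on such diverse modalities is difficult, we propose various architectural improvements to stabilize model training. We train the model from scratch with a multimodal mixture-of-denoisers objective on a large multimodal pre-training corpus drawn from diverse sources. To learn a broad range of skills, such as following multimodal instructions, we build an ensemble of 120 datasets and fine-tune on it with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all of our models to the research community.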
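Summary (by gpt-3.5-turbo)
Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating images, text, audio, and actions. To unify the different modalities, it maps inputs and outputs into a shared semantic space and processes them with a single encoder-decoder transformer model. The authors propose various architectural improvements and train the model on a large multimodal pre-training corpus. Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results on more than 35 benchmarks.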

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action, Jiasen Lu+, N/A, arXiv'23 #1202
