We present Unified-IO 2, the first autoregressive multimodal model that is capable of understanding and generating image, text, audio, and action. To unify different modalities, we tokenize inputs and outputs -- images, text, audio, action, bounding boxes, etc. -- into a shared semantic space and then process them with a single encoder-decoder transformer model. Since training with such diverse modalities is challenging, we propose various architectural improvements to stabilize model training. We train our model from scratch on a large multimodal pre-training corpus from diverse sources with a multimodal mixture of denoisers objective. To learn an expansive set of skills, such as following multimodal instructions, we construct and finetune on an ensemble of 120 datasets with prompts and augmentations. With a single unified model, Unified-IO 2 achieves state-of-the-art performance on the GRIT benchmark and strong results in more than 35 benchmarks, including image generation and understanding, natural language understanding, video and audio understanding, and robotic manipulation. We release all our models to the research community.
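As a rough illustration of the unified-sequence idea described above (not the authors' implementation), the sketch below shows how tokens from any modality, once mapped into a shared vocabulary, can be fed to a single encoder-decoder transformer. The vocabulary size, model width, and module names are illustrative assumptions.

```python
# Minimal sketch: every modality is first converted to discrete tokens in a
# shared vocabulary, and one encoder-decoder transformer consumes the
# concatenated token sequence. Sizes and names here are assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000  # hypothetical shared vocabulary over text/image/audio/action tokens
D_MODEL = 512       # hypothetical model width

class UnifiedSeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.transformer = nn.Transformer(d_model=D_MODEL, batch_first=True)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens / tgt_tokens: (batch, seq) token ids from the shared vocabulary,
        # regardless of whether they originated as text, image patches, audio, or actions.
        src = self.embed(src_tokens)
        tgt = self.embed(tgt_tokens)
        hidden = self.transformer(src, tgt)
        return self.lm_head(hidden)  # next-token logits over the shared vocabulary

if __name__ == "__main__":
    model = UnifiedSeq2Seq()
    src = torch.randint(0, VOCAB_SIZE, (2, 16))  # e.g. a tokenized image plus text prompt
    tgt = torch.randint(0, VOCAB_SIZE, (2, 8))   # e.g. a tokenized answer, image, or action
    print(model(src, tgt).shape)  # torch.Size([2, 8, 32000])
```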