URL

https://arxiv.org/abs/2205.05131
Affiliations
- Yi Tay, N/A
- Mostafa Dehghani, N/A
- Vinh Q. Tran, N/A
- Xavier Garcia, N/A
- Jason Wei, N/A
- Xuezhi Wang, N/A
- Hyung Won Chung, N/A
- Siamak Shakeri, N/A
- Dara Bahri, N/A
- Tal Schuster, N/A
- Huaixiu Steven Zheng, N/A
- Denny Zhou, N/A
- Neil Houlsby, N/A
- Donald Metzler, N/A
Abstract
- Existing pre-trained models are generally geared towards a particular class of problems. To date, there still seems to be no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized & unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 & GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised finetuning based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for the UL2 20B & Flan-UL2 20B.
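The Mixture-of-Denoisers and mode-switching ideas in the abstract can be pictured with a small data-preparation sketch. The abstract does not give the denoiser settings, so everything concrete here is an assumption for illustration: the three denoisers (regular span corruption, extreme span corruption, prefix language modeling), the span lengths and corruption rates, the `<extra_id_N>` sentinel naming, and the helper names `span_corrupt`, `prefix_lm`, and `make_example`. The mode tokens `[R]`, `[X]`, and `[S]` stand for the paradigm tokens that tell the model which denoising scheme an example came from.

```python
import random

# (mode token, span length, corruption rate); None marks the prefix-LM denoiser.
# Values are illustrative assumptions, not the paper's exact settings.
DENOISERS = [
    ("[R]", 3, 0.15),    # regular denoising: short spans, low corruption rate
    ("[X]", 12, 0.50),   # extreme denoising: long spans, high corruption rate
    ("[S]", None, None), # sequential denoising: prefix language modeling
]


def span_corrupt(tokens, span_len, rate, rng):
    """Replace random spans with sentinels; return (inputs, targets)."""
    # Start probability chosen so the expected masked fraction is about `rate`.
    p_start = rate / (span_len * (1 - rate) + rate)
    inputs, targets = [], []
    i = 0
    sentinel = 0
    while i < len(tokens):
        if rng.random() < p_start:
            mark = f"<extra_id_{sentinel}>"
            inputs.append(mark)                     # placeholder in the input
            targets.append(mark)                    # same placeholder in the target
            targets.extend(tokens[i:i + span_len])  # followed by the hidden span
            i += span_len
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets


def prefix_lm(tokens, rng):
    """Split at a random point: the prefix is the input, the suffix the target."""
    split = rng.randint(1, len(tokens) - 1)  # needs at least two tokens
    return tokens[:split], tokens[split:]


def make_example(tokens, rng):
    """Sample one denoiser and prepend its mode token to the inputs."""
    mode, span_len, rate = rng.choice(DENOISERS)
    if span_len is None:
        inputs, targets = prefix_lm(tokens, rng)
    else:
        inputs, targets = span_corrupt(tokens, span_len, rate, rng)
    return [mode] + inputs, targets


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, targets = make_example(tokens, rng)
print(inputs, targets)
```

In this sketch, mode switching at fine-tuning time would amount to prepending the mode token that best matches the downstream task (for example `[S]` for open-ended generation) to each input.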
Summary (by gpt-4o-mini)
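This paper proposes a universal framework for pre-training models that separates the pre-training objective from the architecture. It introduces Mixture-of-Denoisers (MoD) and shows the benefit of combining multiple pre-training objectives. The 20B-parameter model achieves SOTA on 50 NLP tasks and also performs well in zero-shot and one-shot settings. With FLAN instruction tuning, the UL2 20B model reaches high performance, and the corresponding checkpoints are released.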


UL2: Unifying Language Learning Paradigms, Yi Tay+, N/A, arXiv'22 #1424
