Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, including five representative LLMs (i.e., GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities across six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
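
As a rough illustration of the FormatCoT idea described above, the sketch below builds a prompt that asks a model to describe, step by step, the format of a target output (here, a LaTeX table) so the description can serve as a format instruction. This is a minimal sketch under our own assumptions: the prompt wording, the helper name `build_formatcot_prompt`, and the toy table are illustrative and are not taken from the paper's released code.

```python
# Minimal FormatCoT-style prompt construction (illustrative sketch only).
# The resulting prompt would be sent to an LLM (e.g., GPT-3.5) to obtain
# a format instruction usable for structure-aware fine-tuning.

def build_formatcot_prompt(target_output: str) -> str:
    """Ask the model to describe the structural format of a target output
    (columns, delimiters, markup) so the description could be reused as an
    instruction for regenerating outputs in the same format."""
    return (
        "Here is a target output:\n"
        f"{target_output}\n\n"
        "Describe, step by step, the structural format of this output "
        "(columns, delimiters, markup), so that your description could "
        "be used as an instruction to generate outputs in this format."
    )

if __name__ == "__main__":
    # Toy LaTeX table standing in for a target output from the dataset.
    latex_table = (
        "\\begin{tabular}{lcc}\n"
        "Model & Acc. & F1 \\\\\n"
        "LLaMA-7B & 0.82 & 0.79\n"
        "\\end{tabular}"
    )
    print(build_formatcot_prompt(latex_table))
```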