Despite the power of Large Language Models (LLMs) like GPT-4, they still struggle with tasks that require generating complex, structured outputs. In this study, we assess the capability of current LLMs in generating complex structured data and propose a structure-aware fine-tuning approach as a solution to improve this ability. To perform a comprehensive evaluation, we propose Struc-Bench, including five representative LLMs (i.e., GPT-NeoX-20B, GPT-3.5, GPT-4, and Vicuna), and evaluate them on our carefully constructed datasets spanning raw text, HTML, and LaTeX tables. Based on our analysis of current model performance, we identify specific common formatting errors and areas of potential improvement. To address complex formatting requirements, we utilize FormatCoT (Chain-of-Thought) to generate format instructions from target outputs. Our experiments show that our structure-aware fine-tuning method, when applied to LLaMA-7B, significantly improves adherence to natural language constraints, outperforming other evaluated LLMs. Based on these results, we present an ability map of model capabilities across six dimensions (i.e., coverage, formatting, reasoning, comprehension, pragmatics, and hallucination). This map highlights the weaknesses of LLMs in handling complex structured outputs and suggests promising directions for future work. Our code and models can be found at https://github.com/gersteinlab/Struc-Bench.
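
As a rough illustration of the FormatCoT idea described above, the sketch below builds a prompt that asks a model to describe, step by step, the format of a target output (here, a LaTeX table) so the description can serve as a format instruction. This is a minimal sketch under our own assumptions: the prompt wording, the helper name `build_formatcot_prompt`, and the toy table are illustrative and are not taken from the paper's released code.

```python
# Minimal FormatCoT-style prompt construction (illustrative sketch only).
# The resulting prompt would be sent to an LLM (e.g., GPT-3.5) to obtain
# a format instruction usable for structure-aware fine-tuning.

def build_formatcot_prompt(target_output: str) -> str:
    """Ask the model to describe the structural format of a target output
    (columns, delimiters, markup) so the description could be reused as an
    instruction for regenerating outputs in the same format."""
    return (
        "Here is a target output:\n"
        f"{target_output}\n\n"
        "Describe, step by step, the structural format of this output "
        "(columns, delimiters, markup), so that your description could "
        "be used as an instruction to generate outputs in this format."
    )

if __name__ == "__main__":
    # Toy LaTeX table standing in for a target output from the dataset.
    latex_table = (
        "\\begin{tabular}{lcc}\n"
        "Model & Acc. & F1 \\\\\n"
        "LLaMA-7B & 0.82 & 0.79\n"
        "\\end{tabular}"
    )
    print(build_formatcot_prompt(latex_table))
```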