FreedomIntelligence / GrammarGPT

The code and data for GrammarGPT.
Apache License 2.0

GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning

✨ Latest News

⚑ Introduction

Welcome to the repository of GrammarGPT.

The implementation repository for NLPCC 2023 Shared Task 1, in which our system achieved third place.

Here is a list of what has been released:

πŸ’­ Overview

We introduce GrammarGPT, an open-source LLM, as a preliminary exploration of its potential for native Chinese grammatical error correction. The core recipe of GrammarGPT is a hybrid dataset of ChatGPT-generated and human-annotated data. For grammatical errors with clues, we propose a heuristic method that guides ChatGPT to generate ungrammatical sentences by providing those clues. For grammatical errors without clues, we collected ungrammatical sentences from publicly available websites and manually corrected them. In addition, we employ an error-invariant augmentation method to enhance the model's ability to correct native Chinese grammatical errors.

πŸ“š Construction of Hybrid Dataset

- This table shows the six main types of grammatical errors made by native Chinese speakers, which fall into two categories: with (w/) and without (w/o) clues. The incorrect sentences are fluent and consistent with the habits of native Chinese speakers; however, they do not conform to Chinese grammar, which makes them harder to correct. We used ChatGPT-generated data for errors with clues and human-annotated data for errors without clues.

ChatGPT-generated Data

Grammatical errors with clues are easy to detect and correct by recognizing the specific clues. For example, using "more than" and "about" together leads to a redundant component, using "the cause" and "caused by" together leads to structural confusion, and using "prompting" and "pace" together leads to improper collocation. Conversely, we can construct ungrammatical sentences by inserting these clues into grammatical sentences. By providing clues collected from public websites, we can instruct ChatGPT to generate ungrammatical sentences that meet our requirements.
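The clue-insertion idea above can be sketched as follows. This is a minimal illustration, not the repository's actual prompt or clue list: the `CLUE_PAIRS` entries simply reuse the English glosses from this section, and the prompt wording is an assumption.

```python
# Illustrative clue pairs; the glosses come from the examples above.
CLUE_PAIRS = [
    ("more than", "about"),       # -> redundant component
    ("the cause", "caused by"),   # -> structural confusion
    ("prompting", "pace"),        # -> improper collocation
]

def build_prompt(grammatical_sentence: str, clue_a: str, clue_b: str) -> str:
    """Compose an instruction asking an LLM to produce an ungrammatical
    variant of a correct sentence by inserting a pair of conflicting clues."""
    return (
        f"Rewrite the sentence so that it uses both '{clue_a}' and '{clue_b}' "
        f"together, producing a fluent but ungrammatical sentence. "
        f"Keep the original meaning.\n"
        f"Sentence: {grammatical_sentence}"
    )

prompt = build_prompt("The project cost about two million yuan.",
                      "more than", "about")
print(prompt)
```

The generated ungrammatical sentence is then paired with the original grammatical one to form a parallel training example.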

Human-annotated Data

For grammatical errors without clues, we collected ungrammatical sentences from public websites and manually annotated them.

Error-invariant Augmentation

Native Chinese grammatical errors are often subtle and rarely occur at the position of named entities. Therefore, we adopt a strategy of substituting the named entities in the parallel data with similar ones (synonyms).
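A minimal sketch of this error-invariant augmentation: the same named entity is swapped in both the ungrammatical source and its correction, so the error pattern itself is unchanged. The entity table and the example sentences are illustrative assumptions, not the authors' actual resources.

```python
import random

# Hypothetical table mapping a named entity to interchangeable ones.
SIMILAR_ENTITIES = {
    "Beijing": ["Shanghai", "Guangzhou"],
    "Monday": ["Tuesday", "Friday"],
}

def augment_pair(source: str, target: str, rng: random.Random):
    """Replace a named entity that appears in BOTH sentences with a
    similar entity, leaving the grammatical error untouched."""
    for entity, candidates in SIMILAR_ENTITIES.items():
        if entity in source and entity in target:
            substitute = rng.choice(candidates)
            return (source.replace(entity, substitute),
                    target.replace(entity, substitute))
    return source, target  # no shared entity found; keep the pair as-is

src = "He arrive in Beijing on Monday."   # ungrammatical source
tgt = "He arrived in Beijing on Monday."  # corrected target
print(augment_pair(src, tgt, random.Random(0)))
```

Because the substitution is applied identically to source and target, the augmented pair still teaches the model the same correction while varying the surface content.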

πŸš€ Training

python finetuning.py
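For supervised fine-tuning, each correction pair is typically serialized into an instruction-tuning record. The field names and instruction wording below are illustrative assumptions, not the schema actually used by `finetuning.py`.

```python
import json

def to_sft_record(ungrammatical: str, corrected: str) -> dict:
    """Wrap a parallel sentence pair in an instruction/input/output record,
    a common format for supervised fine-tuning of open-source LLMs."""
    return {
        "instruction": "Correct the grammatical errors in the sentence.",
        "input": ungrammatical,
        "output": corrected,
    }

record = to_sft_record("He arrive in Beijing on Monday.",
                       "He arrived in Beijing on Monday.")
print(json.dumps(record, ensure_ascii=False))
```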

🧐 Inferencing

python generate.py

πŸ˜€ Acknowledgement

We are aware that our work is inspired by the following works, including but not limited to:

Without them, nothing in this repository would have been possible.

Citation

@inproceedings{fan2023grammargpt,
  title={GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning},
  author={Fan, Yaxin and Jiang, Feng and Li, Peifeng and Li, Haizhou},
  booktitle={CCF International Conference on Natural Language Processing and Chinese Computing},
  pages={69--80},
  year={2023},
  organization={Springer}
}

We are from the School of Data Science, the Chinese University of Hong Kong, Shenzhen (CUHKSZ), and the Shenzhen Research Institute of Big Data (SRIBD).

The first author is a visiting student from Soochow University, and we welcome aspiring individuals to join our group and contribute to the new era of LLMs.
