fani-lab / LADy

LADy 💃: A Benchmark Toolkit for Latent Aspect Detection Enriched with Backtranslation Augmentation

2018, NAACL, Noising and Denoising Natural Language: Diverse Back Translation for Grammar Correction #84

Open Sepideh-Ahmadian opened 1 week ago

Sepideh-Ahmadian commented 1 week ago

Paper Noising and Denoising Natural Language: Diverse Back Translation for Grammar Correction

Introduction This research proposes a solution for data sparsity (the lack of parallel noisy and clean sentence pairs) in grammar correction in the NLP domain. The shortage of such pairs is a bottleneck in developing machine translation models. By noising they mean adding grammatical errors to clean sentences, and by denoising they mean correcting (refining) the noisy sentences using a language model.

Main Problem Grammar correction requires a large corpus of parallel noisy and clean sentences. This article suggests alleviating this problem by generating synthetic noisy data from clean data. To generate the data, they propose a method inspired by back-translation from machine translation.
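The back-translation recipe summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `noising_model` is a hypothetical stand-in for their trained clean-to-noisy sequence transduction model.

```python
def synthesize_training_pairs(clean_sentences, noising_model):
    """Back-translation for grammar correction: run a clean->noisy
    model over clean text to produce synthetic (noisy, clean) pairs
    for training a noisy->clean correction model."""
    pairs = []
    for clean in clean_sentences:
        noisy = noising_model(clean)  # hypothetical clean->noisy transducer
        pairs.append((noisy, clean))
    return pairs

# Toy usage with a trivial rule-based "noiser" standing in for the model:
pairs = synthesize_training_pairs(
    ["I get up at 8 o'clock."],
    lambda s: s.replace("get", "got"),
)
```

The key design point is that only clean monolingual text is needed at generation time; the scarce real parallel data is used once to train the reverse (clean-to-noisy) model.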

Illustrative Example Clean version: "Day after day, I get up at 8 o'clock." Synthesized noisy version: "I got up at 8 o'clock day after day."

Input Noisy sentence (having grammatical mistakes)

Output Clean and grammatically correct sentence

Motivation The authors were motivated by the need to overcome the data sparsity issue in grammar correction. Grammar correction systems often require a large corpus of parallel noisy and clean sentence pairs, which are hard to come by. The motivation was to generate synthetic noisy sentences from clean ones, which would allow training neural models for grammar correction without the need for extensive manually curated data.

Related works and their gaps The paper addresses the lack of realistic, diverse error types in previous methods for synthesizing noisy data (Brockett et al., 2006; Felice, 2016). Previous approaches often generated unrealistic noise or were limited to local context windows (Linzen et al., 2016; Sennrich et al., 2015). The authors aim to generate more realistic, diverse noisy sentences through neural sequence transduction and back-translation techniques.

Contribution of this paper The paper's main contributions include:

- Proposing a neural sequence transduction model for generating synthetic noisy data for grammar correction.
- Introducing several beam search noising procedures to produce diverse and realistic noisy sentences.
- Demonstrating that the synthesized data improves grammar correction performance, nearly matching the performance of models trained on large parallel corpora of real noisy data.

Proposed methods Not included

Experiments The model is evaluated on the CoNLL 2013 and 2014 datasets for grammar correction and the JFLEG test set, which evaluates fluency in grammar correction.

Implementation Not mentioned

Gaps of this work I believe that, given the limited training dataset, the synthesized noisy data may not capture all real-world grammatical errors; therefore, the model may not perform well across various domains.

hosseinfani commented 1 week ago

@Sepideh-Ahmadian I had an idea of fixing grammatical (or any other type of) errors in a sentence using back-translation in an unsupervised way. This is the same idea, right?

Sepideh-Ahmadian commented 1 week ago

@hosseinfani, The purpose of this research project is to generate additional data using a machine-translation-inspired method, creating a corpus of correct and noisy sentence pairs. I think we should do some digging into the grammar correction literature.