Fast Grammatical Error Correction using BERT
Code and pre-trained models accompanying our paper "Parallel Iterative Edit Models for Local Sequence Transduction" (EMNLP-IJCNLP 2019)
PIE is a BERT-based architecture for local sequence transduction tasks like Grammatical Error Correction (GEC). Unlike the standard approach of modeling GEC as translation from an "incorrect" to a "correct" language, we pose GEC as a local sequence editing task. We further reduce the local sequence editing problem to a sequence labeling setup, in which BERT is used to non-autoregressively label input tokens with edits. We rewire the BERT architecture (without retraining) specifically for the task of sequence editing. We find that PIE models for GEC are 5 to 15 times faster than existing state-of-the-art architectures while maintaining competitive accuracy. For more details, please check out our EMNLP-IJCNLP 2019 paper:
@inproceedings{awasthi-etal-2019-parallel,
title = "Parallel Iterative Edit Models for Local Sequence Transduction",
author = "Awasthi, Abhijeet and
Sarawagi, Sunita and
Goyal, Rasna and
Ghosh, Sabyasachi and
Piratla, Vihari",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1435",
doi = "10.18653/v1/D19-1435",
pages = "4259--4269",
}
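To make the editing view above concrete, here is a toy sketch of how per-token edit labels are applied to an input sentence. The edit names and the example sentence are invented for illustration; this is not the repository's code or the paper's exact edit space.

```python
# Toy illustration of PIE's editing view of GEC (not the repository's code).
# Every input token receives one edit label; labels for all positions are
# predicted in parallel (non-autoregressively) and then applied to the input.

def apply_edits(tokens, edits):
    """Apply per-token edits: COPY, DELETE, APPEND_w (insert w after the token),
    REPLACE_w (substitute the token with w)."""
    output = []
    for token, edit in zip(tokens, edits):
        if edit == "COPY":
            output.append(token)
        elif edit == "DELETE":
            continue
        elif edit.startswith("APPEND_"):
            output.extend([token, edit[len("APPEND_"):]])
        elif edit.startswith("REPLACE_"):
            output.append(edit[len("REPLACE_"):])
    return output

tokens = ["he", "go", "to", "school", "yesterday"]
edits  = ["COPY", "REPLACE_went", "COPY", "COPY", "COPY"]
print(" ".join(apply_edits(tokens, edits)))  # he went to school yesterday
```

During inference, a few such rounds of parallel edits are applied iteratively, which is what multi_round_infer.sh does below.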
All the public GEC datasets used in the paper can be obtained from here
Inference using the pretrained PIE ckpt
$ ./multi_round_infer.sh
from the PIE_ckpt directory.

An example usage of the code is provided in the "example_scripts" directory:
preprocess.sh
pie_train.sh
multi_round_infer.sh
m2_eval.sh
end_to_end.sh
More information is available in the README.md inside "example_scripts".
Pre-processing and edit extraction
seq2edits_utils.py
get_edit_vocab.py : Extracts common insertions (the \Sigma_a set described in the paper) from a parallel corpus
get_seq2edits.py : Extracts edits aligned to input tokens (see the sketch after this list)
tokenize_input.py : Tokenizes a file containing sentences; the resulting token_ids go as input to the model
opcodes.py : A class where members are all possible edit operations
transform_suffixes.py: Contains logic for suffix transformations
tokenization.py : Similar to BERT's implementation, with some GEC specific changes
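As a rough illustration of the edit-extraction step, the sketch below derives token-aligned edits from an (incorrect, correct) sentence pair using Python's difflib. get_seq2edits.py implements its own alignment and uses the edit vocabulary and suffix transformations described in the paper, so treat this only as an approximation.

```python
# Rough approximation of extracting token-aligned edits from a parallel pair
# (illustration only; the repository's get_seq2edits.py uses its own alignment
# and the edit vocabulary / suffix transformations described in the paper).
import difflib

def extract_edits(incorrect, correct):
    edits = ["COPY"] * len(incorrect)
    matcher = difflib.SequenceMatcher(a=incorrect, b=correct)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "delete":
            for i in range(i1, i2):
                edits[i] = "DELETE"
        elif op == "replace":
            for i, j in zip(range(i1, i2), range(j1, j2)):
                edits[i] = "REPLACE_" + correct[j]
        elif op == "insert":
            # simplification: attach the insertion to the previous source token
            edits[max(i1 - 1, 0)] = "APPEND_" + " ".join(correct[j1:j2])
    return edits

print(extract_edits(["he", "go", "to", "school"], ["he", "went", "to", "school"]))
# ['COPY', 'REPLACE_went', 'COPY', 'COPY']
```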
PIE model (uses the TensorFlow implementation of BERT)
word_edit_model.py : Implementation of PIE for learning from a parallel corpus of incorrect tokens and aligned edits (a schematic sketch follows this list)
modeling.py : Same as in BERT's implementation
modified_modeling.py
optimization.py : Same as in BERT's implementation
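The sketch below is a minimal NumPy schematic of the sequence-labeling head: each BERT token representation is mapped to scores over the edit vocabulary, and edits at all positions are decoded independently in one parallel step. The weights and dimensions are placeholders; the real model in word_edit_model.py trains this jointly with BERT and factorizes the edit logits as described in the paper.

```python
# Minimal NumPy schematic of non-autoregressive edit labeling (illustration only,
# not the repository's TensorFlow code).
import numpy as np

seq_len, hidden_size, num_edits = 8, 768, 1000   # num_edits = size of the edit vocabulary

rng = np.random.default_rng(0)
encoder_output = rng.normal(size=(seq_len, hidden_size))  # stand-in for BERT token representations
W = rng.normal(size=(hidden_size, num_edits)) * 0.02      # edit-classification weights
b = np.zeros(num_edits)

logits = encoder_output @ W + b            # [seq_len, num_edits]
predicted_edits = logits.argmax(axis=-1)   # one edit id per token, all positions decoded in parallel
print(predicted_edits.shape)               # (8,)
```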
Post-processing
apply_opcode.py : Applies the predicted edit operations (opcodes) to the input sentences to produce corrected output
Creating a synthetic GEC dataset
The errorify directory contains the scripts we used for perturbing the one-billion-word corpus
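For a flavour of what such perturbation looks like, here is a toy error-injection sketch. The operations, probabilities, and confusion set are invented for illustration; the actual rules used by the errorify scripts differ.

```python
# Toy error-injection sketch for creating synthetic (incorrect, correct) pairs.
# Everything below is illustrative; see the errorify scripts for the real rules.
import random

CONFUSIONS = {"their": "there", "then": "than", "a": "the"}  # hypothetical confusion pairs

def errorify(tokens, p=0.15, rng=random):
    noisy = []
    for token in tokens:
        r = rng.random()
        if r < p / 3:
            continue                                # drop the token
        if r < 2 * p / 3 and token in CONFUSIONS:
            noisy.append(CONFUSIONS[token])         # substitute a commonly confused word
            continue
        noisy.append(token)
        if r > 1 - p / 3:
            noisy.append(token)                     # duplicate the token
    return noisy

clean = "she went to the market then came home".split()
print(" ".join(errorify(clean, p=0.5)))
```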
This research was partly sponsored by a Google India AI/ML Research Award and a Google PhD Fellowship in Machine Learning. We gratefully acknowledge Google's TFRC program for providing us with Cloud TPUs. Thanks to Varun Patil for helping us improve the speed of the pre-processing and synthetic-data generation pipelines.