awasthiabhijeet / PIE

Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)
MIT License

Is the edit space consistent during pre-train and fine-tune? #4

Closed Serenade-J closed 4 years ago

Serenade-J commented 4 years ago

Hi, I have a question: is the edit space (\Sigma_a) consistent between pre-training and fine-tuning? Is it derived from the Lang-8 dataset, even though the distribution of the synthetic data differs from Lang-8?

awasthiabhijeet commented 4 years ago

Yes, the edit space is consistent across the synthetic pre-training and fine-tuning steps of training the GEC model; \Sigma_a was generated using the code here. \Sigma_a is composed of word-piece uni-grams (common_inserts.p) and bi-grams (common_multitoken_inserts.p).
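For concreteness, here is a minimal sketch of how one might inspect those pickles. The file paths and the exact unpickled types are assumptions, not guarantees from the repo; the repo's own scripts define the authoritative format.

```python
# Hedged sketch (not from the repo): peeking at the edit-space pickles.
# Assumes common_inserts.p / common_multitoken_inserts.p unpickle to iterable
# collections of word-piece uni-grams and bi-grams respectively.
import pickle

with open("pickles/common_inserts.p", "rb") as f:        # path is an assumption
    unigram_inserts = pickle.load(f)                      # word-piece uni-grams

with open("pickles/common_multitoken_inserts.p", "rb") as f:
    bigram_inserts = pickle.load(f)                       # word-piece bi-grams

print(len(unigram_inserts), "uni-gram insert edits")
print(len(bigram_inserts), "bi-gram insert edits")
print(list(unigram_inserts)[:10])                         # peek at a few entries
```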

However, while generating pseudo data, we did not use a word-piece tokenizer, so the pickle files in the errorify directory are somewhat different: they contain whole words rather than word-pieces. In addition, the replace pickle in the errorify directory contains a mapping between words and the words they are commonly replaced with, which is helpful for introducing systematic errors while generating synthetic GEC data.
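As an illustration of how such a mapping could be used to corrupt clean text, here is a hedged sketch. It is not the repo's errorify code; the dict-of-lists structure of the pickle, its file name, and the function name are all assumptions.

```python
# Hedged sketch: introducing systematic word-replacement errors into a clean
# sentence using a word -> [commonly-confused words] mapping. The pickle's
# structure and all names here are assumptions, not the actual errorify code.
import pickle
import random

def corrupt_with_replacements(sentence, replace_map, prob=0.15):
    """Randomly swap whole words with words they are commonly replaced by."""
    out = []
    for word in sentence.split():
        candidates = replace_map.get(word)
        if candidates and random.random() < prob:
            out.append(random.choice(candidates))  # inject a systematic error
        else:
            out.append(word)
    return " ".join(out)

with open("errorify/replace.p", "rb") as f:  # file name is an assumption
    replace_map = pickle.load(f)

print(corrupt_with_replacements("I have lived in this city for ten years", replace_map))
```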

All the pickles were obtained from diffs over the Lang-8, NUCLE, and FCE datasets.
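Below is a rough illustration of how such pickles could be derived from parallel (incorrect, corrected) sentence pairs via word-level diffs. It uses difflib as a stand-in; the repo's actual diff-extraction procedure may differ, and the toy sentence pairs are made up.

```python
# Rough illustration (not the repo's code): collecting common insert and
# replacement edits from parallel (incorrect, corrected) sentence pairs
# via word-level diffs, e.g. over Lang-8/NUCLE/FCE style data.
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def collect_edits(pairs):
    inserts = Counter()
    replacements = defaultdict(Counter)
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        for op, i1, i2, j1, j2 in SequenceMatcher(None, s, t).get_opcodes():
            if op == "insert":
                inserts.update(t[j1:j2])                  # words inserted by the correction
            elif op == "replace" and (i2 - i1) == (j2 - j1):
                for bad, good in zip(s[i1:i2], t[j1:j2]):
                    replacements[good][bad] += 1          # what 'good' is commonly replaced by
    return inserts, replacements

pairs = [("She go to school", "She goes to school"),
         ("I went store", "I went to the store")]
inserts, replacements = collect_edits(pairs)
print(inserts.most_common(5))
print(dict(replacements))
```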