awasthiabhijeet / PIE

Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)
MIT License

Is the edit space consistent during pre-train and fine-tune? #4

Closed Serenade-J closed 4 years ago

Serenade-J commented 4 years ago

Hi, I have a question: is the edit space (\Sigma_a) consistent between pre-training and fine-tuning? Is it derived from the Lang-8 dataset, even though the distribution of the synthetic data differs from Lang-8?

awasthiabhijeet commented 4 years ago

Yes, the edit space is consistent across the synthetic pre-training and fine-tuning steps of training the GEC model; \Sigma_a was generated using the code here. \Sigma_a is composed of word-piece uni-grams (common_inserts.p) and bi-grams (common_multitoken_inserts.p).
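For concreteness, here is a minimal sketch of how one might inspect those pickles. The file paths and the exact unpickled types are assumptions, not guarantees from the repo; the repo's own scripts define the authoritative format.

```python
# Hedged sketch (not from the repo): peeking at the edit-space pickles.
# Assumes common_inserts.p / common_multitoken_inserts.p unpickle to iterable
# collections of word-piece uni-grams and bi-grams respectively.
import pickle

with open("pickles/common_inserts.p", "rb") as f:        # path is an assumption
    unigram_inserts = pickle.load(f)                      # word-piece uni-grams

with open("pickles/common_multitoken_inserts.p", "rb") as f:
    bigram_inserts = pickle.load(f)                       # word-piece bi-grams

print(len(unigram_inserts), "uni-gram insert edits")
print(len(bigram_inserts), "bi-gram insert edits")
print(list(unigram_inserts)[:10])                         # peek at a few entries
```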

However, while generating pseudo data, we did not use a word-piece tokenizer, so the pickle files in the errorify directory are somewhat different: they contain whole words rather than word-pieces. In addition, the replace pickle in the errorify directory contains a mapping between words and the words they are commonly replaced with, which is helpful for introducing systematic errors while generating synthetic GEC data.
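As an illustration of how such a mapping could be used to corrupt clean text, here is a hedged sketch. It is not the repo's errorify code; the dict-of-lists structure of the pickle, its file name, and the function name are all assumptions.

```python
# Hedged sketch: introducing systematic word-replacement errors into a clean
# sentence using a word -> [commonly-confused words] mapping. The pickle's
# structure and all names here are assumptions, not the actual errorify code.
import pickle
import random

def corrupt_with_replacements(sentence, replace_map, prob=0.15):
    """Randomly swap whole words with words they are commonly replaced by."""
    out = []
    for word in sentence.split():
        candidates = replace_map.get(word)
        if candidates and random.random() < prob:
            out.append(random.choice(candidates))  # inject a systematic error
        else:
            out.append(word)
    return " ".join(out)

with open("errorify/replace.p", "rb") as f:  # file name is an assumption
    replace_map = pickle.load(f)

print(corrupt_with_replacements("I have lived in this city for ten years", replace_map))
```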

All the pickles were obtained from diffs over the Lang-8, NUCLE, and FCE datasets.
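Below is a rough illustration of how such pickles could be derived from parallel (incorrect, corrected) sentence pairs via word-level diffs. It uses difflib as a stand-in; the repo's actual diff-extraction procedure may differ, and the toy sentence pairs are made up.

```python
# Rough illustration (not the repo's code): collecting common insert and
# replacement edits from parallel (incorrect, corrected) sentence pairs
# via word-level diffs, e.g. over Lang-8/NUCLE/FCE style data.
from collections import Counter, defaultdict
from difflib import SequenceMatcher

def collect_edits(pairs):
    inserts = Counter()
    replacements = defaultdict(Counter)
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        for op, i1, i2, j1, j2 in SequenceMatcher(None, s, t).get_opcodes():
            if op == "insert":
                inserts.update(t[j1:j2])                  # words inserted by the correction
            elif op == "replace" and (i2 - i1) == (j2 - j1):
                for bad, good in zip(s[i1:i2], t[j1:j2]):
                    replacements[good][bad] += 1          # what 'good' is commonly replaced by
    return inserts, replacements

pairs = [("She go to school", "She goes to school"),
         ("I went store", "I went to the store")]
inserts, replacements = collect_edits(pairs)
print(inserts.most_common(5))
print(dict(replacements))
```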