Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)
MIT License
Attention mask for computation of replace and append operation #22
Hi, you mentioned in the paper that r_{i}^{l} is computed by attending over h_{j}^{l} for all j except i, whereas a_{i}^{l} is computed by attending over h_{j}^{l} for all j including i.
Why is there this asymmetry — that we cannot use information about the current token x_{i} when computing the replace operation, but do have access to the current token for the append operation?
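For concreteness, here is a minimal sketch of what the two self-attention masks described in the question could look like for a short sequence. The helper name `replace_append_masks` is hypothetical (it is not from the PIE codebase); the convention assumed is 1 = position j is visible, 0 = blocked:

```python
def replace_append_masks(seq_len):
    """Illustrative self-attention masks (hypothetical helper, not repo code).

    Replace: the representation r_i must not see the current token x_i,
    so the diagonal is zeroed. Append: a_i may attend to every position,
    including i, so the mask is all ones.
    """
    # Append mask: all positions visible, including the current token.
    append_mask = [[1] * seq_len for _ in range(seq_len)]
    # Replace mask: identical, except the current token is hidden from itself.
    replace_mask = [[0 if i == j else 1 for j in range(seq_len)]
                    for i in range(seq_len)]
    return replace_mask, append_mask

replace_mask, append_mask = replace_append_masks(4)
```

Intuitively, zeroing the diagonal for replace forces r_i to be predicted from context alone, so the model cannot simply copy x_i back.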