facebookresearch / DisCo

DisCo Transformer for Non-autoregressive MT

seek suggestions on alignment analysis #2

Open alphadl opened 4 years ago

alphadl commented 4 years ago

Hi, Jungo. I am trying to analyze how the alignment changes as the refinement iterations proceed, using force decoding.

The basic idea is to replace the gold target tokens with MASK in a left-to-right fashion, as in the BERT masked language model. However, this cannot capture the dynamics across different refinement iterations. Can you recommend some effective approaches to analyze the alignment? A rough sketch of one interpretation of this masking probe is below.
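Here is a minimal sketch of such a left-to-right masking probe; `model`, `mask_id`, and the returned `cross_attn` tensor are hypothetical placeholders for illustration, not the actual DisCo/fairseq interfaces:

```python
import torch

def l2r_mask_probe(model, src_tokens, tgt_tokens, mask_id):
    """Mask one gold target position at a time (left to right), force-decode,
    and collect the cross-attention row for that position."""
    rows = []
    for t in range(tgt_tokens.size(1)):
        masked = tgt_tokens.clone()
        masked[:, t] = mask_id                  # mask only position t
        with torch.no_grad():
            out = model(src_tokens, masked)     # assumed to expose cross-attention
        rows.append(out["cross_attn"][:, t])    # (batch, src_len) row for position t
    return torch.stack(rows, dim=1)             # (batch, tgt_len, src_len)
```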

jungokasai commented 4 years ago

I am not entirely sure what you mean by alignment here. Do you mean alignment between the source and target words? Also, what do you mean by "force decoding"? Thank you!

alphadl commented 4 years ago

Thanks for your prompt reply ~ I want to check whether iterative refinement mainly improves the alignment quality between the source and target. Researchers normally adopt force decoding for this; this paper, for instance, uses force decoding to analyze the alignment with respect to AER: https://arxiv.org/pdf/1906.10282.pdf.
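For reference, the AER computation itself is simple once hypothesis alignments (e.g. argmax over cross-attention under force decoding) and the gold sure/possible links are available; the formula below is the standard one from Och and Ney (2003), while how the hypothesis links are extracted is left to the experimenter:

```python
def aer(hyp, sure, possible):
    """hyp, sure, possible: sets of (src_idx, tgt_idx) links, with sure a subset of possible."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```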

jungokasai commented 4 years ago

This seems like a nice hypothesis. I had a somewhat related hypothesis that DisCo/CMLM needs many decoder layers because it has to reorder the source to generate the target, whereas autoregressive models just estimate conditionals. This isn't really an analysis of the dynamics, but I did a controlled experiment in my new paper where I used reordered English data for en-de translation (see Sec. 6). One thing we can do here is take this data and see how much we benefit from refinement. If the gain from refinement decreases when we use reordered data instead of the original data, that supports your hypothesis, because we can say that monotonic alignment makes refinement less important. If you're interested in this analysis, here is a link to my reordered data. All English text is reordered using fast_align alignments to the target German sentences.
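In case it helps, a hedged sketch of source-side reordering driven by fast_align output is below; the `i-j` link format is fast_align's, but the specific rule (sort source tokens by the mean position of their aligned target words, keeping unaligned tokens at their original index) is an assumption for illustration, not necessarily the exact procedure used for the released data:

```python
def reorder_source(src_tokens, align_line):
    """src_tokens: list of source words; align_line: fast_align output, e.g. '0-0 1-2 2-1'."""
    links = [tuple(map(int, pair.split("-"))) for pair in align_line.split()]
    keys = []
    for i in range(len(src_tokens)):
        tgt_pos = [j for (s, j) in links if s == i]
        # aligned tokens follow the mean position of their target links;
        # unaligned tokens fall back to their original position
        keys.append(sum(tgt_pos) / len(tgt_pos) if tgt_pos else float(i))
    order = sorted(range(len(src_tokens)), key=lambda i: (keys[i], i))
    return [src_tokens[i] for i in order]
```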

alphadl commented 4 years ago

Thanks for your helpful suggestion and pretrained model 😊. I highly appreciate the deep-encoder shallow-decoder work; it is inspiring and makes the community rethink the in-depth value of non-autoregressive generation rather than just producing a bunch of patching works. BTW, I assume that due to the lack of self-conditionals (Y_t|Y<t), the NAT model conditions more on X, i.e., (Y_t|X), which makes the cross-attention module take more "responsibility" than the self-attention module in the decoder. Also, I guess that beyond reordering ability, the bilingual phrasal information extraction and coverage modeling abilities of NAT are also weaker than those of AT, and that they improve as iterative refinement proceeds. I will validate these later.
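One hedged way to quantify that "responsibility" hypothesis is a gradient attribution probe: compare the gradient of the token loss with respect to the encoder output (what cross-attention consumes) against the gradient with respect to the decoder input embeddings (what self-attention consumes). The tensor names below are placeholders, not DisCo's internals:

```python
import torch

def responsibility_probe(loss, encoder_out, decoder_in_embeds):
    """Both tensors must require grad and belong to the graph that produced `loss`."""
    g_src, g_tgt = torch.autograd.grad(loss, [encoder_out, decoder_in_embeds],
                                       retain_graph=True)
    # a larger relative source gradient suggests heavier reliance on cross-attention
    return g_src.norm().item(), g_tgt.norm().item()
```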

alphadl commented 4 years ago

I have a detailed question about your En-De reordering experiment.

It seems that you reordered not only the English sentences in the training data but also the test and valid data? Otherwise, it's hard to reach 30+ BLEU for En-De. Previous studies that reordered the source word order to reconstruct the training corpus (Du and Way, 2017; Zhao et al., 2018; Kawara et al., 2018; Zhou et al., 2019) have shown that performing the reordering directly at the data level can improve low-resource MT but may harm medium- and large-scale MT.

jungokasai commented 4 years ago