facebookresearch / DisCo

DisCo Transformer for Non-autoregressive MT

seek suggestions on alignment analysis #2

Open alphadl opened 4 years ago

alphadl commented 4 years ago

Hi, Jungo. I am trying to analyze how the alignment changes as the refinement iterations proceed, using force decoding.

The basic idea is to replace the gold target tokens with MASK in a left-to-right fashion, as in the BERT masked language model. However, this cannot capture the dynamics across different refinement iterations. Can you recommend some effective approaches to analyze the alignment? A rough sketch of one interpretation of this masking probe is below.
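Here is a minimal sketch of such a left-to-right masking probe; `model`, `mask_id`, and the returned `cross_attn` tensor are hypothetical placeholders for illustration, not the actual DisCo/fairseq interfaces:

```python
import torch

def l2r_mask_probe(model, src_tokens, tgt_tokens, mask_id):
    """Mask one gold target position at a time (left to right), force-decode,
    and collect the cross-attention row for that position."""
    rows = []
    for t in range(tgt_tokens.size(1)):
        masked = tgt_tokens.clone()
        masked[:, t] = mask_id                  # mask only position t
        with torch.no_grad():
            out = model(src_tokens, masked)     # assumed to expose cross-attention
        rows.append(out["cross_attn"][:, t])    # (batch, src_len) row for position t
    return torch.stack(rows, dim=1)             # (batch, tgt_len, src_len)
```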

jungokasai commented 4 years ago

I am not entirely sure what you mean by alignment here. Do you mean alignment between the source and target words? Also, what do you mean by "force decoding"? Thank you!

alphadl commented 4 years ago

Thanks for your prompt reply ~ I want to check whether iterative refinement mainly improves the alignment quality between the source and target. Researchers normally adopt force decoding for this; this paper, for instance, uses force decoding to analyze the alignment with respect to AER: https://arxiv.org/pdf/1906.10282.pdf.
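For reference, the AER computation itself is simple once hypothesis alignments (e.g. argmax over cross-attention under force decoding) and the gold sure/possible links are available; the formula below is the standard one from Och and Ney (2003), while how the hypothesis links are extracted is left to the experimenter:

```python
def aer(hyp, sure, possible):
    """hyp, sure, possible: sets of (src_idx, tgt_idx) links, with sure a subset of possible."""
    return 1.0 - (len(hyp & sure) + len(hyp & possible)) / (len(hyp) + len(sure))
```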

jungokasai commented 4 years ago

This seems like a nice hypothesis. I had a somewhat related hypothesis that DisCo/CMLM needs many decoder layers because it has to reorder the source to generate the target, whereas autoregressive models just estimate conditionals. This isn't really an analysis of the dynamics, but I did a controlled experiment in my new paper where I used reordered English data for en-de translation (see Sec. 6). One thing we can do here is take this data and see how much we benefit from refinement. If the gain from refinement decreases when we use reordered data instead of the original data, that supports your hypothesis, because we can say that monotonic alignment makes refinement less important. If you're interested in this analysis, here is a link to my reordered data. All English text is reordered using fast_align alignments to the target German sentences.
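In case it helps, a hedged sketch of source-side reordering driven by fast_align output is below; the `i-j` link format is fast_align's, but the specific rule (sort source tokens by the mean position of their aligned target words, keeping unaligned tokens at their original index) is an assumption for illustration, not necessarily the exact procedure used for the released data:

```python
def reorder_source(src_tokens, align_line):
    """src_tokens: list of source words; align_line: fast_align output, e.g. '0-0 1-2 2-1'."""
    links = [tuple(map(int, pair.split("-"))) for pair in align_line.split()]
    keys = []
    for i in range(len(src_tokens)):
        tgt_pos = [j for (s, j) in links if s == i]
        # aligned tokens follow the mean position of their target links;
        # unaligned tokens fall back to their original position
        keys.append(sum(tgt_pos) / len(tgt_pos) if tgt_pos else float(i))
    order = sorted(range(len(src_tokens)), key=lambda i: (keys[i], i))
    return [src_tokens[i] for i in order]
```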

alphadl commented 4 years ago

Thanks for your helpful suggestion and pretrained model 😊. I highly appreciate the deep-encoder shallow-decoder work; it is inspiring and makes the community rethink the in-depth value of non-autoregressive generation rather than just producing a bunch of patching works. BTW, I assume that due to the lack of self-conditionals (Y_t|Y<t), the NAT model conditions more on X, i.e., (Y_t|X), which makes the cross-attention module take more "responsibility" than the self-attention module in the decoder. Also, I guess that beyond reordering ability, the bilingual phrasal information extraction and coverage modeling abilities of NAT are also weaker than those of AT, and that they improve as iterative refinement proceeds. I will validate these later.
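One hedged way to quantify that "responsibility" hypothesis is a gradient attribution probe: compare the gradient of the token loss with respect to the encoder output (what cross-attention consumes) against the gradient with respect to the decoder input embeddings (what self-attention consumes). The tensor names below are placeholders, not DisCo's internals:

```python
import torch

def responsibility_probe(loss, encoder_out, decoder_in_embeds):
    """Both tensors must require grad and belong to the graph that produced `loss`."""
    g_src, g_tgt = torch.autograd.grad(loss, [encoder_out, decoder_in_embeds],
                                       retain_graph=True)
    # a larger relative source gradient suggests heavier reliance on cross-attention
    return g_src.norm().item(), g_tgt.norm().item()
```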

alphadl commented 4 years ago

I have a detailed question about your En-De reordering experiment.

It seems that you reordered not only the English sentences in the training data but also the test and valid data? Otherwise, it's hard to reach 30+ BLEU for En-De. Previous studies that reordered the source word order to reconstruct the training corpus (Du and Way, 2017; Zhao et al., 2018; Kawara et al., 2018; Zhou et al., 2019) have shown that performing the reordering directly at the data level can improve low-resource MT but may harm medium- and large-scale MT.

jungokasai commented 4 years ago