Closed DenglinGo closed 1 year ago
Yes, z_y actually contains information about the source sequence. Since all negatives have to attend to the same source-sequence output H_X (the i-th candidate y_i has z_{y_i} = g(y_i, H_X)), we think this leakage may not affect the optimization of the contrastive loss too much. We did not try removing it in our paper, because extracting the feature from the decoder without cross attention would require modifying the base model, which makes it inconvenient to swap a mainstream MLE model for CoNT.
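To make the setup above concrete, here is a minimal numpy sketch of the idea being discussed: each candidate's feature z_{y_i} = g(y_i, H_X) is obtained by letting the decoder states cross-attend to the same encoder output H_X and then pooling, so every positive and negative sees the identical source representation. The function names (`mean_pool`, `cross_attention`, `cosine`), single-head attention, and random toy tensors are illustrative assumptions, not the actual CoNT implementation.

```python
import numpy as np

def mean_pool(h):
    # h: (seq_len, d) -> (d,) pooled feature representation
    return h.mean(axis=0)

def cross_attention(q, kv):
    # simplified single-head cross attention: decoder states q attend to kv (= H_X)
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 16
H_X = rng.normal(size=(5, d))                              # encoder output of the source
candidates = [rng.normal(size=(7, d)) for _ in range(3)]   # decoder states of 3 candidates

z_x = mean_pool(H_X)
# every candidate feature z_{y_i} = g(y_i, H_X): all cross-attend to the SAME H_X,
# so the "leaked" source information is shared across positives and negatives alike
z_ys = [mean_pool(cross_attention(y, H_X)) for y in candidates]
sims = [cosine(z_x, z_y) for z_y in z_ys]
```

Because the leaked component of each z_{y_i} comes from the one shared H_X, it shifts all candidate similarities in the same way rather than favoring any particular negative, which is the intuition behind the answer above.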
As described in the paper (arXiv:2205.14690): "The feature representations come from pooling the output of the encoder (source sequence) or decoder (target sequence)." However, Transformer decoders contain cross-attention modules; wouldn't this leak source information into the target-sequence feature representations?