facebookresearch / CodeGen

Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

Question regarding Backtranslation #16

Closed wasiahmad closed 3 years ago

wasiahmad commented 3 years ago

Hi,

I have a basic question about why backtranslation works in this scenario. Typically in NLP, we collect some parallel data to train Transformer-like models and then use backtranslation (BT) on a large collection of monolingual data.

In contrast, TransCoder first goes through a pre-training stage and is then trained via BT. Since TransCoder does not have any idea about cross-language generation, at the beginning of BT it would presumably generate the sequence in the same language (Java input to Java output, instead of Python output). So feeding the generated sequence back to reconstruct the original sequence is not going to help the model learn translation. How, then, does backtranslation provide the learning bias to perform translation?
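To make sure we are talking about the same procedure, here is roughly the online BT step I have in mind (the model interface, `translation_loss`, and the language names below are illustrative placeholders, not the actual TransCoder or PLBART API):

```python
# A minimal sketch of one online back-translation step (e.g. Java -> Python -> Java).
# `model.generate` and `translation_loss` are hypothetical placeholders.
import torch

def bt_step(model, src_batch, src_lang, tgt_lang, optimizer):
    # Step 1: translate the batch into the target language without gradients.
    # Early in training, before the model respects the target-language signal,
    # this output may be little more than a copy of the input.
    model.eval()
    with torch.no_grad():
        pseudo_translation = model.generate(src_batch, tgt_lang=tgt_lang)

    # Step 2: train the model to reconstruct the original batch from the
    # generated pseudo-translation; this is the signal that teaches translation.
    model.train()
    loss = translation_loss(model, src=pseudo_translation, src_lang=tgt_lang,
                            tgt=src_batch, tgt_lang=src_lang)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```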

Recently, I tried to apply BT to our model, PLBART, to teach it to perform translation. However, at the very beginning of BT training, when I checked what PLBART generates for a given Java input, I saw that it generates exactly the input sequence, even though generation is conditioned on a prefix token for the target language (Python). For example,

# input
static public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; } [java] 

# output
[python] public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; }

As you can see above, exactly the same sequence is generated. PLBART is pre-trained via denoising auto-encoding (DAE), so it doesn't have any clue about cross-language generation. I am curious: how does TransCoder learn from BT?

If I am not wrong, TransCoder uses a language embedding with each input token (REF). Do you think that could make a difference? Also, can you shed light on TransCoder's structure? It seems like TransCoder does not have a typical sequence-to-sequence architecture.

baptisteroziere commented 3 years ago

Hi,

The BT works because the DAE objective (which you use in PLBART) should already be enough to get a model to translate reasonably well. Since your model just generates the input Java function when you ask it to generate a function with the [python] token at the beginning, am I right to assume that you added this token only when generating sentences, and not when training the DAE or BT objectives?

As you noticed, we use language embeddings with each input token. We didn't try adding one input token for programming languages, but it makes only a marginal difference for NL and I don't think it would make a big difference here either. If you want to make it work with an input token, you need to train your DAE to generate:

input: 
[python] noisy original sentence with masked tokens

output:
[python] original sentence

and BT to generate

input: 
[python] noisy python translation generated with [python] first token

output:
[java] original java sentence

Then your decoder learns to generate Python code only after a [python] token and Java code only after a [java] token, and it should assign a low likelihood to Java code after a [python] token. Tell me if you still have some questions about the structure of TransCoder.
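To make this concrete, the pairs above could be built roughly like this (`add_noise` and the token names are illustrative placeholders, not our actual preprocessing code):

```python
# Rough sketch of building DAE and BT training pairs with a language token
# prepended on both sides, using a plain list-of-tokens representation.

def make_dae_example(tokens, lang):
    """Denoising: reconstruct the clean sequence, in its own language."""
    src = [f"[{lang}]"] + add_noise(tokens)   # noisy input, language token prepended
    tgt = [f"[{lang}]"] + tokens              # clean target, same language token
    return src, tgt

def make_bt_example(pseudo_translation, original_tokens, gen_lang, orig_lang):
    """Back-translation: reconstruct the original from a generated translation."""
    src = [f"[{gen_lang}]"] + add_noise(pseudo_translation)  # e.g. noisy generated Python
    tgt = [f"[{orig_lang}]"] + original_tokens               # e.g. the original Java function
    return src, tgt
```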

wasiahmad commented 3 years ago

@brozi I agree with you; we also expected that after DAE pre-training the model would learn to generate based on the prefix token (target language id), but in reality that is not happening. I have observed similar issues in NLP models, e.g. mBART: they mostly do not generate the sequence in the target language after DAE-based pre-training.

We trained PLBART as follows.

input: 
noisy original sentence with masked tokens [python]

output:
[python] original sentence

We append the language id to the source sequence (instead of prepending it). For BT, we do exactly what you said (the only difference is that we append the lang_id to the source sequence). But initially, PLBART simply generates the input sequence, no matter which language id is given as the prefix. This perhaps shows that PLBART learned to generate the output sequence in the same language as the input (and does not respect the prefix token).
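For comparison, our layout looks roughly like this (again with placeholder helpers, not the actual PLBART preprocessing):

```python
# Sketch of the append-style layout: the language id is appended to the noisy
# source, and the decoder starts from the language id on the target side.

def make_plbart_dae_example(tokens, lang):
    src = add_noise(tokens) + [f"[{lang}]"]   # lang id appended to the noisy source
    tgt = [f"[{lang}]"] + tokens              # decoder output starts with the lang id
    return src, tgt
```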

That's why I asked this question about TransCoder. My guess is that only adding a prefix provides a very weak signal to PLBART, since only the self-attention mechanism in the decoder takes the prefix token into consideration while learning to attend to the previously generated tokens. And in the DAE task, the prefix token does not carry much significance (in my opinion).

On the other hand, since TransCoder uses a language embedding and adds it to every token's embedding, the model perhaps cannot ignore the constraint that the output sequence must be in the target language.
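For reference, here is a simplified sketch (made-up names, not the actual implementation) of what I mean by adding a language embedding at every position:

```python
# XLM/TransCoder-style input embedding sketch: token + position + language
# embeddings are summed at every position.
import torch
import torch.nn as nn

class EmbeddingWithLang(nn.Module):
    def __init__(self, vocab_size, n_langs, max_len, dim):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.pos = nn.Embedding(max_len, dim)
        self.lang = nn.Embedding(n_langs, dim)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: integer index of the language
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        lang_ids = torch.full_like(token_ids, lang_id)
        # The language signal is injected at every position, so the model
        # cannot simply ignore which language it is supposed to produce.
        return self.tok(token_ids) + self.pos(positions) + self.lang(lang_ids)
```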

wasiahmad commented 3 years ago

One more question: what is the structure of TransCoder? It seems like TransCoder does not have a typical sequence-to-sequence architecture. Specifically, I tried to understand this part of the code but couldn't figure it out.

baptisteroziere commented 3 years ago

I agree that adding a language embedding for every token can provide a stronger signal; that's the reason why we chose to do it this way. Still, I would have thought that adding a language token at the beginning of the sentence would be enough to generate a function in the target language, since the model can learn to pay attention to this token. It should at least be able to learn to generate a "def" token instead of "public static" after a [python] token.

I think the main difference could be that you train with the denoising objective until convergence first and only then with the BT objective, while we trained both at the same time. You can then end up in a bad local minimum where the model just copies the input sentence and ignores the language token. If you train with the BT objective at the same time, your model will learn early that the [python] token is followed by something like "def" (or "import" if you train on whole files), and it should work better. Actually, we trained a baseline with denoising for a revision of DOBF, and we had to reload only the encoder to make the unsupervised translation work; otherwise we also get stuck in a state where the model copies the input sentence.
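Roughly, the schedule looks like this (`dae_step` and `bt_step` are hypothetical helpers in the spirit of the BT sketch earlier in this thread, not our actual trainer):

```python
# Denoising (DAE) and back-translation (BT) steps interleaved in the same loop,
# rather than training DAE to convergence and only then starting BT.

def train_epoch(model, optimizer, java_loader, python_loader):
    for java_batch, python_batch in zip(java_loader, python_loader):
        # Denoising keeps generation in each language well-formed.
        dae_step(model, java_batch, lang="java", optimizer=optimizer)
        dae_step(model, python_batch, lang="python", optimizer=optimizer)
        # Running BT from the start teaches the model early on that a [python]
        # prefix is followed by Python code (and vice versa), instead of letting
        # it settle into copying the input.
        bt_step(model, java_batch, src_lang="java", tgt_lang="python", optimizer=optimizer)
        bt_step(model, python_batch, src_lang="python", tgt_lang="java", optimizer=optimizer)
```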

About the part of our code you linked to, we use the same TransformerModel class for our encoders and decoders and just add a class attribute is_decoder to know what kind of transformer it is and whether we should do cross attention.
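A simplified sketch of that design (placeholder names rather than our actual classes; feed-forward blocks, layer norm, and masking omitted):

```python
# One transformer class used for both the encoder and the decoder, with an
# `is_decoder` flag deciding whether cross-attention layers are built and used.
import torch.nn as nn

class TransformerStack(nn.Module):
    def __init__(self, dim, n_heads, n_layers, is_decoder=False):
        super().__init__()
        self.is_decoder = is_decoder
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)
        )
        # Cross-attention only exists when the stack is used as a decoder.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)
        ) if is_decoder else None

    def forward(self, x, encoder_out=None):
        for i, self_attn in enumerate(self.self_attn):
            x = x + self_attn(x, x, x, need_weights=False)[0]
            if self.is_decoder:
                # In decoder mode, also attend to the encoder's output.
                x = x + self.cross_attn[i](x, encoder_out, encoder_out, need_weights=False)[0]
        return x
```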

wasiahmad commented 3 years ago

Thanks a lot for your comments. Perhaps simultaneous training via DAE and BT is the key factor.