Adamits opened this issue 7 months ago (Open)
I thought about this more. Since the residual connections in the transformer just sum the self_attn and mha_attention outputs back into the running representation (with layer norm in between), I don't think we can make this update without fundamentally changing the transformer architecture (e.g., by concatenating them instead, or by projecting one into the size of the other).
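For reference, here is a self-contained sketch of that structure (post-norm ordering, plain torch.nn modules rather than yoyodyne's classes): every residual is a plain sum, so each block has to return a tensor of width d_model.

```python
import torch
from torch import nn

def decoder_layer(x, memory, self_attn, cross_attn, ff, norm1, norm2, norm3):
    # Each sublayer output is summed with its input and then layer-normed,
    # so self_attn, cross_attn, and ff must all return width-d_model tensors.
    x = norm1(x + self_attn(x, x, x, need_weights=False)[0])
    x = norm2(x + cross_attn(x, memory, memory, need_weights=False)[0])
    return norm3(x + ff(x))

d_model, nhead = 256, 4
self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
ff = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, d_model))
norm1, norm2, norm3 = (nn.LayerNorm(d_model) for _ in range(3))

x = torch.randn(4, 7, d_model)        # decoder states
memory = torch.randn(4, 10, d_model)  # encoder output: also has to be d_model wide
print(decoder_layer(x, memory, self_attn, cross_attn, ff, norm1, norm2, norm3).shape)
```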
I think the best thing to do is either:
One place that option 1) gives us an issue is if we want to use an LSTM encoder with a transformer decoder. Then the encoder outputs `hidden_size * num_directions`, but the transformer expects `embedding_size`. This limits the shape of a valid architecture quite a bit. Not sure if that is a problem or not, though.
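To make that concrete, a quick sketch with made-up sizes (plain torch.nn.LSTM, nothing yoyodyne-specific):

```python
import torch
from torch import nn

embedding_size, hidden_size = 128, 256
encoder = nn.LSTM(embedding_size, hidden_size, bidirectional=True, batch_first=True)

source = torch.randn(4, 10, embedding_size)
encoded, _ = encoder(source)
print(encoded.shape)  # torch.Size([4, 10, 512]): hidden_size * num_directions

# A stock transformer decoder built with d_model=embedding_size would expect
# the last dimension of its memory to be 128, not 512.
```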
I think either would be fine. This is a good example of a second type of presupposition we will want to test for before training begins.
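For example, a hypothetical pre-training check (not an existing yoyodyne function) might look like:

```python
def check_transformer_decoder_compatibility(
    encoder_output_size: int, decoder_embedding_size: int
) -> None:
    """Hypothetical pre-training check for the presupposition above."""
    if encoder_output_size != decoder_embedding_size:
        raise ValueError(
            f"Encoder output size ({encoder_output_size}) must match the "
            f"transformer decoder embedding size ({decoder_embedding_size})."
        )
```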
I want to say there was a transformer variant a while back that approached this problem (Sumformer, maybe). But I think the ideal solution would be to simply add an additional perceptron layer to force alignment. Personally, I don't think it's too much variation on the transformer architecture, since everyone and their grandmother creates an in-house variant. (You'll note that no one uses PyTorch's base form.)
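A minimal sketch of that idea, assuming a stock nn.TransformerDecoder (the BridgedTransformerDecoder name and bridge attribute are made up here):

```python
import torch
from torch import nn

class BridgedTransformerDecoder(nn.Module):
    """Hypothetical wrapper: project encoder output into the decoder's
    embedding size before cross-attention ever sees it."""

    def __init__(self, encoder_size: int, embedding_size: int,
                 nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # The single extra perceptron layer that forces alignment.
        self.bridge = nn.Linear(encoder_size, embedding_size)
        layer = nn.TransformerDecoderLayer(
            d_model=embedding_size, nhead=nhead, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, target_embeddings, encoder_hidden):
        # (..., encoder_size) -> (..., embedding_size)
        memory = self.bridge(encoder_hidden)
        return self.decoder(target_embeddings, memory)

decoder = BridgedTransformerDecoder(encoder_size=512, embedding_size=256)
out = decoder(torch.randn(4, 7, 256), torch.randn(4, 10, 512))
print(out.shape)  # torch.Size([4, 7, 256])
```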
Regarding layer norms: what you can mess with is swapping them out for batch norm. It's a bit too late for me to do the maths for the main issue, but it may give more flexibility with variations in depth.
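For the record, a rough sketch of what that swap could look like on (batch, seq, features) tensors (the BatchNormSeq name is made up, and whether it actually helps is untested here):

```python
import torch
from torch import nn

class BatchNormSeq(nn.Module):
    """Hypothetical LayerNorm replacement for (batch, seq, features) tensors,
    normalizing each feature over the batch and sequence with BatchNorm1d."""

    def __init__(self, num_features: int):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        # BatchNorm1d expects (batch, features, seq), so permute around it.
        return self.bn(x.transpose(1, 2)).transpose(1, 2)

print(BatchNormSeq(256)(torch.randn(4, 7, 256)).shape)  # torch.Size([4, 7, 256])
```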
@bonham79 is on point about how everyone has a slightly different transformer variant and it's okay.
It would be convenient to allow the encoder output_size to be different from the TransformerDecoder embedding size. To illustrate the issue, the code snippet below throws:
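(Roughly the following; a minimal sketch with plain torch.nn modules and placeholder values for b, seq_len, hid, and emb.)

```python
import torch
from torch import nn

b, seq_len, tgt_len = 4, 10, 7
hid = 512  # encoder output size
emb = 256  # decoder embedding size

layer = nn.TransformerDecoderLayer(d_model=emb, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=2)

target_embeddings = torch.randn(b, tgt_len, emb)
encoder_hidden = torch.randn(b, seq_len, hid)

# The decoder's cross-attention projects keys/values with emb-sized weights,
# so memory of width hid != emb raises a RuntimeError (shape mismatch).
decoder(target_embeddings, encoder_hidden)
```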
But if I change the code such that `encoder_hidden = torch.randn(b, seq_len, hid)` becomes `encoder_hidden = torch.randn(b, seq_len, emb)`, then it works fine. Essentially, we need the self-attention and the multihead (cross-)attention to expect different input sizes (which may also require the layer norms to change too).
I am putting this up and will try to work out a solution. The easiest way to allow this behavior in yoyodyne would be to project the encoder output size into the decoder embedding size, or vice versa, but I feel that this changes the architecture more than necessary. Instead, I would like to consider whether there is an elegant way to update the sa_block and mha_block such that it does not break other things in the transformer (e.g., layer norm).
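One direction worth considering (just a sketch; the MixedSizeDecoderLayer name is made up): torch.nn.MultiheadAttention already accepts kdim/vdim different from embed_dim, so the cross-attention inside the mha_block can consume encoder-sized keys and values while still emitting embedding-sized outputs, leaving the self-attention, residual sums, and layer norms untouched.

```python
import torch
from torch import nn

class MixedSizeDecoderLayer(nn.TransformerDecoderLayer):
    """Hypothetical decoder layer whose cross-attention reads memory of a
    different width than d_model, via MultiheadAttention's kdim/vdim."""

    def __init__(self, d_model: int, nhead: int, encoder_size: int, **kwargs):
        super().__init__(d_model, nhead, batch_first=True, **kwargs)
        # Replace the stock cross-attention: keys/values may be encoder_size
        # wide, but the output stays d_model, so residuals and norms are fine.
        # (Attention dropout is omitted here for brevity.)
        self.multihead_attn = nn.MultiheadAttention(
            d_model, nhead, kdim=encoder_size, vdim=encoder_size,
            batch_first=True,
        )

b, seq_len, tgt_len, hid, emb = 4, 10, 7, 512, 256
layer = MixedSizeDecoderLayer(d_model=emb, nhead=4, encoder_size=hid)
out = layer(torch.randn(b, tgt_len, emb), torch.randn(b, seq_len, hid))
print(out.shape)  # torch.Size([4, 7, 256])
```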