Open dinhanhx opened 10 months ago
I notice that there are some asserts to make sure the model doesn't accept encoder hidden states or cross-attention.
Let's say I want to implement an encoder-decoder-ish transformer. How can I do that with your simplified transformer blocks?
Hi, thanks for the question. To be honest we haven't thought much about the encoder-decoder attention. I think it would be compatible with our simplified transformer block, with some additional modifications to shaped attention to account for the encoder representations, though I haven't tried it out. If there was interest in implementing it to test if it works I would be happy to be involved.
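For context on what would need to change, here is a minimal numpy sketch of *standard* (non-simplified) cross-attention, the piece the asserts currently reject: queries come from the decoder states, keys and values from the encoder hidden states, with no causal mask. This is just the textbook form, not the shaped-attention variant; the function name and shapes are my own for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_X, enc_H, Wq, Wk, Wv):
    """Textbook cross-attention for reference (hypothetical helper):
    queries from the decoder states, keys/values from the encoder
    hidden states, no causal mask."""
    d = Wq.shape[1]
    Q = dec_X @ Wq                       # (T_dec, d)
    K = enc_H @ Wk                       # (T_enc, d)
    V = enc_H @ Wv                       # (T_enc, d)
    A = softmax(Q @ K.T / np.sqrt(d))    # (T_dec, T_enc), rows sum to 1
    return A @ V                         # (T_dec, d)
```

The question for the simplified block is how to mix this with shaped attention, since the encoder and decoder sequences can have different lengths, so the identity term in shaped attention has no direct analogue here.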
Yesterday I tried to read your code. I'm still not confident; maybe I haven't understood it well enough. I also think it's doable with a bit of modification to your layer. Obviously, your code is based on GPT2Attention.
> If there was interest in implementing it to test if it works I would be happy to be involved.
I may want to implement it, but I think I'll have lots of questions about shaped attention.
So, compared to the OG Transformer block, Eq. 13 doesn't have a weight matrix for V but does have extra α, β, γ parameters, right?
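To make the question concrete, here is my reading of shaped attention as a numpy sketch, assuming the form α·I + β·A(X) − γ·C, where C is the attention matrix with zero logits (data-independent, row-uniform under the causal mask), and where the values carry no weight so the block applies the shaped matrix directly to X. Function names and the single-head layout are my own; please correct me if this misreads the equation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention(X, Wq, Wk, alpha, beta, gamma, causal=True):
    """Sketch of a single shaped-attention head (my reading of Eq. 13):
    the softmax attention matrix A is replaced by
        alpha * I + beta * A - gamma * C,
    where C is the attention matrix with zero logits, and there is no
    value/projection weight: the output is A_shaped @ X."""
    T, d = X.shape
    Q, K = X @ Wq, X @ Wk
    logits = Q @ K.T / np.sqrt(d)
    if causal:
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        logits = np.where(mask, -np.inf, logits)
        zeros = np.where(mask, -np.inf, np.zeros((T, T)))
    else:
        zeros = np.zeros((T, T))
    A = softmax(logits)
    C = softmax(zeros)  # "centering" matrix: uniform over unmasked keys
    A_shaped = alpha * np.eye(T) + beta * A - gamma * C
    return A_shaped @ X  # values are the identity: no W_V, no W_O
```

One sanity check on this reading: at an initialization with α = 1 and β = γ, zero query weights give A = C, so the whole layer reduces to the identity map on X.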
For example, you can compare with BART.