Open dinhanhx opened 10 months ago
I notice that there are some asserts to make sure the model doesn't accept encoder hidden states or cross-attention.
Let's say I want to implement an encoder-decoder-ish transformer. How can I do that with your simplified transformer blocks?
Hi, thanks for the question. To be honest we haven't thought much about the encoder-decoder attention. I think it would be compatible with our simplified transformer block, with some additional modifications to shaped attention to account for the encoder representations, though I haven't tried it out. If there was interest in implementing it to test if it works I would be happy to be involved.
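For context on what would need to change, here is a minimal numpy sketch of *standard* (non-simplified) cross-attention, the piece the asserts currently reject: queries come from the decoder states, keys and values from the encoder hidden states, with no causal mask. This is just the textbook form, not the shaped-attention variant; the function name and shapes are my own for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_X, enc_H, Wq, Wk, Wv):
    """Textbook cross-attention for reference (hypothetical helper):
    queries from the decoder states, keys/values from the encoder
    hidden states, no causal mask."""
    d = Wq.shape[1]
    Q = dec_X @ Wq                       # (T_dec, d)
    K = enc_H @ Wk                       # (T_enc, d)
    V = enc_H @ Wv                       # (T_enc, d)
    A = softmax(Q @ K.T / np.sqrt(d))    # (T_dec, T_enc), rows sum to 1
    return A @ V                         # (T_dec, d)
```

The question for the simplified block is how to mix this with shaped attention, since the encoder and decoder sequences can have different lengths, so the identity term in shaped attention has no direct analogue here.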
Yesterday I tried to read your code. I'm still not confident; maybe I haven't understood it well enough. I also think it's doable with a bit of modification to your layer. Obviously, your code is based on GPT2Attention.
> If there was interest in implementing it to test if it works I would be happy to be involved.
I may want to implement it, but I think I'll have lots of questions about shaped attention.
So, compared to the OG Transformer block, Eq. 13 doesn't have a weight matrix for V but does have extra α, β, γ parameters, right?
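To make the question concrete, here is my reading of shaped attention as a numpy sketch, assuming the form α·I + β·A(X) − γ·C, where C is the attention matrix with zero logits (data-independent, row-uniform under the causal mask), and where the values carry no weight so the block applies the shaped matrix directly to X. Function names and the single-head layout are my own; please correct me if this misreads the equation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shaped_attention(X, Wq, Wk, alpha, beta, gamma, causal=True):
    """Sketch of a single shaped-attention head (my reading of Eq. 13):
    the softmax attention matrix A is replaced by
        alpha * I + beta * A - gamma * C,
    where C is the attention matrix with zero logits, and there is no
    value/projection weight: the output is A_shaped @ X."""
    T, d = X.shape
    Q, K = X @ Wq, X @ Wk
    logits = Q @ K.T / np.sqrt(d)
    if causal:
        mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        logits = np.where(mask, -np.inf, logits)
        zeros = np.where(mask, -np.inf, np.zeros((T, T)))
    else:
        zeros = np.zeros((T, T))
    A = softmax(logits)
    C = softmax(zeros)  # "centering" matrix: uniform over unmasked keys
    A_shaped = alpha * np.eye(T) + beta * A - gamma * C
    return A_shaped @ X  # values are the identity: no W_V, no W_O
```

One sanity check on this reading: at an initialization with α = 1 and β = γ, zero query weights give A = C, so the whole layer reduces to the identity map on X.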
For example, you can compare with BART.