ashokrajab closed this issue 5 months ago
Yes, this is possible - I will share models for this soon (~2 weeks), please stay tuned! Will update you here :)
Sorry it took me a bit longer, but I just released decoder LMs trained for embeddings with bidirectional attention - this is essentially the v2 of sgpt:
Hope this is useful!
If I wanted to generate an embedding for a sentence using a decoder, must it necessarily use causal attention?
e.g., "This is a sample sentence."
Let's say each word is a token. Now, instead of sending in each token one by one with a causal attention mask to get the token embeddings, and then doing a position-weighted mean pooling to get the sentence embedding...
why can't we give the entire sentence all at once and apply full (bidirectional) self-attention to get the sentence embedding? (See the sketch below.)
I get that we are trying to stick to the logic followed during training, but I'm just wondering whether something like this should work.
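For concreteness, here is a minimal PyTorch sketch of the two options being compared. The token vectors are random stand-ins for a real decoder's hidden states, and the position-weighted mean (weights proportional to the position index) follows the pooling scheme described in the SGPT paper; everything else is illustrative, not the repo's actual code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-ins for the per-token hidden states of
# "This is a sample sentence ." inside one attention layer.
seq_len, dim = 6, 16
q = k = v = torch.randn(1, 1, seq_len, dim)  # (batch, heads, seq, dim)

# Option 1: causal attention - token i only attends to tokens <= i,
# matching how a GPT-style decoder was pretrained.
causal_states = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Option 2: full self-attention - every token attends to every token,
# which is what the question proposes.
bidir_states = F.scaled_dot_product_attention(q, k, v, is_causal=False)

def position_weighted_mean(states):
    """SGPT-style weighted mean pooling: later positions get larger
    weights (w_i proportional to i) because, under causal attention,
    later tokens have attended to more of the sentence."""
    s = states.squeeze(0).squeeze(0)                    # (seq, dim)
    w = torch.arange(1, s.shape[0] + 1, dtype=s.dtype)  # 1, 2, ..., seq
    w = w / w.sum()
    return (s * w.unsqueeze(-1)).sum(dim=0)             # (dim,)

causal_embedding = position_weighted_mean(causal_states)

# Under bidirectional attention every position is equally informed,
# so a plain (unweighted) mean is the natural pooling choice.
bidir_embedding = bidir_states.squeeze(0).squeeze(0).mean(dim=0)

print(causal_embedding.shape, bidir_embedding.shape)  # both torch.Size([16])
```

The catch is the train/inference mismatch raised in the question itself: a model pretrained with causal masking has never produced hidden states under full self-attention, so the bidirectional variant only works well once the model is also trained that way - which appears to be exactly what the newly released models above do.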