Bigfield77 opened this issue 3 days ago
Hello,

Looking at Llama 3-8B, each of its 32 layers produces an attention map of shape (1, 32, 7, 7) for a 7-token prompt, so around 50,176 values in total.

If these were flattened and compressed somehow down to the conditioning input PixArt-Sigma expects (300 tokens?), acting like a latent text-embedding space, could that be used to train a model from scratch? Would it be any good?

I guess the masking would also need to be changed?
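For concreteness, here is a minimal PyTorch sketch of the kind of "flatten and compress" adapter I have in mind. Everything here is an assumption for illustration: the class name, the 300-token target length, and the 4096-dim conditioning width (picked to mimic the T5-XXL embeddings PixArt-Sigma normally consumes) are not part of PixArt-Sigma's actual API.

```python
import torch
import torch.nn as nn

class AttnToCondTokens(nn.Module):
    """Hypothetical adapter: treat every (layer, head) attention map from the
    LM as one token of dim q_len*k_len, lift it to the conditioning width,
    then compress the token axis to a fixed length the diffusion model
    could cross-attend to. Shapes follow the numbers in the question."""

    def __init__(self, num_layers=32, num_heads=32, seq_len=7,
                 num_tokens=300, cond_dim=4096):
        super().__init__()
        # 7*7 = 49 -> 4096 per (layer, head) "token"
        self.lift = nn.Linear(seq_len * seq_len, cond_dim)
        # 32*32 = 1024 tokens -> 300 tokens
        self.compress = nn.Linear(num_layers * num_heads, num_tokens)

    def forward(self, attn_maps):
        # attn_maps: list of num_layers tensors, each (batch, num_heads, 7, 7)
        x = torch.stack(attn_maps, dim=1)              # (B, 32, 32, 7, 7)
        b = x.shape[0]
        x = x.view(b, -1, x.shape[-2] * x.shape[-1])   # (B, 1024, 49)
        x = self.lift(x)                               # (B, 1024, 4096)
        x = self.compress(x.transpose(1, 2))           # (B, 4096, 300)
        return x.transpose(1, 2)                       # (B, 300, 4096)

# toy usage with random stand-ins for the Llama attention maps
maps = [torch.rand(1, 32, 7, 7) for _ in range(32)]
print(AttnToCondTokens()(maps).shape)  # torch.Size([1, 300, 4096])
```

With a fixed 300-token output the cross-attention mask could presumably just be all-ones, but variable prompt lengths would still change the 7x7 map shapes, so some padding or masking scheme on the Llama side seems unavoidable.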