Bigfield77 opened this issue 3 days ago
Hello,

Looking at Llama 3-8B, each of its 32 layers produces an attention map of shape (1, 32, 7, 7) for a 7-token prompt, so around 50,176 values in total.

If these were flattened and compressed somehow down to the conditioning input PixArt-Sigma expects (300 tokens?), acting like a latent text-embedding space, could that be used to train a model from scratch? Would it be any good?

I guess the masking would also need to be changed?
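For concreteness, here is a minimal PyTorch sketch of the kind of "flatten and compress" adapter I have in mind. Everything here is an assumption for illustration: the class name, the 300-token target length, and the 4096-dim conditioning width (picked to mimic the T5-XXL embeddings PixArt-Sigma normally consumes) are not part of PixArt-Sigma's actual API.

```python
import torch
import torch.nn as nn

class AttnToCondTokens(nn.Module):
    """Hypothetical adapter: treat every (layer, head) attention map from the
    LM as one token of dim q_len*k_len, lift it to the conditioning width,
    then compress the token axis to a fixed length the diffusion model
    could cross-attend to. Shapes follow the numbers in the question."""

    def __init__(self, num_layers=32, num_heads=32, seq_len=7,
                 num_tokens=300, cond_dim=4096):
        super().__init__()
        # 7*7 = 49 -> 4096 per (layer, head) "token"
        self.lift = nn.Linear(seq_len * seq_len, cond_dim)
        # 32*32 = 1024 tokens -> 300 tokens
        self.compress = nn.Linear(num_layers * num_heads, num_tokens)

    def forward(self, attn_maps):
        # attn_maps: list of num_layers tensors, each (batch, num_heads, 7, 7)
        x = torch.stack(attn_maps, dim=1)              # (B, 32, 32, 7, 7)
        b = x.shape[0]
        x = x.view(b, -1, x.shape[-2] * x.shape[-1])   # (B, 1024, 49)
        x = self.lift(x)                               # (B, 1024, 4096)
        x = self.compress(x.transpose(1, 2))           # (B, 4096, 300)
        return x.transpose(1, 2)                       # (B, 300, 4096)

# toy usage with random stand-ins for the Llama attention maps
maps = [torch.rand(1, 32, 7, 7) for _ in range(32)]
print(AttnToCondTokens()(maps).shape)  # torch.Size([1, 300, 4096])
```

With a fixed 300-token output the cross-attention mask could presumably just be all-ones, but variable prompt lengths would still change the 7x7 map shapes, so some padding or masking scheme on the Llama side seems unavoidable.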