Closed: SinanAkkoyun closed this issue 5 months ago
Intuitively, I think nothing else is needed for the LLaMA architecture: Fuyu's backbone is Persimmon, which is a LLaMA-style architecture with a specially designed tokenizer and a large vocabulary.
@Luodian Thank you for your reply! :) So basically, the 'linear projection' they are talking about is just the tokenizer?
My question is: Is the pixel-by-pixel linear projection encoding just taken care of by the tokenizer?
Thanks again!
No, it's not only the tokenizer; there is also a trainable linear projection layer. It transforms each patch (30x30 pixels) into `config.hidden_size` (4096).
You can see `modeling_fuyu.py` for details:
```python
self.vision_embed_tokens = nn.Linear(
    config.patch_size * config.patch_size * config.num_channels,
    config.hidden_size,
)
```
For the `linear_projection` behavior, see here:
https://github.com/huggingface/transformers/blob/89439fea6458d1a430c6dbcadb983937416090fd/src/transformers/models/fuyu/modeling_fuyu.py#L292
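To make the projection concrete, here is a minimal, self-contained sketch of the patch-to-embedding step described above. The shapes come from the discussion (30x30 patches, 3 channels); the real model uses `hidden_size=4096`, but a smaller size is used here to keep the example light. The unfolding code is illustrative, not Fuyu's actual preprocessing.

```python
import torch
import torch.nn as nn

patch_size, num_channels, hidden_size = 30, 3, 64  # hidden_size is 4096 in Fuyu

# The projection layer from modeling_fuyu.py: flattened patch -> hidden state
vision_embed_tokens = nn.Linear(patch_size * patch_size * num_channels, hidden_size)

# A dummy 90x60 image, unfolded into non-overlapping 30x30 patches
image = torch.randn(1, num_channels, 90, 60)
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, num_channels * patch_size * patch_size)

patch_embeds = vision_embed_tokens(patches)
print(patch_embeds.shape)  # (1, 6, 64): six flattened patches, each projected to hidden_size
```

Each patch is flattened to a 2700-dimensional vector (30 * 30 * 3) and a single `nn.Linear` maps it into the transformer's hidden space, so no convolutional vision encoder is involved.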
Ahhh, thank you so much! I had tried to make sense of `modeling_fuyu` before, but now it clicked. Thank you!
Thanks for your great work! Trying to understand the image processor implementation, it seemed as if the raw pixel values are fed into the model alongside the tokenized text, and I suppose the model itself then does a linear projection.
If one would want to add Fuyu-like image patch understanding to Llama, what exactly would one need to add to the Llama architecture?
Thank you so much!
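Based on the answer above, the addition boils down to one extra linear layer and feeding `inputs_embeds` instead of `input_ids`. Here is a hedged sketch of how that could look for a Llama-like decoder; all names and sizes here are illustrative (not from any library), and the real Fuyu scatters patch embeddings at dedicated image-token positions rather than simply prepending them.

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, patch_dim = 64, 100, 2700  # patch_dim = 30 * 30 * 3

tok_embed = nn.Embedding(vocab_size, hidden_size)        # standard Llama token embedding
vision_embed_tokens = nn.Linear(patch_dim, hidden_size)  # the only new trainable module

input_ids = torch.randint(0, vocab_size, (1, 5))         # 5 text tokens
image_patches = torch.randn(1, 6, patch_dim)             # 6 flattened 30x30x3 patches

# Project the patches and concatenate them with the token embeddings into
# one sequence; the unchanged decoder stack then consumes inputs_embeds.
inputs_embeds = torch.cat([vision_embed_tokens(image_patches), tok_embed(input_ids)], dim=1)
print(inputs_embeds.shape)  # (1, 11, 64)
```

Since the decoder only ever sees a sequence of hidden-size vectors, nothing else in the Llama architecture has to change.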