Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Architecture question OtterHD #334

Closed SinanAkkoyun closed 5 months ago

SinanAkkoyun commented 5 months ago

Thanks for your great work! Upon trying to understand the image processor implementation, it seemed as if the raw pixel values are being fed into the model alongside the tokenized text. I suppose the model itself then does a linear projection etc.

If one would want to add Fuyu-like image patch understanding to Llama, what exactly would one need to add to the Llama architecture?

Thank you so much!

Luodian commented 5 months ago

Intuitively, I think nothing else is needed for the LLaMA architecture, since Fuyu's arch is Persimmon, which is a LLaMA-style architecture with a specially designed tokenizer with a large vocabulary.

https://www.adept.ai/blog/persimmon-8b

SinanAkkoyun commented 5 months ago

@Luodian Thank you for your reply! :) So basically, the 'linear projection' they are talking about is just the tokenizer?

My question is: Is the pixel-by-pixel linear projection encoding just taken care of by the tokenizer?

Thanks again!

Luodian commented 5 months ago

> @Luodian Thank you for your reply! :) So basically, the 'linear projection' they are talking about is just the tokenizer?
>
> My question is: Is the pixel-by-pixel linear projection encoding just taken care of by the tokenizer?
>
> Thanks again!

No, it's not only the tokenizer; there is still a trainable linear projection layer. It uses that linear projection layer to transform each patch (30x30 pixels) into config.hidden_size (4096).

You can see modeling_fuyu.py for details:

self.vision_embed_tokens = nn.Linear(
    config.patch_size * config.patch_size * config.num_channels,
    config.hidden_size,
)

For the linear_projection behavior, see here: https://github.com/huggingface/transformers/blob/89439fea6458d1a430c6dbcadb983937416090fd/src/transformers/models/fuyu/modeling_fuyu.py#L292
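To make the shapes concrete, here is a minimal standalone sketch (plain PyTorch, not the actual Fuyu implementation) of how flattened pixel patches get projected into the LLM's embedding space. The `patch_size=30`, `num_channels=3`, and `hidden_size=4096` values follow the Fuyu config quoted above; the `patchify` helper and the toy image are my own illustration:

```python
import torch
import torch.nn as nn

# Fuyu-style config values: 30x30 RGB patches -> 4096-dim embeddings
patch_size, num_channels, hidden_size = 30, 3, 4096

# The trainable linear projection: 30*30*3 = 2700 raw pixel values per patch
vision_embed_tokens = nn.Linear(
    patch_size * patch_size * num_channels,
    hidden_size,
)

def patchify(image: torch.Tensor) -> torch.Tensor:
    """Split a (C, H, W) image into flattened patches of shape (num_patches, C*P*P)."""
    c, _, _ = image.shape
    # unfold H then W: (C, H/P, W/P, P, P)
    patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    # group the per-patch pixels together: (num_patches, C*P*P)
    return patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)

# Toy image whose sides are divisible by patch_size: a 3x4 grid of patches
image = torch.rand(3, 90, 120)
patch_embeds = vision_embed_tokens(patchify(image))
print(patch_embeds.shape)  # torch.Size([12, 4096])
```

These patch embeddings are then placed into the token-embedding sequence alongside the text embeddings, so the decoder treats image patches like any other tokens.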

SinanAkkoyun commented 5 months ago

Ahhh thank you so much! I've tried to make sense of modeling_fuyu before but now it clicked, thank you so much!