GAIR-NLP / anole

Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation
https://huggingface.co/spaces/ethanchern/Anole

questions on the released model and training #8

Closed · ilovecv closed this issue 1 month ago

ilovecv commented 1 month ago

Hi,

Thanks for sharing the awesome work!

For the released model, was it fine-tuned only on the image token head, or fully fine-tuned? From the paper, it looks like the first one.

For full fine-tuning, in step 1 of https://github.com/GAIR-NLP/anole/blob/main/training/README.md: do we need to change modeling_chameleon.py?

[Screenshot from 2024-07-09 comparing the original and modified modeling_chameleon.py]

I don't see any difference between the original and the modified file.

EthanC111 commented 1 month ago

Thank you for your interest!

For the released model, was it fine-tuned only on the image token head, or fully fine-tuned? From the paper, it looks like the first one.

Yes, it is the first one.

in step 1: do we need to change modeling_chameleon.py?

Yes! You just have to comment out that code. The original modeling_chameleon.py disables the image logits.
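
To illustrate what the change removes (this is a sketch, not the exact upstream code; the function and argument names here are illustrative), the unmodified file suppresses image-token logits roughly like this:

```python
import torch

def mask_image_logits(logits: torch.Tensor, image_token_ids: list[int]) -> torch.Tensor:
    """Suppress image-token logits, as the unmodified modeling_chameleon.py does.

    Setting these logits to the dtype's minimum drives their softmax
    probability to effectively zero, so the model can only emit text tokens.
    Commenting out this masking is what re-enables image generation.
    """
    # logits: (batch, seq_len, vocab_size); image_token_ids indexes the vocab dim.
    logits[:, :, image_token_ids] = torch.finfo(logits.dtype).min
    return logits
```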

ilovecv commented 1 month ago

Thanks for your quick reply!

For full fine-tuning, if we train both the image head and the text head, how can we preserve the text generation and image understanding capabilities? Did you try any experiments, or can you share any insights?

BTW, do you have results for the 30B model?

EthanC111 commented 1 month ago

Thank you for your interest!

Fine-tuning Chameleon might be a bit tricky. Initially, we tried full-parameter fine-tuning to enable image and multimodal generation from Chameleon, but it didn't really work. So we decided to fine-tune only the image head, which did work. I suggest using Anole directly for further fine-tuning on your downstream tasks, as Anole already possesses full capabilities for multimodal understanding and generation.
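
To make "fine-tune only the image head" concrete: since Chameleon uses a single LM head over the whole vocabulary, one way to restrict updates to the image-token rows is to freeze everything else and mask the head's gradient. A minimal sketch, assuming a Chameleon-style model with an `lm_head` linear layer and a list of image-token ids (illustrative only, not the exact Anole training code):

```python
import torch

def restrict_training_to_image_head(model, image_token_ids: list[int]) -> None:
    """Freeze all parameters except the lm_head rows for image tokens.

    Assumes `model` has a single `lm_head` whose weight is
    (vocab_size, hidden_size) and that `image_token_ids` lists the
    image codebook token ids.
    """
    # Freeze the whole model, then re-enable the output head.
    for p in model.parameters():
        p.requires_grad = False
    model.lm_head.weight.requires_grad = True

    # Zero the gradient of every non-image row, so optimizer steps only
    # change the image-token part of the head.
    mask = torch.zeros_like(model.lm_head.weight)
    mask[image_token_ids] = 1.0
    model.lm_head.weight.register_hook(lambda grad: grad * mask)
```

Because the text pathway stays frozen under this scheme, the model's text generation and understanding behavior is preserved while the image-token head learns to generate.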

We will be releasing the 30B version soon! Stay tuned!