GAIR-NLP / anole

Anole: An Open, Autoregressive and Native Multimodal Model for Interleaved Image-Text Generation
https://huggingface.co/spaces/ethanchern/Anole

the difference between training and facilitating_image_generation #3

Closed: vd001 closed this issue 1 month ago

vd001 commented 1 month ago

I find that "facilitating_image_generation" only fine-tunes the output-layer parameters corresponding to image token IDs. What scenarios is the resulting model suited for, e.g., unconditional image generation?

the "training" project finetunes all the params, might be a good solution for various downstream applications.

JoyBoy-Su commented 1 month ago

Thank you for your interest! In fact, 'facilitating_image_generation' is intended to facilitate Chameleon's image generation capabilities. If you want to fine-tune the original Chameleon on your own dataset for better image generation, you can refer to the code in this section.

EthanC111 commented 1 month ago

Hi, thanks for your interest! We discovered that fine-tuning only the image head is enough to facilitate the vision generation capabilities of Chameleon, whereas direct full fine-tuning might not be the best approach for that purpose. However, since Anole has already facilitated these capabilities, standard full-parameter fine-tuning should work if you want to fine-tune Anole. We are still working on finding the best fine-tuning method for Anole and Chameleon on downstream tasks and will keep you updated on any progress we make!
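
As a rough illustration of the full-parameter route on a checkpoint where image generation already works, the sketch below is just an ordinary training loop with nothing frozen or masked. The tiny stand-in model and random token IDs are placeholders for the real Anole weights and an interleaved image-text dataset; the actual fine-tuning should go through the repo's own training code.

```python
import torch
import torch.nn as nn

# Tiny stand-in for the language model; in practice you would load the
# released Anole weights with the repo's own loading/training utilities.
class TinyLM(nn.Module):
    def __init__(self, vocab=1024, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, ids):
        return self.lm_head(self.embed(ids))

model = TinyLM()
for p in model.parameters():
    p.requires_grad_(True)   # full fine-tuning: no freezing, no gradient masks

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One dummy next-token-prediction step; real batches would be interleaved
# image/text token sequences from the downstream dataset.
ids = torch.randint(0, 1024, (2, 16))
logits = model(ids[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   ids[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```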