GAIR-NLP / anole

Anole: An Open, Autoregressive, and Native Multimodal Model for Interleaved Image-Text Generation
https://huggingface.co/spaces/ethanchern/Anole

questions about the image generation? #9

Open mutonix opened 1 month ago

mutonix commented 1 month ago

Thanks for sharing this great work!

I have some questions about image generation in Anole. Does Anole use the VQGAN decoder from Chameleon? Since Chameleon has also released the VQGAN weights for image generation (though they state that the image generation function is disabled), what does Anole add on top?

Many thanks!

EthanC111 commented 1 month ago

Thank you for your interest!

Yes, Anole uses the same VQGAN as Chameleon. As you mentioned, the open-sourced version of Chameleon doesn't support vision generation; Anole unlocks Chameleon's image and multimodal generation capabilities. We will upgrade Anole with new functionality soon! Stay tuned for more updates!

mutonix commented 1 month ago

Thank you for the quick reply. I'm curious: if we directly use the Chameleon VQGAN to generate images, will it work? Or does the model have to be fine-tuned, as Anole did, to activate the image generation capability? Did you experiment with directly applying the Chameleon VQGAN without fine-tuning?

EthanC111 commented 1 month ago

The VQGAN part seems to work pretty well. According to our experiments, the reconstructed images look pretty much the same as the original images.

EthanC111 commented 1 month ago

We did not change the VQGAN tokenizer.

matbee-eth commented 1 month ago

> Thank you for your interest!
>
> Yes, Anole uses the same VQGAN as Chameleon. As you mentioned, the open-sourced version of Chameleon doesn't support vision generation; Anole unlocks Chameleon's image and multimodal generation capabilities. We will upgrade Anole with new functionality soon! Stay tuned for more updates!

I would love a semi-descriptive (ELI a 40 year old full stack eng) writeup on how this is achieved

EthanC111 commented 1 month ago

> I would love a semi-descriptive (ELI a 40 year old full stack eng) writeup on how this is achieved

@matbee-eth Thank you for your interest! This is our paper: https://arxiv.org/abs/2407.06135

mutonix commented 1 month ago

Can you further explain this question? Many thanks!

> Does Chameleon have to be fine-tuned, as Anole was, to activate the intrinsic image generation capability that is disabled? Did you experiment with directly applying the original Chameleon weights to generate images without fine-tuning (since the VQGAN decoder weights are provided by Meta, and Chameleon theoretically can generate images without fine-tuning)?

b2r66sun commented 1 month ago

I'd also like to know whether you tuned the VQGAN or directly used the weights from Chameleon. Many thanks!

EthanC111 commented 1 month ago

Hi @mutonix, Chameleon doesn't support image generation. For more information, please see this issue. Anole is fine-tuned from Chameleon to enable image generation and multimodal generation.

Hi @b2r66sun, we did not tune the VQGAN. We directly use the VQGAN provided by Chameleon.

mutonix commented 1 month ago

In the issue you mentioned, he does not say that he commented out the following code (or similar code) in the original Chameleon implementation:

```python
image_tokens = self.model.vocabulary_mapping.image_tokens
logits[:, :, image_tokens] = torch.finfo(logits.dtype).min
```

Maybe that is why he could not get correct images. Have you tried commenting out the above code to generate images directly? My confusion is whether only fine-tuning can activate the image generation capability, or whether just commenting out a few lines of code is enough.
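For intuition, here is a minimal sketch of what the masking lines above do: setting the logits of all image-token positions to the minimum representable value means softmax assigns them (effectively) zero probability, so the sampler can never emit an image token. This uses plain Python floats and a made-up toy vocabulary rather than the actual `transformers` code; `mask_image_token_logits`, the token IDs, and the logit values are all hypothetical.

```python
import math

def mask_image_token_logits(logits, image_token_ids):
    """Return a copy of the logits with image-token positions set to -inf,
    so they can never be sampled. This mirrors the effect of the masking
    lines quoted above (which use torch.finfo(logits.dtype).min)."""
    masked = list(logits)
    for tid in image_token_ids:
        masked[tid] = -math.inf
    return masked

def argmax(logits):
    """Index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Toy vocabulary: tokens 0-2 are text tokens, tokens 3-4 are image tokens.
logits = [0.1, 0.5, 0.2, 2.0, 1.5]
image_token_ids = [3, 4]

print(argmax(logits))  # 3 -> an image token would be sampled
print(argmax(mask_image_token_logits(logits, image_token_ids)))  # 1 -> only text tokens remain
```

Under this reading, the masking only hides an ability that is still present in the weights, which is consistent with the question of whether removing the mask alone (without fine-tuning) is enough to recover usable images.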

AbrahamSanders commented 1 month ago

@mutonix I can't find anything like that in the original Chameleon implementation, only in the transformers version distributed with Anole (for fine-tuning purposes?): modeling_chameleon.py#L1627

I tried swapping the original Chameleon 7b weights for Anole 7b and running the original Chameleon Miniviewer. It appears to be capable of generating coherent images only when using the Anole weights.

Yuheng-Li commented 1 month ago

The original Chameleon release seems to include the VQGAN decoder, so how did Chameleon ban the image generation ability?

What does Anole do to activate this ability? For example, does Chameleon mask out the logits corresponding to image tokens in the last layer, and did Anole add them back?