apple / ml-4m

4M: Massively Multimodal Masked Modeling
https://4m.epfl.ch
Apache License 2.0

Example of generating image pixels from ImageBind modality #4

Closed SecretMG closed 5 months ago

SecretMG commented 5 months ago

Thanks for your excellent work!

I would like to inquire if you could provide some examples or documentation on how to use 4m to generate images from ImageBind feature or tokens. Your guidance on this matter would be greatly appreciated.

Thank you for your time and assistance.

ofkar commented 5 months ago

Hi @SecretMG,

You can repurpose the demo notebook for this by defining 'tok_imagebind@224' as the conditioning domain and 'tok_rgb@224' as the target domain. Here is a minimal example (you can also add more target domains before predicting RGB, e.g. for better grounding):

```python
cond_domains = ['tok_imagebind@224']
target_domains = ['tok_rgb@224']
tokens_per_target = [196]
autoregression_schemes = ['roar']
```
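As a side note on `tokens_per_target`: here is a sketch of where 196 comes from, assuming the tokenizer operates on a 16x16-pixel patch grid over the 224x224 input (standard ViT-style patching; verify against the tokenizer config):

```python
# Where tokens_per_target = [196] comes from, assuming a
# 16x16-pixel patch grid (assumed patch size; check the config).
image_size = 224
patch_size = 16

grid = image_size // patch_size   # 14 patches per side
num_tokens = grid * grid          # 14 * 14 = 196

print(num_tokens)  # -> 196
```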

You can play with other generation parameters, e.g. 'decoding_steps', 'token_decoding_schedules', 'temps', etc., to control the trade-off between generation diversity and fidelity. Please take a look at the generation README for tips and details.
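For reference, a hedged sketch of how these knobs might sit alongside the domain configuration above. The parameter names are the ones mentioned in this message, but the values and the schedule name are purely illustrative; take the actual schema and recommended settings from the generation README:

```python
# Illustrative values only; consult the generation README for the
# real schema and recommended settings.
cond_domains = ['tok_imagebind@224']
target_domains = ['tok_rgb@224']
tokens_per_target = [196]
autoregression_schemes = ['roar']

decoding_steps = [25]                  # more steps -> higher fidelity, slower
token_decoding_schedules = ['linear']  # schedule name is an assumption
temps = [3.0]                          # higher temperature -> more diversity

# One entry per target domain, so these lists must stay aligned.
assert len(decoding_steps) == len(target_domains)
assert len(temps) == len(target_domains)
```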

To compute 'tok_imagebind@224' for a given image, first compute the ImageBind features using the pretrained ImageBind model, then tokenize those features using the ImageBind tokenizer we provide (you can see it under the Load tokenizers section of the notebook):

```python
'tok_imagebind': VQVAE.from_pretrained('EPFL-VILAB/4M_tokenizers_ImageBind-H14_8k_224-448').eval().to(device),
```
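Conceptually, tokenizing the features means snapping each patch feature to its nearest codebook entry and keeping that entry's index. Below is a minimal NumPy sketch of that vector-quantization step, with a random codebook and dummy shapes (the 8192-entry codebook mirrors the "8k" in the tokenizer name, the feature dimension is a toy value); the real tokenizer is the pretrained VQVAE above, not this code:

```python
import numpy as np

rng = np.random.default_rng(0)

codebook_size, dim = 8192, 32   # 8k codes; toy feature dim (illustrative)
num_patches = 196               # 14x14 grid for a 224x224 input

codebook = rng.normal(size=(codebook_size, dim)).astype(np.float32)
features = rng.normal(size=(num_patches, dim)).astype(np.float32)

# Squared L2 distance from every patch feature to every codebook entry,
# computed via the expansion |f - c|^2 = |f|^2 - 2 f.c + |c|^2.
d2 = ((features ** 2).sum(1)[:, None]
      - 2.0 * features @ codebook.T
      + (codebook ** 2).sum(1)[None, :])

# Token = index of the nearest codebook entry for each patch.
tokens = d2.argmin(axis=1)      # shape (196,), ints in [0, 8192)

print(tokens.shape)  # -> (196,)
```

The decoder side of the VQVAE then maps such token indices back to the feature (or pixel) space, which is what the 4M generation pipeline relies on.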

Alternatively, you can predict 'tok_imagebind@224' directly from the RGB input using the same notebook, and then use those tokens as the conditioning as exemplified at the top. This would of course be a circular generation, i.e. RGB -> ImageBind tokens -> RGB tokens, but it serves as a sanity check for how the generation should behave.

Best, Oguzhan