NasirKhalid24 / CLIP-Mesh

Official implementation of CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

Diffusion Prior? #2

Closed · dboshardy closed this issue 1 year ago

dboshardy commented 1 year ago

Does the diffusion prior need to come from the pretrained model in the README, or can we swap in any prior?

That is, assuming the prior works with whatever code is loading it. For instance, could I swap in a different latent diffusion model in place of DALLE2-pytorch?

NasirKhalid24 commented 1 year ago

It currently supports the DALLE-2 prior, whose parameters you can edit by changing the following lines in the config:

# Text-Image Prior Related
prior_path:  weights/model.pth            # Path to weights for the prior network, not used if prior_path empty
# prior_path:                             # Leave empty like this to use only text prompt

## Parameters for diffusion prior network (code by lucidrains)
diffusion_prior_network_dim: 512
diffusion_prior_network_depth: 12
diffusion_prior_network_dim_head: 64
diffusion_prior_network_heads: 12
diffusion_prior_network_normformer: false

## Parameters for diffusion prior (code by lucidrains)
diffusion_prior_embed_dim: 512
diffusion_prior_timesteps: 1000
diffusion_prior_cond_drop_prob: 0.1
diffusion_prior_loss_type: l2
diffusion_prior_condition_on_text_encodings: false

For any other diffusion model, you would need to update loop.py under the code commented with # Setup Prior model & get image prior (text embed -> image embed) so that it loads the different diffusion model.
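
For reference, a minimal sketch of what that setup roughly corresponds to with lucidrains' dalle2-pytorch, wired to the config values above (the exact code in loop.py and the checkpoint layout may differ):

import torch
from dalle2_pytorch import DiffusionPrior, DiffusionPriorNetwork

# Mirrors the diffusion_prior_network_* values from the config
prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 12,
    dim_head = 64,
    heads = 12,
    normformer = False,
)

# Mirrors the diffusion_prior_* values from the config
diffusion_prior = DiffusionPrior(
    net = prior_network,
    image_embed_dim = 512,
    timesteps = 1000,
    cond_drop_prob = 0.1,
    loss_type = "l2",
    condition_on_text_encodings = False,
)

# prior_path from the config points at the pretrained weights
diffusion_prior.load_state_dict(torch.load("weights/model.pth", map_location="cpu"))

# Text tokens in -> predicted CLIP image embedding out
# image_embed = diffusion_prior.sample(tokenized_text)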

dboshardy commented 1 year ago

@NasirKhalid24 I started looking into this and got it working with the Hugging Face transformers version of CLIP and Stable Diffusion, but the output isn't very good. It slightly deforms the starting sphere but never gets close to a table in the log images. I'm guessing this is an issue with the loss function? Should I also change the loss function? Was the averaged cosine similarity chosen because you used DALLE-2-pytorch?
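
(For context, the averaged cosine similarity loss referred to here measures agreement between CLIP embeddings of the rendered views and a target embedding. A minimal illustrative version, a sketch rather than the repo's exact code:)

import torch
import torch.nn.functional as F

def avg_cosine_loss(render_embeds: torch.Tensor, target_embed: torch.Tensor) -> torch.Tensor:
    # render_embeds: (n_views, d) CLIP embeddings of the rendered images
    # target_embed:  (d,) text embedding or prior-predicted image embedding
    render_embeds = F.normalize(render_embeds, dim=-1)
    target_embed = F.normalize(target_embed, dim=-1)
    # Maximizing similarity == minimizing the negative mean cosine similarity
    return -(render_embeds @ target_embed).mean()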

NasirKhalid24 commented 1 year ago

I think you will have to experiment quite a bit with the different parameters and the overall methodology. How Stable Diffusion is integrated will play a big role, since our work does not support anything like it: Stable Diffusion is a text -> image generator, whereas we use a text embedding -> image embedding generator (the DALLE-2 prior).
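
To make the distinction concrete, a minimal sketch of where the optimization target comes from in each case (using the OpenAI clip package; the prompt and the sample call are illustrative):

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a table"]).to(device)

with torch.no_grad():
    text_embed = clip_model.encode_text(tokens)

# Without a prior: the renders are optimized toward text_embed directly.
# With the DALLE-2 prior: text is first mapped to a predicted image
# embedding, and the renders are compared against that instead:
# target_embed = diffusion_prior.sample(tokens)
# Stable Diffusion, by contrast, maps text encodings all the way to an
# image, so there is no intermediate image embedding to use as a target.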

dboshardy commented 1 year ago

@NasirKhalid24 Shouldn't the CLIP image and text embeddings/encodings used by Stable Diffusion work in the same way, though?

NasirKhalid24 commented 1 year ago

Different CLIP model sizes can lead to different results, and I believe Stable Diffusion may use ViT-L/14, so you would need to adjust the parameters to find the best ones for that case.
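
(The embedding widths also differ between variants, which has to stay consistent with the prior dims in the config; a quick check with the OpenAI clip package:)

import clip

# The config's diffusion_prior_network_dim / diffusion_prior_embed_dim (512)
# match ViT-B/32, while Stable Diffusion's text encoder is CLIP ViT-L/14,
# whose embeddings are 768-wide.
for name in ["ViT-B/32", "ViT-L/14"]:
    model, _ = clip.load(name)
    print(name, model.visual.output_dim)  # 512 for ViT-B/32, 768 for ViT-L/14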