This repository contains the official implementation of ContextDiff, published at ICLR 2024.
Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing
Ling Yang, Zhilong Zhang, Zhaochen Yu, Jingwei Liu, Minkai Xu, Stefano Ermon, Bin Cui
Peking University, Stanford University
Overview of our ContextDiff
We propose a novel and general cross-modal contextualized diffusion model (ContextDiff) that harnesses cross-modal context to improve the learning capacity of cross-modal diffusion models for tasks including text-to-image generation and text-guided video editing.
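At a high level, the cross-modal context acts as a text- and sample-dependent shift that is added to the mean of the standard diffusion forward process and propagated to all timesteps. The snippet below is only a minimal, illustrative sketch of this idea; the adapter interface and per-timestep weighting (context_adapter, shift_scale) are placeholders, and the actual formulation is given in the paper and code:

import torch

def contextualized_forward_sample(x0, text_emb, t, alphas_cumprod, context_adapter, shift_scale):
    # Standard DDPM forward sample: sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    # Cross-modal contextual shift: an adapter maps the (visual sample, text embedding)
    # interaction to a bias added to the diffusion mean (hypothetical interface)
    shift = context_adapter(x0, text_emb)
    mean = abar_t.sqrt() * x0 + shift_scale[t].view(-1, 1, 1, 1) * shift
    return mean + (1 - abar_t).sqrt() * noise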
[2024.1] Our main code along with demo images and videos is released.
Video editing demos (source → edited):
Sunflower → Rose | Sunflower → Carnation
Schnauzer → Golden | Schnauzer → Husky
Bird → Squirrel | Cat → Dog
Road → Frozen lake | Woman → Astronaut
Shark → Goldfish | Mallard → Black swan
Environment Setup
git clone https://github.com/YangLing0818/ContextDiff.git
cd ContextDiff
conda create -n ContextDiff python==3.8
conda activate ContextDiff
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
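To quickly verify that the environment is ready (optional; this assumes torch is installed by requirements.txt), you can run:

python -c "import torch, clip; print(torch.__version__, torch.cuda.is_available())"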
Install Xformers to Save Memory
We recommend using xformers to save memory:
wget https://github.com/ShivamShrirao/xformers-wheels/releases/download/4c06c79/xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
pip install xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
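If the Stable Diffusion backbone is loaded through Hugging Face diffusers (an assumption about how the training script builds the model; the model id below is only an example), memory-efficient attention can then be enabled with a single call:

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.enable_xformers_memory_efficient_attention()  # uses the xformers wheel installed above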
Download Model Weights
We choose Stable Diffusion as our diffusion backbone. You can download the model weights with our download.py script in the 'ckpt/' folder:
cd ckpt
python download.py
wget "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt"
wget "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt"
wget "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt"
cd ..
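As an optional sanity check, the downloaded CLIP checkpoints can be loaded directly by path with the CLIP package installed above (run from the repository root):

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
# clip.load accepts either a model name or a path to a downloaded checkpoint
model, preprocess = clip.load("ckpt/ViT-B-16.pt", device=device)
print(model.visual.input_resolution)  # 224 for ViT-B/16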
Finetune with ContextDiff
You can reproduce our video editing results by running:
CUDA_VISIBLE_DEVICES=0 python ContextDiff_finetune.py --config config/rose.yaml
You can also try your own video samples with a personalized config file: put the video frames into the './data' folder and the config file into './config'. Please note that using the adapter/shifter pretrained in the text-to-image generation part can further enhance the semantic alignment of the edited videos.
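For example, a typical workflow for a custom video could look like the following; the folder and config names are placeholders, and the keys to edit inside the YAML (prompts, frame path, etc.) follow the provided examples such as config/rose.yaml:

mkdir -p data/my_video
cp /path/to/frames/*.jpg data/my_video/
cp config/rose.yaml config/my_video.yaml   # edit prompts and data path for your video
CUDA_VISIBLE_DEVICES=0 python ContextDiff_finetune.py --config config/my_video.yaml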
The edited videos and finetuned checkpoints are placed in './result':
result
├── name
│   ├── checkpoint_50
│   ├── checkpoint_100
│   ├── ......
│   ├── checkpoint_200
│   └── sample
│       ├── sample_50
│       └── ......
@inproceedings{yang2024crossmodal,
title={Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing},
author={Ling Yang and Zhilong Zhang and Zhaochen Yu and Jingwei Liu and Minkai Xu and Stefano Ermon and Bin Cui},
booktitle={International Conference on Learning Representations},
year={2024}
}