The MIG Benchmark of MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis (CVPR 2024)
In the text-to-image task, complex prompts that describe multiple instances with rich attributes and layout information place higher demands on existing generators and their derived generation techniques. To evaluate how well these techniques handle complex instances and attributes, we designed the COCO-MIG benchmark.
The MIG bench is built on COCO images and their layouts, taking the color attribute of instances as the starting point. It filters out layouts with small areas and instances related to humans, and assigns a random color to each remaining instance. Using fixed templates, it then constructs a global prompt for each image. A benchmark built this way retains the relatively natural layout distribution of COCO while introducing complex attributes and counterfactual cases through random color assignment, which greatly increases the difficulty of generation and makes the benchmark challenging.
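The filtering and prompt-construction steps above can be sketched as follows. The color palette, area threshold, and template string here are illustrative assumptions for the sketch, not the exact values used by the benchmark:

```python
import random

# Illustrative color palette and prompt template (assumptions; the actual
# benchmark defines its own color list and templates).
COLORS = ["red", "blue", "green", "yellow", "black", "white", "purple", "brown"]

def build_global_prompt(instances, min_area_ratio=0.02, seed=None):
    """Filter out human instances and small layouts, assign each remaining
    instance a random color, and compose a global prompt from a template.

    `instances` is a list of (category_name, box_area_ratio) pairs, where
    box_area_ratio is the box area divided by the image area."""
    rng = random.Random(seed)
    kept = []
    for name, area_ratio in instances:
        if name == "person" or area_ratio < min_area_ratio:
            continue  # drop humans and layouts that are too small
        kept.append(f"a {rng.choice(COLORS)} {name}")
    return "masterpiece, best quality, " + " and ".join(kept)

prompt = build_global_prompt(
    [("person", 0.3), ("cat", 0.2), ("car", 0.01)], seed=0
)
```

The person is filtered by category and the car by area, so only the cat (with a randomly assigned color) survives into the prompt.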
During evaluation, we use the GroundedSAM model to detect and segment each instance. We then analyze each object's color distribution in the HSV color space and compute the proportion of pixels matching the target color to decide whether the object's color meets the requirement. The proportion of instances generated correctly in both attribute and position, together with their MIOU, reflects the model's performance in position and attribute control.
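The HSV color check can be sketched as below. The hue ranges and the success threshold are illustrative assumptions; the actual thresholds are defined in the evaluation code:

```python
import colorsys

# Illustrative hue ranges in degrees (assumptions; the real evaluation code
# defines its own HSV thresholds, including saturation/value bounds).
HUE_RANGES = {
    "red": [(0, 30), (330, 360)],
    "green": [(90, 150)],
    "blue": [(210, 270)],
}

def color_ratio(mask_pixels, target):
    """Fraction of an instance's segmented pixels whose hue falls in the
    target color's range. `mask_pixels` is an iterable of (r, g, b) tuples
    in [0, 255] taken from the instance's segmentation mask."""
    ranges = HUE_RANGES[target]
    hits = total = 0
    for r, g, b in mask_pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        hue = h * 360
        total += 1
        if any(lo <= hue < hi for lo, hi in ranges):
            hits += 1
    return hits / total if total else 0.0

# An instance counts as correctly colored when the matching-pixel
# proportion exceeds a threshold (0.2 here is an assumed value).
mask_pixels = [(255, 0, 0)] * 8 + [(0, 0, 255)] * 2  # mostly red pixels
is_correct = color_ratio(mask_pixels, "red") >= 0.2
```

In practice the evaluation also conditions on saturation and value (e.g. to separate black and white from hued colors), which this hue-only sketch omits.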
You can find more details in our Paper.
conda create --name eval_mig python=3.8 -y
conda activate eval_mig
conda config --append channels conda-forge
conda install pytorch==1.11.0 torchvision cudatoolkit=11.3.1 -c pytorch
export AM_I_DOCKER=False
export BUILD_WITH_CUDA=True
export CUDA_HOME=/path/to/cuda-11.3/
python -m pip install -e segment_anything
python -m pip install -e GroundingDINO
pip install opencv-python pycocotools matplotlib onnxruntime onnx nltk imageio supervision==0.7.0 protobuf==3.20.2 pytorch_fid
Note that you should install GroundingDINO on a machine with a GPU so that the evaluation code runs properly with CUDA. If you encounter problems, you can refer to Issue for more details.
To run the evaluation process, you need to download some model weights.
Download the GroundingDINO checkpoint:
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
You should also download the ViT-H SAM model from SAM.
You can also manually download the weights for Bert.
If you want to test CLIP scores, you'll also need to download the CLIP model weights.
Put all these checkpoints under the ../pretrained/ folder:
├── pretrained
│ ├── bert-base-uncased
│ │ ├── config.json
│ │ ├── pytorch_model.bin
│ │ ├── tokenizer_config.json
│ │ ├── tokenizer.json
│ │ └── vocab.txt
│ ├── clip
│ │ ├── config.json
│ │ ├── merges.txt
│ │ ├── preprocessor_config.json
│ │ ├── pytorch_model.bin
│ │ ├── special_tokens_map.json
│ │ ├── tokenizer_config.json
│ │ ├── tokenizer.json
│ │ └── vocab.json
│ ├── groundingdino_swint_ogc.pth
│ └── sam_vit_h_4b8939.pth
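Before running the evaluation, you can sanity-check that all weights are in place. This helper is just a convenience sketch, not part of the benchmark code:

```python
from pathlib import Path

# Key files expected under the pretrained/ folder (a representative subset).
EXPECTED = [
    "bert-base-uncased/pytorch_model.bin",
    "clip/pytorch_model.bin",
    "groundingdino_swint_ogc.pth",
    "sam_vit_h_4b8939.pth",
]

def missing_checkpoints(pretrained_dir):
    """Return the expected checkpoint files that are absent under the folder."""
    root = Path(pretrained_dir)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]

missing = missing_checkpoints("../pretrained")
if missing:
    print("Missing weights:", ", ".join(missing))
```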
You can choose to resample prompts for evaluation; the complete steps are described in Resampling.
You can also generate your images on the 800 prompts that have already been sampled from MIG-Bench.
Use the sampled prompts and layouts to generate images.
You can try our MIGC method; we hope you enjoy it.
Finally, you can start evaluating your model.
python eval_mig.py \
--need_miou_score \
--need_instance_sucess_ratio \
--metric_name 'eval' \
--image_dir /path/of/image/
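MIOU averages, over all instances, the IoU between each instance's target box and the box GroundedSAM detects for it in the generated image, counting a missed or wrongly attributed instance as zero. A minimal sketch of that computation (not the benchmark's actual implementation):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def mean_iou(pairs):
    """Average IoU over (target_box, detected_box_or_None) pairs; an
    instance that was not detected (or failed the attribute check)
    contributes 0 to the average."""
    ious = [box_iou(t, d) if d is not None else 0.0 for t, d in pairs]
    return sum(ious) / len(ious) if ious else 0.0

score = box_iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1/7, about 0.143
```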
We resampled a version of the COCO-MIG benchmark, filtering out examples related to humans. Based on this new version of the benchmark, we sampled 800 images and compared our method with InstanceDiffusion, GLIGEN, and others. The results on MIG-Bench are shown below. You can also find the generated image results and benchmark layout information for some of the methods in the Example.
Here L2–L6 denote images containing two to six instances.

Method | MIOU↑ L2 | MIOU↑ L3 | MIOU↑ L4 | MIOU↑ L5 | MIOU↑ L6 | MIOU↑ Avg | Success Rate↑ L2 | Success Rate↑ L3 | Success Rate↑ L4 | Success Rate↑ L5 | Success Rate↑ L6 | Success Rate↑ Avg | Model Type | Publication
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Box-Diffusion | 0.37 | 0.33 | 0.25 | 0.23 | 0.23 | 0.26 | 0.28 | 0.24 | 0.14 | 0.12 | 0.13 | 0.16 | Training-free | ICCV2023 |
Gligen | 0.37 | 0.29 | 0.253 | 0.26 | 0.26 | 0.27 | 0.42 | 0.32 | 0.27 | 0.27 | 0.28 | 0.30 | Adapter | CVPR2023 |
ReCo | 0.55 | 0.48 | 0.49 | 0.47 | 0.49 | 0.49 | 0.63 | 0.53 | 0.55 | 0.52 | 0.55 | 0.55 | Full model tuning | CVPR2023 |
InstanceDiffusion | 0.52 | 0.48 | 0.50 | 0.42 | 0.42 | 0.46 | 0.58 | 0.52 | 0.55 | 0.47 | 0.47 | 0.51 | Adapter | CVPR2024 |
Ours | 0.64 | 0.58 | 0.57 | 0.54 | 0.57 | 0.56 | 0.74 | 0.67 | 0.67 | 0.63 | 0.66 | 0.66 | Adapter | CVPR2024 |
MIG-Bench is built on GroundedSAM, SAM, CLIP, BERT, and GroundingDINO. We appreciate their outstanding contributions.
If you find this repository useful, please use the following BibTeX entry for citation.
@misc{zhou2024migc,
title={MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis},
author={Dewei Zhou and You Li and Fan Ma and Xiaoting Zhang and Yi Yang},
year={2024},
eprint={2402.05408},
archivePrefix={arXiv},
primaryClass={cs.CV}
}