š¤ Demo [Coming Soon] š Paper š¦ Twitter
In this work, we propose LLMScore, a new framework that offers evaluation scores with multi-granularity compositionality. LLMScore leverages the large language models (LLMs) to evaluate text-to-image models. Please check out our paper "LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation". Our work is accepted at NeurIPS 2023!
The two images are generated using Stable-Diffusion-2 based on the text prompt sampled from the Concept Conjunction dataset. Baseline section shows the scores from the existing model-based evaluation metrics, Human section is the rating score from the human evaluation, LLMScore section is our proposed metric. The right column also shows the rationale generated by LLMScore.
Comparison of Text-Image Matching, Sentence Matching, and our LLM-based Instruction-Following Matching pipeline for text-to-image synthesis evaluation. Our LLMScore automatically provides accurate scores and reasonable rationales for text-to-image synthesis based on text prompts, and visual descriptions following various evaluation instructions.
Please follow install page to set up the environments and models.
Get score with rationale for evaluating the alignment between image and text prompt.
python llm_score.py --image sample/sample.png --text_prompt "a red car and a white sheep"
Try different LLMs by setting LLM_ID as one of ["gpt-4", "gpt-3.5-turbo", "vicuna"]:
python llm_score.py --image sample/sample.png --text_prompt "a red car and a white sheep" --llm_id LLM_ID
Notice that to use Vicuna, follow Part Install and Part Model Weights in FastChat_README to install fastchat and to obtain the Vicuna weights. To enable OpenAI-compastible APIs used in our repo, follow commands from Guideline to launch the controller, model worker and RESTful API server as below:
python3 -m fastchat.serve.controller
python3 -m fastchat.serve.model_worker --model-name 'vicuna-7b-v1.1' --model-path /path/to/vicuna/weights
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
The rank correlation (Kendall's tau) is aggregated across the compositional prompt dataset (Concept Conjunction, Attribute Binding Contrast) on the left two columns (CompBench) and the general prompt dataset (MSCOCO, DrawBench, PaintSkills) on the right two columns (GeneralBench).
If you found this repository useful, please consider cite our paper:
@misc{lu2023llmscore,
title={LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation},
author={Yujie Lu and Xianjun Yang and Xiujun Li and Xin Eric Wang and William Yang Wang},
year={2023},
eprint={2305.11116},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
This repo benefits from BLIP-2, GRIT, GPT-4. Thank for their awesome works!