✨ VideoRepair can (1) detect misalignments by generating and answering fine-grained evaluation questions, (2) plan refinements, (3) decompose regions, and finally (4) conduct localized refinement.
You can install all packages from requirements.txt.
conda create -n videorepair python==3.10
conda activate videorepair
pip install -r requirements.txt
Additionally, for Semantic-SAM, you need to install detectron2 as follows:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
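Optionally, you can sanity-check the detectron2 installation with a one-line import (this command is just a check, not part of the repo):
python -c "import detectron2; print(detectron2.__version__)"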
Our VideoRepair is based on GPT-4 / GPT-4o, so you need to set up your Azure OpenAI API config in the files below. You can find your keys in the Azure Portal. We recommend using python-dotenv to store and load your keys.
DSG/openai_utils.py
DSG/dsg_questions_gen.py
DSG/query_utils.py
DSG/vqa_utils.py
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="<your Azure OpenAI endpoint>",
    api_key="<your API key>",
    api_version="<your API version>",
)
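For example, if you go the python-dotenv route, you can keep the values in a local .env file and read them with load_dotenv() / os.getenv() inside the files above; the variable names below are placeholders we chose, not names required by the repo:
AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your key>
AZURE_OPENAI_API_VERSION=<your api version>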
VideoRepair's region decomposition is based on Molmo and Semantic-SAM; you can download them as follows:
Also, for initial video generation, you should set up your T2V models. In our main paper, we use VideoCrafter2 and T2V-turbo.
We provide a demo (run_demo.sh) for your own prompt! This demo uses main_iter_demo.py.
output_root="your output root"
prompt="your own prompt"

# --model:           base T2V model
# --seed (first):    global random seed (used for initial video generation)
# --selection_score: video ranking method
# --round:           iteration round
# --seed (last):     localized generation seed
CUDA_VISIBLE_DEVICES=1,2 python main_iter_demo.py --prompt="$prompt" \
    --model="t2vturbo" \
    --output_root="$output_root" \
    --seed=123 \
    --load_molmo \
    --selection_score='dsg_blip' \
    --round=1 \
    --seed=369
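After setting the placeholders above, you can run the command directly, or launch the provided script (assuming run_demo.sh wraps the same command):
bash run_demo.sh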
VideoRepair is tested on EvalCrafter and T2V-CompBench.
We provide our $dsg^{obj}$ questions in ./datasets. The structure is as below:
./datasets
├── compbench
│ ├── consistent_attr.json
│ ├── numeracy.json
│ ├── spatial_relationship.json
├── evalcrafter
│ ├── dsg_action.json
│ ├── dsg_color.json
│ ├── dsg_count.json
│ ├── dsg_none.json
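If you want to peek at one of these question files, a simple inspection command (not a repo script, just an example using Python's built-in json.tool) is:
python -m json.tool ./datasets/evalcrafter/dsg_count.json | head -n 20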
Based on the above question sets, you can run the benchmarks as follows:
output_root="your output path"            # output path
eval_sections=("count" "action" "color")  # eval dimensions for each benchmark (e.g., count)

# --model:           T2V model backbone
# --selection_score: video ranking metric
# --seed:            random seed
# --round:           iteration round
# --k:               number of video candidates
# --div_seeds:       use a diverse seed per iterative round
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
        --output_root="$output_root" \
        --eval_section="$section" \
        --model='t2vturbo' \
        --load_molmo \
        --selection_score='dsg_blip' \
        --seed=123 \
        --round=1 \
        --k=10 \
        --div_seeds
done
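If you only need a single evaluation dimension, you can skip the loop and call main_iter.py once with the same flags; the command below is just the loop body unrolled for the count section:
CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
    --output_root="$output_root" \
    --eval_section="count" \
    --model='t2vturbo' \
    --load_molmo \
    --selection_score='dsg_blip' \
    --seed=123 \
    --round=1 \
    --k=10 \
    --div_seeds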
💗 If you find our VideoRepair useful, citing our paper would be the best way to support us!
@misc{lee2024videorepair,
title={VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement},
author={Daeun Lee and Jaehong Yoon and Jaemin Cho and Mohit Bansal},
year={2024},
eprint={2404.xxxx},
archivePrefix={arXiv},
primaryClass={cs.CV}
}