Karine-Huang / T2I-CompBench

[NeurIPS 2023] T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
https://arxiv.org/pdf/2307.06350.pdf
MIT License

Code for compared methods #12

AshkanTaghipour opened this issue 7 months ago (status: Open)

AshkanTaghipour commented 7 months ago

Hi, thank you for your interesting work and in-depth analysis. I would like to ask about the possibility of releasing the code for the compared methods in Table 2.

Karine-Huang commented 7 months ago

Hi! The compared methods are all from the official repos, and their usage is explained in T2I-CompBench. The CLIP code can be found in CLIP_similarity, with the argument "--complex" set to False for Table 2. B-CLIP uses BLIP to generate captions and CLIP to calculate the score. B-VQA-n uses the direct prompt as the question, without disentangling it. The models are also from the official repos, with the same generator as in inference_eval.
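
For reference, a minimal sketch of the image-text similarity behind the CLIP column, using the Hugging Face transformers API; the backbone and file paths here are assumptions, and the repo's CLIP_similarity scripts may load the model and prompts differently.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hedged sketch of the image-text CLIP similarity behind the "CLIP" column.
# The backbone and file paths are assumptions; CLIP_similarity may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a green bench and a red car"       # example prompt from the paper
image = Image.open("example_output.png")     # hypothetical generated image

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)

clip_score = (img_emb @ txt_emb.T).item()    # cosine similarity
print(f"CLIP image-text similarity: {clip_score:.4f}")
```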

AshkanTaghipour commented 6 months ago

Thank you. For the BLIP-VQA evaluation, would the final accuracy be the average of the answers in the JSON file named "vqa_result.json" in the "examples/annotation_blip/" directory?

Karine-Huang commented 6 months ago

Yes, the BLIP-VQA score is the average of the results from "vqa_result.json". We have updated BLIPvqa_eval/BLIP_vqa.py (L#117-120) to include the calculation of the average. Thank you!
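
For anyone reproducing this, a minimal sketch of the averaging step, assuming vqa_result.json is a list of records whose "answer" field holds the per-question score; the exact schema of the repo's output may differ.

```python
import json

# Hedged sketch: assumes vqa_result.json is a list of records whose "answer"
# field holds a numeric score; the repo's exact schema may differ.
with open("examples/annotation_blip/vqa_result.json") as f:
    results = json.load(f)

scores = [float(r["answer"]) for r in results]
blip_vqa_score = sum(scores) / len(scores)
print(f"BLIP-VQA score: {blip_vqa_score:.4f}")
```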

Retr0573 commented 6 months ago

Hello, and an early Happy New Year! As you said, I can see the B-VQA, CLIP, UniDet, and MiniGPT-CoT evaluations in your project, but B-CLIP and B-VQA-n are not included, right? I just want to confirm this. Thank you for this work, it helps a lot!

AshkanTaghipour commented 6 months ago

Thank you for your response. For the UniDet evaluation there is one file after evaluation, called vqa_result.json; however, for BLIP there are additional files, such as color_test.json, that provide a one-to-one mapping between a given image and its result. Is it possible to add something similar for UniDet as well? Attached is the vqa_result.json for the UniDet evaluation; as you know, analysing such a result is hard when working with several generated images per prompt (several seeds).

Karine-Huang commented 6 months ago

Hello! The mapping file is added in UniDet_eval/determine_position_for_eval.py, and it will be saved as mapping.json in the same directory as vqa_result.json. Thank you!
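
A small sketch of how such a mapping can be used to recover per-image scores, assuming mapping.json relates each result entry to an image file and vqa_result.json stores one numeric score per entry; the key names and directory below are assumptions, so check the files the script actually writes.

```python
import json
import os

# Hedged sketch: key names ("question_id", "answer") and the directory are
# assumptions; inspect the files written by determine_position_for_eval.py.
result_dir = "examples/annotation_obj_detection"   # hypothetical output directory
with open(os.path.join(result_dir, "vqa_result.json")) as f:
    scores = {str(r["question_id"]): float(r["answer"]) for r in json.load(f)}
with open(os.path.join(result_dir, "mapping.json")) as f:
    mapping = json.load(f)                         # entry id -> image file name

per_image = {mapping[qid]: score for qid, score in scores.items() if qid in mapping}
for image_name, score in sorted(per_image.items()):
    print(image_name, score)
```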

Karine-Huang commented 6 months ago

Hello! For B-CLIP, we use the same approach as in Attend-and-Excite. You can use the official BLIP repo to generate captions and then use CLIP text-text similarity to calculate the similarity score between the generated captions and the ground truth.
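
As an illustration of that pipeline, here is a hedged sketch that captions the image with BLIP and scores the caption against the prompt with CLIP text-text cosine similarity; the Hugging Face checkpoints named here are assumptions and not necessarily the ones behind Table 2.

```python
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPTokenizer)

# Hedged sketch of B-CLIP: BLIP caption -> CLIP text-text similarity.
# Checkpoint names and file paths are assumptions.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a green bench and a red car"       # ground-truth prompt
image = Image.open("example_output.png")     # hypothetical generated image

# 1) Generate a caption with BLIP.
caption_ids = blip.generate(**blip_proc(images=image, return_tensors="pt"),
                            max_new_tokens=30)
caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

# 2) CLIP text-text cosine similarity between caption and prompt.
tokens = clip_tok([caption, prompt], padding=True, return_tensors="pt")
with torch.no_grad():
    emb = clip.get_text_features(**tokens)
emb = emb / emb.norm(dim=-1, keepdim=True)
print(caption, (emb[0] @ emb[1]).item())
```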

For B-VQA-n, as explained in the paper, "BLIP-VQA-naive (denoted as B-VQA-n) applies BLIP VQA to ask a single question (e.g., 'a green bench and a red car?') with the whole prompt." All you need to do in BLIPvqa_eval/BLIP_vqa.py is (1) set np_num=1 and (2) replace L#32-37 with image_dict['question'] = f'{f}?'.
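
Conceptually, the change turns the per-noun-phrase questions into one question built from the whole prompt. Below is a hedged illustration with the standard Hugging Face BLIP-VQA checkpoint; the repo's BLIP_vqa.py has its own loading and scoring code, so treat this only as a sketch of the idea.

```python
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

# Hedged illustration of B-VQA-n: ask a single question built from the whole
# prompt instead of one question per disentangled noun phrase. The checkpoint
# and file path are assumptions; BLIPvqa_eval/BLIP_vqa.py scores differently.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

prompt = "a green bench and a red car"
question = f"{prompt}?"                      # the whole prompt as one question
image = Image.open("example_output.png")     # hypothetical generated image

inputs = processor(images=image, text=question, return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```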

Hope this helps!

AshkanTaghipour commented 6 months ago

Thank you very much. Is it possible to add the spatial checkpoints for GORS to the repo? Thanks!

Karine-Huang commented 6 months ago

The checkpoints are added in GORS_finetune/checkpoint. Thank you!

Chao0511 commented 5 months ago

Hello, thank you for releasing the code. About your finetuning process: (1) Why did you choose 5e-6 as the text encoder's learning rate? Does 5e-5 not work, for example? (2) Did you find the denoising loss gradually growing, or did it remain quite stable? (3) Why didn't you consider an unconditional denoising loss, e.g., randomly dropping 10% of text prompts to improve classifier-free guidance? Many thanks!

Karine-Huang commented 5 months ago

Thanks for your questions! For (1): We chose 5e-6 as the text encoder's learning rate because it is the default value in the training script available at this link. For (2): The denoising loss exhibited some oscillations, but overall it trended downwards. For (3): We did not consider an unconditional denoising loss, such as randomly dropping 10% of text prompts to enhance classifier-free guidance. While this approach may improve performance, our focus was primarily on demonstrating the positive effect of the reweighting method on alignment, so we did not experiment with it.
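
For completeness, the prompt-dropping idea raised in question (3) typically looks like the snippet below inside a fine-tuning loop; it is not part of GORS, and the 10% rate and empty-string unconditional prompt are the asker's suggestion, not settings from this repo.

```python
import random

def maybe_drop_prompt(prompt: str, drop_prob: float = 0.1) -> str:
    """Classifier-free guidance trick mentioned in the question (not used for GORS):
    with probability drop_prob, replace the caption with an empty string so the
    model also learns an unconditional denoising path."""
    return "" if random.random() < drop_prob else prompt

# Inside a (hypothetical) training loop:
# captions = [maybe_drop_prompt(c, drop_prob=0.1) for c in batch["captions"]]
# text_inputs = tokenizer(captions, padding="max_length", return_tensors="pt")
# encoder_hidden_states = text_encoder(text_inputs.input_ids)[0]
```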

Chao0511 commented 5 months ago

Thank you very much for your reply, it is very clear. :)