Open AshkanTaghipour opened 11 months ago

Hi, thank you for your interesting work and in-depth analysis. I would like to ask about the possibility of releasing the code of the compared methods in Table 2.
Hi! The compared methods are all from their official repos, and their usage is explained in T2I-CompBench. The CLIP code can be found in CLIP_similarity, with the argument "--complex" set to False for Table 2. B-CLIP uses BLIP to generate captions and then CLIP to calculate the score. B-VQA-n uses the direct prompt as the question without disentangling it. The models are also from the official repos, with the same generator as in inference_eval.
Thank you! For the BLIP-VQA evaluation, would the final accuracy be the average of the answers in the JSON file named "vqa_result.json" in the "examples/annotation_blip/" directory?
Yes, BLIP-VQA score is the average of the results from "vqa_result.json". We have updated BLIPvqa_eval/BLIP_vqa.py in L#117-120 to complete the calculation of the average. Thank you!
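In case it helps, the averaging itself is straightforward. A minimal sketch (not the exact code in BLIP_vqa.py), assuming each entry in vqa_result.json stores its score in the "answer" field:

```python
import json
import os

def average_blip_vqa_score(result_dir: str) -> float:
    """Average the per-question scores in vqa_result.json.

    Assumes each entry looks like {"question_id": ..., "answer": "<score>"},
    with the score stored as a number or numeric string.
    """
    with open(os.path.join(result_dir, "vqa_result.json")) as f:
        results = json.load(f)
    scores = [float(r["answer"]) for r in results]
    return sum(scores) / len(scores)

# e.g. the directory mentioned above
print(average_blip_vqa_score("examples/annotation_blip"))
```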
Hello! And early Happy New Year! As you said, I can see the B-VQA, CLIP, UniDet, and MiniGPT-cot evaluations in your project, but B-CLIP and B-VQA-n are not included, right? I just wanted to check this. Thank you for this work, it helps a lot!
Thank you for your response. For the UniDet evaluation, there is one file after evaluation called vqa_result.json; however, for BLIP there are additional files such as color_test.json that provide a one-to-one mapping between each image and its result. Is it possible to add something similar for UniDet as well? Attached is the vqa_result.json for the UniDet evaluation; as you know, analysing such a result is hard when working with several generated images per prompt (several seeds).
Hello! The mapping file is added in UniDet_eval/determine_position_for_eval.py, and it will be saved as mapping.json in the same directory as vqa_result.json. Thank you!
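If it helps with the several-seeds-per-prompt case, here is a small sketch (not part of the repo) of joining the two files, assuming mapping.json maps each question_id to its image filename; adjust the keys to match what the files actually contain:

```python
import json
import os
from collections import defaultdict

result_dir = "path/to/unidet_eval_output"  # wherever vqa_result.json and mapping.json were saved

with open(os.path.join(result_dir, "vqa_result.json")) as f:
    results = json.load(f)   # e.g. [{"question_id": 0, "answer": "0.5"}, ...]
with open(os.path.join(result_dir, "mapping.json")) as f:
    mapping = json.load(f)   # assumed: {"0": "<image filename>", ...}

# Group scores by image so different seeds of the same prompt stay separate.
per_image = defaultdict(list)
for r in results:
    per_image[mapping[str(r["question_id"])]].append(float(r["answer"]))

for image_name, scores in sorted(per_image.items()):
    print(image_name, sum(scores) / len(scores))
```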
Hello! For B-CLIP, we use the same approach as in Attend-and-Excite. You can use the official BLIP repo to generate captions and then use CLIP text-text similarity to calculate the similarity score between the generated captions and the ground truth.

For B-VQA-n, as explained in the paper, "BLIP-VQA-naive (denoted as B-VQA-n) applies BLIP VQA to ask a single question (e.g., a green bench and a red car?) with the whole prompt". All you need to do in BLIPvqa_eval/BLIP_vqa.py is (1) set np_num=1 and (2) replace L#32-37 with image_dict['question']=f'{f}?'.

Hope this helps!
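If a starting point for B-CLIP is useful, here is a rough, self-contained sketch of the idea (BLIP caption, then CLIP text-text similarity) using the Hugging Face ports of the models; it is only an illustration, not the script behind the paper's numbers:

```python
import torch
from PIL import Image
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          CLIPModel, CLIPTokenizer)

device = "cuda" if torch.cuda.is_available() else "cpu"

# BLIP captions the generated image.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)

# CLIP scores the caption against the ground-truth prompt (text-text similarity).
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

@torch.no_grad()
def b_clip_score(image_path: str, prompt: str) -> float:
    image = Image.open(image_path).convert("RGB")
    caption_ids = blip.generate(**blip_proc(image, return_tensors="pt").to(device))
    caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

    text_inputs = clip_tok([caption, prompt], padding=True, return_tensors="pt").to(device)
    feats = clip.get_text_features(**text_inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()  # cosine similarity, caption vs. prompt

print(b_clip_score("example.png", "a green bench and a red car"))
```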
Thank you very much! Is it possible to add the spatial checkpoints for GORS to the repo? Thanks!
The checkpoints are added in GORS_finetune/checkpoint. Thank you!
Hello, thank you for releasing the code. About your finetuning process: (1) Why did you choose 5e-6 as the text encoder's learning rate? Does 5e-5 not work, for example? (2) Did you find the denoising loss gradually growing, or did it remain quite stable? (3) Why didn't you consider an unconditional denoising loss, e.g., randomly dropping 10% of text prompts to improve classifier-free guidance? Many thanks!
Thanks for your questions! For (1): We chose 5e-6 as the text encoder's learning rate because it is the default value in the training script available at this link: link. For (2): The denoising loss exhibited some oscillations, but overall it trended downwards. For (3): We did not consider an unconditional denoising loss, such as randomly dropping 10% of text prompts to enhance classifier-free guidance. While this approach may improve performance, our focus was primarily on demonstrating the positive effect of the reweighting method on alignment, so we did not experiment with it.
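For completeness, the prompt-dropping trick in (3) is simple to add to a standard diffusers-style finetuning loop. A sketch of the idea only (the variable names in the comments are the usual Stable Diffusion components, named here for illustration, not taken from our code):

```python
import random

def drop_prompts(prompts, drop_prob=0.1):
    """Randomly replace a fraction of captions with the empty string so the model
    is also trained on the unconditional input used by classifier-free guidance."""
    return ["" if random.random() < drop_prob else p for p in prompts]

# Inside the training loop (sketch):
# prompts = drop_prompts(batch["captions"], drop_prob=0.1)
# text_inputs = tokenizer(prompts, padding="max_length",
#                         max_length=tokenizer.model_max_length,
#                         truncation=True, return_tensors="pt")
# encoder_hidden_states = text_encoder(text_inputs.input_ids.to(device))[0]
# noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
```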
Thank you very much for your reply, it is very clear. :)