Closed essamsleiman closed 1 year ago
Hi,
Regarding the prompt used in the Winoground evaluation: the original Winoground metric needs a similarity score for each image-text pair in order to compute the final metric. It assumes a CLIP-style model, which makes it easy to retrieve the image/text pair score. For our autoregressive model, we came up with the following prompt:
"The image 0 is
Note that before feeding the two images and captions of a Winoground example into the model, we randomly shuffle them. This prevents the model from exploiting any potential shortcuts.
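As a rough illustration of the shuffling step, the sketch below randomizes the order of the two images and the two captions for one example and records the permutation, so that scores can later be mapped back to the original indices. The function name `shuffle_example`, the returned dictionary layout, and the choice to shuffle images and captions independently are all assumptions made for illustration, not the authors' code.

```python
import random

def shuffle_example(images, captions, rng=None):
    """Shuffle image and caption order, remembering the original indices.

    Hypothetical helper: the real evaluation code may shuffle differently.
    """
    rng = rng or random.Random()
    image_order = list(range(len(images)))
    caption_order = list(range(len(captions)))
    rng.shuffle(image_order)
    rng.shuffle(caption_order)
    return {
        "images": [images[i] for i in image_order],
        "captions": [captions[i] for i in caption_order],
        # image_order[k] is the original index of the image now at position k
        "image_order": image_order,
        "caption_order": caption_order,
    }

# Placeholder file names and captions, used only to exercise the helper.
images = ["img0.png", "img1.png"]
captions = ["caption 0", "caption 1"]
shuffled = shuffle_example(images, captions, random.Random(0))
```

Keeping `image_order` and `caption_order` around makes it possible to undo the shuffle once the per-pair scores have been collected.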
Thank you very much for your response! Can you also point me to the code where you calculate the logit probability of "Yes" in your code? Thanks!
We are still finishing the evaluation code, so it has not been uploaded yet. Here is the eval code for Winoground: for each instance, we divide it into four groups according to the method above, obtain the outputs of these four groups at once, and record the logit of the "Yes" token at the first generated position as the image-caption score.
# Each call scores one image-caption pairing; the index on the second axis
# of input_ids / attention_mask selects the prompt for that pairing.
i0c1 = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'][:, 0],
    attention_mask=inputs['attention_mask'][:, 0],
    img_mask=inputs['img_mask'],
    max_length=5, output_scores=True, return_dict_in_generate=True)
i0c0 = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'][:, 1],
    attention_mask=inputs['attention_mask'][:, 1],
    img_mask=inputs['img_mask'],
    max_length=5, output_scores=True, return_dict_in_generate=True)
i1c1 = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'][:, 2],
    attention_mask=inputs['attention_mask'][:, 2],
    img_mask=inputs['img_mask'],
    max_length=5, output_scores=True, return_dict_in_generate=True)
i1c0 = model.generate(
    pixel_values=inputs['pixel_values'],
    input_ids=inputs['input_ids'][:, 3],
    attention_mask=inputs['attention_mask'][:, 3],
    img_mask=inputs['img_mask'],
    max_length=5, output_scores=True, return_dict_in_generate=True)
# scores[0] holds the logits for the first generated token;
# 2163 is the vocabulary index used here for the "Yes" token.
i0c1 = i0c1.scores[0][:, 2163]
i0c0 = i0c0.scores[0][:, 2163]
i1c1 = i1c1.scores[0][:, 2163]
i1c0 = i1c0.scores[0][:, 2163]
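Since the four calls differ only in which image-caption pairing they score, the same pattern can be sketched as a loop over index pairs. This is a minimal illustration rather than the authors' implementation: `score_all_pairs` and `toy_score` are hypothetical names, and `score_fn` stands in for the model call plus the "Yes"-logit lookup.

```python
from itertools import product

def score_all_pairs(images, captions, score_fn):
    """Return {(image_idx, caption_idx): score} for every image-caption pairing.

    score_fn is a hypothetical callable (image, caption) -> float that would
    wrap the generate call and the "Yes"-logit lookup in a real evaluation.
    """
    return {
        (i, c): score_fn(images[i], captions[c])
        for i, c in product(range(len(images)), range(len(captions)))
    }

def toy_score(image, caption):
    # Toy stand-in scorer: 1.0 when the placeholder strings match, else 0.0.
    return 1.0 if image == caption else 0.0

pair_scores = score_all_pairs(["a", "b"], ["a", "b"], toy_score)
```

With two images and two captions this yields exactly the four scores (image 0 with caption 0, image 0 with caption 1, and so on) that the metric needs.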
Finally, compute the Winoground metrics from these scores:
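# img0c0, img0c1, img1c0, img1c1 presumably hold the i0c0, i0c1, i1c0, i1c1
# scores collected over all examples.
# Text score: for each image, the matching caption scores higher.
acc = ((img0c0 > img0c1) & (img1c1 > img1c0)).sum() / len(img0c0)
# Image score: for each caption, the matching image scores higher.
img_acc = ((img0c0 > img1c0) & (img1c1 > img0c1)).sum() / len(img0c0)
# Group score: both conditions hold for the same example.
group_acc = ((img0c0 > img0c1) & (img1c1 > img1c0) & (img0c0 > img1c0) & (img1c1 > img0c1)).sum() / len(img0c0)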
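For readers who want to check the metric logic without the model, here is a small self-contained sketch of the same three comparisons on plain Python lists. The function name `winoground_scores` and the toy numbers are assumptions for illustration; `text`, `image`, and `group` correspond to `acc`, `img_acc`, and `group_acc` above.

```python
def winoground_scores(i0c0, i0c1, i1c0, i1c1):
    """Compute text, image, and group scores from per-example pair scores.

    iXcY[k] is the score of image X paired with caption Y for example k.
    """
    n = len(i0c0)
    text_hits = image_hits = group_hits = 0
    for s00, s01, s10, s11 in zip(i0c0, i0c1, i1c0, i1c1):
        # Text: given an image, the matching caption must score higher.
        text_ok = s00 > s01 and s11 > s10
        # Image: given a caption, the matching image must score higher.
        image_ok = s00 > s10 and s11 > s01
        text_hits += text_ok
        image_hits += image_ok
        group_hits += text_ok and image_ok
    return {"text": text_hits / n, "image": image_hits / n, "group": group_hits / n}

# Two toy examples: the first passes every comparison; the second fails
# the image comparison because i1c0 (3.0) exceeds i0c0 (2.0).
result = winoground_scores(
    i0c0=[2.0, 2.0],
    i0c1=[1.0, 1.0],
    i1c0=[0.0, 3.0],
    i1c1=[3.0, 4.0],
)
```

On these toy inputs the text score is 1.0 while the image and group scores are both 0.5, which shows that the group score can never exceed either of the other two.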
Hi, thanks a lot for this incredible work! I had one question: what did the Winoground evaluation prompt look like? I'd really appreciate it if you could point me to it.
Thanks.