HaozheZhao / MIC

MMICL, a state-of-the-art VLM with the in context learning ability from ICL, PKU

Winoground Eval #19

Closed essamsleiman closed 1 year ago

essamsleiman commented 1 year ago

Hi thanks a lot for this incredible work! I had one question - what did the Winoground evaluation prompt look like? I'd really appreciate it if you can point me to this.

Thanks.

HaozheZhao commented 1 year ago

Hi. Regarding the prompt used in the Winoground evaluation: the original Winoground metric needs a similarity score for each image-text pair. It assumes a CLIP-style model, for which the image/text pair score is easy to retrieve. For our autoregressive model, we devised the following prompt: "The image 0 is {visual prompt}.\n The image 1 is {visual prompt}.\n The caption 0 is {caption xxx}.\n The caption 1 is {caption xxx}.\n Use the images and captions to answer the following question. Is the caption 1 matches the image 1?Yes or No. " We extract the logit of the word "Yes" from the output and use it as the score for the "caption 1 & image 1" pair. The question "Is the caption 1 matches the image 1?" is instantiated for each of the four image-caption combinations, and the VLM is run once per combination, yielding four scores.

Note that we randomly shuffle the two images and the two captions of each Winoground example before feeding them into the model. This prevents the model from exploiting a positional shortcut.
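For readers who want to reproduce this, the prompt construction and shuffling described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: `build_queries`, the `TEMPLATE` constant, and the `<image0>`/`<image1>` placeholder strings are hypothetical names, and the way the shuffle is tracked (mapping each prompt back to the original image and caption indices) is one possible choice.

```python
import random

# Template taken from the prompt quoted above; only the indices in the
# final question change between the four queries.
TEMPLATE = (
    "The image 0 is {img0}.\n The image 1 is {img1}.\n"
    " The caption 0 is {cap0}.\n The caption 1 is {cap1}.\n"
    " Use the images and captions to answer the following question."
    " Is the caption {c} matches the image {i}?Yes or No. "
)


def build_queries(captions, visual_prompts, rng):
    """Return {(orig_image_idx, orig_caption_idx): prompt} for all four pairings.

    `visual_prompts` holds placeholder strings standing in for the image
    tokens (hypothetical; the real model inserts visual embeddings).
    """
    img_order = [0, 1]
    cap_order = [0, 1]
    rng.shuffle(img_order)  # img_order[slot] = original image shown in that slot
    rng.shuffle(cap_order)  # cap_order[slot] = original caption shown in that slot
    context = {
        "img0": visual_prompts[img_order[0]],
        "img1": visual_prompts[img_order[1]],
        "cap0": captions[cap_order[0]],
        "cap1": captions[cap_order[1]],
    }
    queries = {}
    for i_slot in (0, 1):
        for c_slot in (0, 1):
            # Key by the ORIGINAL indices so scores can be un-shuffled later.
            key = (img_order[i_slot], cap_order[c_slot])
            queries[key] = TEMPLATE.format(i=i_slot, c=c_slot, **context)
    return queries


queries = build_queries(
    ["a dog chases a cat", "a cat chases a dog"],
    ["<image0>", "<image1>"],
    random.Random(0),
)
```

Keying the result by the original indices means the score for, say, `(0, 1)` always refers to image 0 with caption 1, regardless of the slot each one landed in after shuffling.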

essamsleiman commented 1 year ago

Thank you very much for your response! Could you also point me to the code where you calculate the logit probability of "Yes"? Thanks!

HaozheZhao commented 1 year ago

We are still finishing the evaluation code, so it has not been uploaded yet. Here is the eval code for Winoground. For each instance, we build the four image-caption queries as described above, run all four through the model, and take the logit of the "Yes" token at the first generated position as the image-caption score.

    # Shared generation settings: return the per-step scores with the output.
    gen_kwargs = dict(max_length=5, output_scores=True,
                      return_dict_in_generate=True)

    # The four queries of an instance are stacked along dim 1 of the text inputs.
    i0c1 = model.generate(pixel_values=inputs['pixel_values'],
                          input_ids=inputs['input_ids'][:, 0],
                          attention_mask=inputs['attention_mask'][:, 0],
                          img_mask=inputs['img_mask'], **gen_kwargs)
    i0c0 = model.generate(pixel_values=inputs['pixel_values'],
                          input_ids=inputs['input_ids'][:, 1],
                          attention_mask=inputs['attention_mask'][:, 1],
                          img_mask=inputs['img_mask'], **gen_kwargs)
    i1c1 = model.generate(pixel_values=inputs['pixel_values'],
                          input_ids=inputs['input_ids'][:, 2],
                          attention_mask=inputs['attention_mask'][:, 2],
                          img_mask=inputs['img_mask'], **gen_kwargs)
    i1c0 = model.generate(pixel_values=inputs['pixel_values'],
                          input_ids=inputs['input_ids'][:, 3],
                          attention_mask=inputs['attention_mask'][:, 3],
                          img_mask=inputs['img_mask'], **gen_kwargs)

    # scores[0] holds the scores of the first generated token;
    # index 2163 is the vocabulary id used here for the "Yes" token.
    i0c1 = i0c1.scores[0][:, 2163]
    i0c0 = i0c0.scores[0][:, 2163]
    i1c1 = i1c1.scores[0][:, 2163]
    i1c0 = i1c0.scores[0][:, 2163]

Finally, we compute the Winoground metrics from these scores:

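    # img0c0, img0c1, img1c0, img1c1: "Yes" scores collected over all instances.
    # Text score: for each image, the matching caption scores higher.
    acc = ((img0c0 > img0c1) & (img1c1 > img1c0)).sum() / len(img0c0)
    # Image score: for each caption, the matching image scores higher.
    img_acc = ((img0c0 > img1c0) & (img1c1 > img0c1)).sum() / len(img0c0)
    # Group score: both the text and the image conditions hold.
    group_acc = ((img0c0 > img0c1) & (img1c1 > img1c0)
                 & (img0c0 > img1c0) & (img1c1 > img0c1)).sum() / len(img0c0)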
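To make the three comparisons concrete, here is a small self-contained sketch in plain Python that applies the same conditions to hand-written scores. The function name `winoground_metrics`, the dictionary layout, and the numbers are made up for illustration; they are not results from the model.

```python
def winoground_metrics(examples):
    """Compute text, image, and group scores from per-example 'Yes' scores.

    Each example is a dict with keys 'i0c0', 'i0c1', 'i1c0', 'i1c1',
    where 'iXcY' is the score for image X paired with caption Y.
    """
    n = len(examples)
    text = image = group = 0
    for s in examples:
        # Text: given an image, the matching caption must win.
        text_ok = s["i0c0"] > s["i0c1"] and s["i1c1"] > s["i1c0"]
        # Image: given a caption, the matching image must win.
        image_ok = s["i0c0"] > s["i1c0"] and s["i1c1"] > s["i0c1"]
        text += text_ok
        image += image_ok
        group += text_ok and image_ok
    return {"text": text / n, "image": image / n, "group": group / n}


# Two hypothetical examples: the first passes every check, the second
# fails the image check because image 1 scores higher for caption 0.
examples = [
    {"i0c0": 2.0, "i0c1": 1.0, "i1c0": 0.5, "i1c1": 3.0},
    {"i0c0": 2.0, "i0c1": 1.0, "i1c0": 2.5, "i1c1": 3.0},
]
print(winoground_metrics(examples))
# {'text': 1.0, 'image': 0.5, 'group': 0.5}
```

The group score can never exceed either of the other two, since it requires both conditions to hold on the same example.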