arijitray1993 / COLA

COLA: Evaluate how well your vision-language model can Compose Objects Localized with Attributes!

Baseline CLIP Model #4

Open vishwa27yvs opened 3 hours ago

vishwa27yvs commented 3 hours ago

I am trying to reproduce the baseline CLIP results for the single-object GQA setting, but I am getting a much lower mAP of 0.18, which does not match the paper's numbers. I am using the pooled outputs of CLIP's text and image encoders with openai/clip-vit-base-patch32. Does this match the paper's implementation?

arijitray1993 commented 3 hours ago

This was the code used for the baseline CLIP:

import torch.nn as nn
from transformers import CLIPModel


class CLIP_base(nn.Module):
    """Frozen CLIP baseline: returns image-text similarity logits as scores."""

    def __init__(self, args):
        super().__init__()
        # Pretrained CLIP, kept frozen for zero-shot evaluation.
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.model.eval()
        for p in self.model.parameters():
            p.requires_grad = False

    def forward(
        self,
        input_ids,
        attention_mask,
        pixel_values,
    ):
        out = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            output_hidden_states=True,
        )

        # CLIP's image-text similarity logits, shape (num_images, num_texts).
        logits_per_image = out.logits_per_image

        return {"scores": logits_per_image}

Make sure you are evaluating on the hard list. See scripts/eval.py.
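For reference, here is a minimal sketch of how that module can be driven with the HuggingFace CLIPProcessor. The image path, the candidate captions, and passing args=None are placeholders; the actual batching and mAP computation live in scripts/eval.py.

import torch
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIP_base(args=None)  # args is unused in __init__

# Hypothetical inputs: one image scored against two candidate captions.
image = Image.open("example.jpg")  # placeholder path
captions = ["a red cube behind a blue sphere", "a blue cube behind a red sphere"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        pixel_values=inputs["pixel_values"],
    )

# out["scores"] has shape (num_images, num_captions); higher means a better match.
print(out["scores"])

Note that logits_per_image is CLIP's scaled cosine similarity between the projected image and text embeddings, which is not the same as comparing the encoders' raw pooled outputs, so the two approaches can give different rankings.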