This was the code used for the baseline CLIP:
```python
import torch.nn as nn
from transformers import CLIPModel


class CLIP_base(nn.Module):
    def __init__(self, args):
        super().__init__()
        # Frozen, pretrained CLIP used purely as a scoring function.
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.model.eval()
        for p in self.model.parameters():
            p.requires_grad = False

    def forward(self, input_ids, attention_mask, pixel_values):
        out = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            output_hidden_states=True,
        )
        # Image-to-text similarity logits (already scaled by CLIP's logit_scale).
        logits_per_image = out.logits_per_image
        return {"scores": logits_per_image}
```
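For reference, a batch can be prepared and scored like this. This is only a minimal usage sketch: the image path, captions, and the use of `CLIPProcessor` here are placeholders, not necessarily the repo's actual data pipeline.

```python
# Hypothetical usage sketch -- the repo's dataloader/preprocessing may differ.
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                   # placeholder image
captions = ["a red cube", "a blue sphere"]          # placeholder candidate texts
batch = processor(text=captions, images=image, return_tensors="pt", padding=True)

model = CLIP_base(args=None)                        # args is unused in the class above
scores = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    pixel_values=batch["pixel_values"],
)["scores"]                                         # shape: (num_images, num_texts)
```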
Make sure you are evaluating on the hard list. See scripts/eval.py.
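For what it's worth, a rough sketch of restricting evaluation to the hard subset before computing mAP is below. The hard-list contents, sample ids, and prediction layout are assumptions for illustration only; scripts/eval.py is the authoritative version.

```python
# Illustrative only -- see scripts/eval.py for the repo's actual evaluation logic.
import numpy as np
from sklearn.metrics import average_precision_score

# Assumed layout: predictions maps a sample id to (per-candidate scores, multi-hot labels).
predictions = {
    "img_001": ([0.9, 0.1, 0.4], [1, 0, 0]),   # dummy entries for illustration
    "img_002": ([0.2, 0.8, 0.3], [0, 1, 0]),
}
hard_ids = {"img_001"}                          # in practice, load the repo's hard list

ap_per_sample = [
    average_precision_score(np.asarray(labels), np.asarray(scores))
    for sample_id, (scores, labels) in predictions.items()
    if sample_id in hard_ids                    # evaluate only on the hard subset
]
print("mAP (hard):", float(np.mean(ap_per_sample)))
```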
I am trying to reproduce the baseline CLIP results for the single-object GQA setting, but I am getting a much lower mAP of 0.18, which does not match the paper's numbers. I am using the pooled outputs of CLIP's text and image encoders; does this match the paper's implementation? The checkpoint is openai/clip-vit-base-patch32.
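In case it helps narrow down the discrepancy, here is a minimal sketch (assuming the Hugging Face transformers CLIPModel API) contrasting the raw pooled encoder outputs with what the `logits_per_image` used in the baseline above is actually built from: the pooled outputs are first passed through the projection heads, L2-normalized, and scaled by `logit_scale`. Scoring with raw pooled outputs skips those steps.

```python
# Sketch (Hugging Face transformers CLIPModel). Shows that logits_per_image comes
# from the *projected* and L2-normalized embeddings, not the raw pooled outputs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                 # dummy image, just to run the code
texts = ["a photo of a dog", "a photo of a cat"]
batch = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**batch)

    # Raw pooled outputs: different dimensions per encoder, no projection applied.
    text_pooled = out.text_model_output.pooler_output      # (num_texts, text hidden size)
    image_pooled = out.vision_model_output.pooler_output   # (num_images, vision hidden size)

    # What logits_per_image uses: projected embeddings, L2-normalized, scaled by
    # logit_scale. (Recent transformers versions already return unit-norm embeds,
    # so normalizing again is harmless.)
    text_embeds = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_embeds = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    manual_logits = model.logit_scale.exp() * image_embeds @ text_embeds.t()

    print(torch.allclose(manual_logits, out.logits_per_image, atol=1e-4))  # True
```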