This was the code used for the baseline CLIP:
```python
import torch.nn as nn
from transformers import CLIPModel


class CLIP_base(nn.Module):
    def __init__(self, args):
        super().__init__()
        # Frozen, pretrained CLIP used purely as a scoring function.
        self.model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        self.model.eval()
        for p in self.model.parameters():
            p.requires_grad = False

    def forward(self, input_ids, attention_mask, pixel_values):
        out = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,
            output_hidden_states=True,
        )
        # Image-to-text similarity logits (already scaled by CLIP's logit_scale).
        logits_per_image = out.logits_per_image
        return {"scores": logits_per_image}
```
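For reference, a batch can be prepared and scored like this. This is only a minimal usage sketch: the image path, captions, and the use of `CLIPProcessor` here are placeholders, not necessarily the repo's actual data pipeline.

```python
# Hypothetical usage sketch -- the repo's dataloader/preprocessing may differ.
from PIL import Image
from transformers import CLIPProcessor

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")                   # placeholder image
captions = ["a red cube", "a blue sphere"]          # placeholder candidate texts
batch = processor(text=captions, images=image, return_tensors="pt", padding=True)

model = CLIP_base(args=None)                        # args is unused in the class above
scores = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    pixel_values=batch["pixel_values"],
)["scores"]                                         # shape: (num_images, num_texts)
```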
Make sure you are evaluating on the hard list. See scripts/eval.py.
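For what it's worth, a rough sketch of restricting evaluation to the hard subset before computing mAP is below. The hard-list contents, sample ids, and prediction layout are assumptions for illustration only; scripts/eval.py is the authoritative version.

```python
# Illustrative only -- see scripts/eval.py for the repo's actual evaluation logic.
import numpy as np
from sklearn.metrics import average_precision_score

# Assumed layout: predictions maps a sample id to (per-candidate scores, multi-hot labels).
predictions = {
    "img_001": ([0.9, 0.1, 0.4], [1, 0, 0]),   # dummy entries for illustration
    "img_002": ([0.2, 0.8, 0.3], [0, 1, 0]),
}
hard_ids = {"img_001"}                          # in practice, load the repo's hard list

ap_per_sample = [
    average_precision_score(np.asarray(labels), np.asarray(scores))
    for sample_id, (scores, labels) in predictions.items()
    if sample_id in hard_ids                    # evaluate only on the hard subset
]
print("mAP (hard):", float(np.mean(ap_per_sample)))
```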
I am trying to reproduce the baseline CLIP results for the single-object GQA setting, but I am getting a much lower mAP of 0.18, which does not match the paper's numbers. I am using the pooled outputs of CLIP's text and image encoders; does this match the paper's implementation? The checkpoint is openai/clip-vit-base-patch32.
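In case it helps narrow down the discrepancy, here is a minimal sketch (assuming the Hugging Face transformers CLIPModel API) contrasting the raw pooled encoder outputs with what the `logits_per_image` used in the baseline above is actually built from: the pooled outputs are first passed through the projection heads, L2-normalized, and scaled by `logit_scale`. Scoring with raw pooled outputs skips those steps.

```python
# Sketch (Hugging Face transformers CLIPModel). Shows that logits_per_image comes
# from the *projected* and L2-normalized embeddings, not the raw pooled outputs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))                 # dummy image, just to run the code
texts = ["a photo of a dog", "a photo of a cat"]
batch = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**batch)

    # Raw pooled outputs: different dimensions per encoder, no projection applied.
    text_pooled = out.text_model_output.pooler_output      # (num_texts, text hidden size)
    image_pooled = out.vision_model_output.pooler_output   # (num_images, vision hidden size)

    # What logits_per_image uses: projected embeddings, L2-normalized, scaled by
    # logit_scale. (Recent transformers versions already return unit-norm embeds,
    # so normalizing again is harmless.)
    text_embeds = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_embeds = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    manual_logits = model.logit_scale.exp() * image_embeds @ text_embeds.t()

    print(torch.allclose(manual_logits, out.logits_per_image, atol=1e-4))  # True
```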