dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

Flickr30k Image and Text Retrieval - Query regarding training #60

Open gchhablani opened 2 years ago

gchhablani commented 2 years ago

In this line the answer is being initialized to zeros and never changed. I am not able to understand how this helps with both positive and negative examples.

Can someone please clarify how to use the output logit to perform a pseudo-classification task, i.e. deciding whether an image and a text match or not, using the Flickr30k checkpoint?

mactavish91 commented 1 year ago

@gchhablani Hi, bro, I'm also confused about this. Have you figured out why?

DataminingdidiYR commented 1 year ago

@gchhablani Hi, bro, I'm also confused about this. Have you figured out why?
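
One possible reading of the zero-initialized answer, offered as a sketch rather than a statement of what the repository does: assume the line in question builds the target for a cross-entropy ranking loss, and that each row of scores lists the matching image-text pair first, followed by sampled negatives. Under that assumption an all-zero target is a class index ("the correct candidate is at position 0"), not a label saying every pair is negative, so the positive and the negatives are both used in every row. The function names below (`cross_entropy_row`, `ranking_loss`, `best_match`) are hypothetical and written in plain Python so the arithmetic is visible; they are not ViLT's API.

```python
import math


def cross_entropy_row(scores, target_index):
    """Softmax cross-entropy of one row of scores against an integer class index."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target_index]


def ranking_loss(score_rows):
    """Mean loss over rows, ASSUMING the positive pair is always at index 0.

    The all-zero target list plays the role of the zero-initialized answer:
    it never needs to change because the positive's position is fixed.
    """
    targets = [0] * len(score_rows)
    losses = [cross_entropy_row(row, t) for row, t in zip(score_rows, targets)]
    return sum(losses) / len(losses)


def best_match(scores):
    """Index of the highest-scoring candidate among those compared."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Row layout (assumed): [positive, negative, negative]
well_ranked = [[5.0, 0.0, 0.0]]   # positive scores highest -> small loss
badly_ranked = [[0.0, 5.0, 0.0]]  # a negative scores highest -> large loss
print(ranking_loss(well_ranked), ranking_loss(badly_ranked))
print(best_match([0.1, 2.0, -1.0]))
```

If this reading is right, the logit is a relative score: it is most meaningful when comparing several candidate texts for one image (or several images for one text), as `best_match` does. Turning it into an absolute match / no-match decision would need a threshold chosen on held-out data, which is an assumption on my part and not something confirmed in this thread.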