DavidHuji / CapDec

CapDec: SOTA Zero Shot Image Captioning Using CLIP and GPT2, EMNLP 2022 (findings)
MIT License
185 stars 19 forks

About the variance of 0.016 #3

Open 232525 opened 2 years ago

232525 commented 2 years ago

Hi, authors. In your paper, you mentioned:

Specifically, we set $\epsilon$ to the mean $\ell_\infty$ norm of embedding differences between five captions that correspond to the same image. We estimated this based on captions of only 15 MS-COCO images.

I would like to know: have you released the code for this variance estimate?

DavidHuji commented 2 years ago

Hi, yes. We calculate it using the predictions_runner.py script with the flags --ablation_dist and --text_autoencoder (the latter specifies using the CLIP text encoder rather than an image encoder; see the method calc_distances_of_ready_embeddings in its main). This prints the average max norm over 900 samples, and it also saves the 900 values to a pickle file, so you can verify that the estimate can be made with only 15 samples with high confidence (i.e., the variance of the max norm across different sets is negligible).
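For intuition, the estimation described above can be sketched roughly as follows. This is a minimal illustration, not the repo's actual code: the function name and the random stand-ins for CLIP text embeddings are hypothetical, and the real script works on precomputed embedding files.

```python
# Hypothetical sketch: estimate epsilon as the mean l_inf norm of
# differences between text embeddings of captions of the same image.
import itertools
import numpy as np

def estimate_epsilon(embeddings_per_image):
    """embeddings_per_image: list of (num_captions, dim) arrays, one per image."""
    norms = []
    for embs in embeddings_per_image:
        # all caption pairs for this image
        for a, b in itertools.combinations(range(len(embs)), 2):
            norms.append(np.abs(embs[a] - embs[b]).max())  # l_inf norm
    return float(np.mean(norms))

# toy data: 15 images x 5 captions, random vectors in place of CLIP embeddings
rng = np.random.default_rng(0)
fake = [rng.normal(size=(5, 512)) for _ in range(15)]
eps = estimate_epsilon(fake)
```

With 5 captions per image this averages over 10 pairs per image, i.e. 150 pairwise distances for 15 images.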

232525 commented 2 years ago

Thanks for your reply. I computed the average on the val set, and the result seems normal. But something confuses me: you mention the $\ell_\infty$ norm in your paper, yet it seems the $\ell_1$ norm result is the one adopted. Is there anything I missed or got wrong?

DavidHuji commented 2 years ago

It prints a few different metrics, but the infinity norm is the one printed here (line 86).
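For reference, the two norms in question differ as follows (a minimal numpy illustration, unrelated to the repo's code):

```python
import numpy as np

# difference vector between two embeddings (toy values)
d = np.array([0.5, -0.2, 0.1])

linf = np.linalg.norm(d, ord=np.inf)  # max absolute entry
l1 = np.linalg.norm(d, ord=1)         # sum of absolute entries
```

Here `linf` is 0.5 while `l1` is 0.8, so confusing the two when reading the script's printout would give a noticeably larger epsilon.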