RitaRamo / smallcap

SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

About data leakage when using COCO captions as datastore #7

David-Zeng-Zijian closed 1 year ago

David-Zeng-Zijian commented 1 year ago

Thanks for your excellent work. I wonder about inference on COCO when COCO captions are used as the datastore for prompting: the prompt may include the ground-truth captions of the query image, which could cause data leakage. Have you adopted any strategies to avoid this problem?

YovaKem commented 1 year ago

Hi! We made sure to avoid leakage by (a) populating the datastore with training captions only, which means there is no risk of leakage when doing inference on validation and test samples (see here), and (b) filtering out the ground-truth captions of each training sample from its retrieved captions at training time, so the model doesn't learn to just copy a retrieved caption (see here).
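For readers who want to see the two safeguards concretely, here is a minimal sketch in plain Python. It is not the repository's actual code: the function names `build_datastore` and `filter_retrieved`, the `(image_id, caption)` pair layout, and the toy data are all assumptions made for illustration, and a plain list stands in for a real nearest-neighbour index.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: each datastore entry is an (image_id, caption) pair.
Entry = Tuple[int, str]


def build_datastore(captions_by_image: Dict[int, List[str]],
                    train_ids: List[int]) -> List[Entry]:
    """Safeguard (a): only captions of training images enter the datastore."""
    return [(img_id, cap)
            for img_id in train_ids
            for cap in captions_by_image[img_id]]


def filter_retrieved(retrieved: List[Entry],
                     query_image_id: int,
                     k: int) -> List[str]:
    """Safeguard (b): drop captions that belong to the query image itself,
    then keep the first k of what remains (input is assumed to be ranked)."""
    kept = [cap for img_id, cap in retrieved if img_id != query_image_id]
    return kept[:k]


# Toy data: images 1 and 2 are training images, image 3 is a validation image.
captions_by_image = {
    1: ["a dog", "a brown dog"],
    2: ["a cat"],
    3: ["a bird"],
}
datastore = build_datastore(captions_by_image, train_ids=[1, 2])

# Training-time query for image 1: its own captions are removed.
train_prompt_caps = filter_retrieved(datastore, query_image_id=1, k=2)

# Validation query for image 3: nothing to remove, since its captions
# were never added to the datastore in the first place.
val_prompt_caps = filter_retrieved(datastore, query_image_id=3, k=2)
```

In this sketch the filter runs after retrieval, so a real pipeline would likely need to retrieve more than k candidates to still have k captions left once the query image's own captions are dropped. Whether and how the repository over-retrieves is not stated in this thread.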