CannyLab / vdtk

Visual Description Dataset Analysis Toolkit
MIT License

More diverse formats of media inputs #8

Closed xk-huang closed 1 year ago

xk-huang commented 1 year ago

Hi David! Thanks for the wonderful library for VL evaluation!

Here is a potential improvement I would like to discuss with you: expanding the supported formats of media inputs. Presently, in clip_recall.py, images are loaded from a local path. However, there are use cases where images are not stored as individual files but packed into a large tar archive, so it could be useful to support binary inputs. If so, there are multiple design choices here, e.g., storing the binary string in "media_path", or adding an additional field called "media_binary", etc. I chose the latter so far: https://github.com/xk-huang/vdtk/commit/afbd47ab7baa5b7f893a717191daa2694c3d96ab I would certainly like to hear your ideas about it.
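As a rough sketch of what the loading side of that design could look like (the field names and base64 encoding here are illustrative assumptions, not vdtk's actual schema):

```python
import base64


def load_media_bytes(sample: dict) -> bytes:
    """Return raw image bytes, preferring inline binary data over a file path.

    Hypothetical record layout: "media_binary" holds base64-encoded bytes
    (so the record stays JSON-serializable); otherwise fall back to
    "media_path" on the local filesystem.
    """
    if "media_binary" in sample:
        return base64.b64decode(sample["media_binary"])
    with open(sample["media_path"], "rb") as f:
        return f.read()


# Demo with fake in-memory "image" bytes rather than a real file:
fake_jpeg = b"\xff\xd8\xff\xe0fake-image-bytes"
sample = {"media_binary": base64.b64encode(fake_jpeg).decode("ascii")}
assert load_media_bytes(sample) == fake_jpeg
```

The advantage of a separate field is that existing path-based records keep working unchanged.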

Besides, I cannot pull all the files stored with git-lfs due to insufficient quota on the repo, so the tests cannot be run thoroughly.

Thanks in advance!

xk-huang commented 1 year ago

Hope you had a wonderful Fourth of July! I have some additional questions about using CLIP as a metric in your IC3 paper.

  1. Why not directly compare the CLIP similarity scores between image-candidate pairs across all samples?
  2. If there is only one reference (and perhaps one candidate caption), would MRR still be suitable for comparison?

The paper is exceptionally well-written, and the experimental design and results are quite solid. I am excited to learn more about metric design in the field from your paper!

DavidMChan commented 1 year ago

Thanks! In answer to your questions:

Why not directly compare the CLIP similarity scores between image-candidate pairs across all samples?

We can do this - this is called CLIP score, and it measures a per-sample quality estimate of the model. What it doesn't tell you, however, is how discriminative the captions are, i.e. how unique the captions are within the dataset. For example, you could have two similar images, both with the caption "a man is skiing", which might have high CLIP score, but might not be as discriminative as "a man in a red jacket is skiing on one ski" and "a man in a purple dragon costume is skiing down a steep slope".
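The distinction can be sketched with precomputed embeddings (the toy vectors below are illustrative; in practice they would come from a CLIP image/text encoder):

```python
import numpy as np

# Toy precomputed embeddings (rows = samples). Images 0 and 1 are similar,
# and so are their captions -- mimicking two generic "a man is skiing" captions.
image_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
text_emb = np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]])


def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))


# Per-sample CLIP score: similarity of each image to its own caption.
# All three pairs score highly here.
clip_scores = cosine(image_emb, text_emb)

# Discriminativeness: does each caption rank its OWN image first among all images?
sim_matrix = text_emb @ image_emb.T  # caption x image similarities
top1 = sim_matrix.argmax(axis=1)
# Caption 1 actually retrieves image 0, so recall@1 is 2/3 despite high CLIP scores.
recall_at_1 = (top1 == np.arange(len(text_emb))).mean()
```

High per-sample similarity and high retrieval recall measure different things, which is the point of the example above.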

If there is only one reference (and perhaps one candidate caption), would MRR still be suitable for comparison?

If there is only one reference/candidate, you can still use MRR, since it's focused on how well the candidates match the images/references, rather than on overall quality. More references will improve the likelihood of matching something, but that won't impact the utility of the measure.
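As a minimal sketch of MRR over a candidate-to-image similarity matrix (a toy standalone implementation, not vdtk's actual code):

```python
def mean_reciprocal_rank(sim_matrix, correct_idx):
    """Mean reciprocal rank of the correct image for each candidate.

    sim_matrix[i][j]: similarity of candidate i to image j.
    correct_idx[i]:   index of the image candidate i actually describes.
    """
    reciprocal_ranks = []
    for i, sims in enumerate(sim_matrix):
        # 1-based rank of the correct image among all images.
        rank = 1 + sum(1 for s in sims if s > sims[correct_idx[i]])
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


sims = [
    [0.9, 0.4, 0.1],  # candidate 0: correct image (index 0) ranked 1st -> RR = 1
    [0.5, 0.3, 0.2],  # candidate 1: correct image (index 1) ranked 2nd -> RR = 1/2
]
mrr = mean_reciprocal_rank(sims, [0, 1])  # -> 0.75
```

Note that each candidate's reciprocal rank is well-defined even with a single candidate per image, which is why MRR remains usable in that setting.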

For the potential improvement - I'm certainly not against adding such a field, or extending the media path to provide binary file offsets (i.e. path + offset into a tar file). It should be a pretty simple change (I think).
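One way the path + offset idea could be sketched with the standard library's `tarfile` (function names here are illustrative; this assumes an uncompressed tar so members can be read by raw byte offset):

```python
import tarfile


def build_offset_index(tar_path: str) -> dict:
    """Map member name -> (data offset, size) for random access into a tar."""
    index = {}
    with tarfile.open(tar_path, "r:") as tf:  # uncompressed tar only
        for member in tf.getmembers():
            index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path: str, offset: int, size: int) -> bytes:
    """Read one member's bytes directly, without re-parsing the archive.

    A "media_path" could then encode tar_path + offset + size instead of
    pointing at an individual image file.
    """
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

Building the index once up front keeps per-image loads to a single seek and read.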

xk-huang commented 1 year ago

Thank you for your detailed response! It has provided valuable insights into the CLIP score and MRR evaluation methods!