How to evaluate the image difference description?

Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

https://otter-ntu.github.io/

MIT License

How to evaluate the image difference description? #283

Open ElegantLin opened 10 months ago

ElegantLin commented 10 months ago

Hi authors,

Thanks for your great repo. I have checked the eval folder, and I wonder whether you have a specific dataset for evaluating image difference description, since you have two training sets for this task.

The tag should be evaluation.

Thanks!

Luodian commented 10 months ago

Oh, that depends on how you set it up in training. If you choose to load the two images from SD and GSD as in-context examples, the prompt should be <image><image>User: What's the difference of these two images? GPT:<answer> xxxxxx. The vision tensor should have shape [1, 2, 1, 3, 224, 224], where 2 is the in-context dimension.

Then in evaluation, you should do it the same way; that worked in our experiments. However, we have not released an image model trained on SD and GSD, since we found that it deteriorates benchmark performance lol.
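The in-context setup described above can be sketched roughly as follows. This is only a minimal illustration of the tensor shape and prompt string, not code from the Otter repo: the helper names `build_pair_vision_tensor` and `build_pair_prompt` are hypothetical, NumPy arrays stand in for the output of whatever image processor you use, and the two images are assumed to be already resized and normalized to (3, 224, 224).

```python
import numpy as np

def build_pair_vision_tensor(img_a, img_b):
    """Stack two preprocessed images of shape (3, H, W) into (1, 2, 1, 3, H, W).

    Axes: batch, in-context images, frames per image, channels, height, width.
    """
    pair = np.stack([img_a, img_b], axis=0)  # (2, 3, H, W)
    pair = pair[:, np.newaxis]               # (2, 1, 3, H, W): one frame per image
    return pair[np.newaxis]                  # (1, 2, 1, 3, H, W): batch of one

def build_pair_prompt(question):
    """One <image> token per in-context image, then the question (hypothetical helper)."""
    return f"<image><image>User: {question} GPT:<answer>"

# Placeholder arrays standing in for two preprocessed images (assumption).
img_a = np.zeros((3, 224, 224), dtype=np.float32)
img_b = np.ones((3, 224, 224), dtype=np.float32)

vision_x = build_pair_vision_tensor(img_a, img_b)
prompt = build_pair_prompt("What's the difference of these two images?")
print(vision_x.shape)
print(prompt)
```

The same tensor layout and prompt format would be used at both training and evaluation time, as noted above.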

Another way is to load the pair as a 2-frame video; in that case the prompt should be: <image>User: xxx.
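A rough sketch of the 2-frame video variant, under the same caveats: the shape (1, 1, 2, 3, 224, 224) is an assumption that follows from treating the pair as a single visual input with two frames (it is not stated in this thread), the helper names are hypothetical, and the "GPT:<answer>" suffix is assumed to match the image case.

```python
import numpy as np

def build_two_frame_video_tensor(frame_a, frame_b):
    """Pack two preprocessed frames of shape (3, H, W) into (1, 1, 2, 3, H, W).

    Assumed axes: batch, visual inputs, frames, channels, height, width.
    """
    frames = np.stack([frame_a, frame_b], axis=0)  # (2, 3, H, W)
    return frames[np.newaxis, np.newaxis]          # (1, 1, 2, 3, H, W)

def build_video_prompt(question):
    """A single <image> token covers the whole 2-frame clip (hypothetical helper)."""
    return f"<image>User: {question} GPT:<answer>"

# Placeholder arrays standing in for two preprocessed images (assumption).
frame_a = np.zeros((3, 224, 224), dtype=np.float32)
frame_b = np.ones((3, 224, 224), dtype=np.float32)

video_x = build_two_frame_video_tensor(frame_a, frame_b)
video_prompt = build_video_prompt("What's the difference of these two images?")
print(video_x.shape)
print(video_prompt)
```

Compared with the in-context layout, the 2 moves from the second axis to the third, and the prompt carries one <image> token instead of two.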

ElegantLin commented 10 months ago

Thanks for your quick response. I will close this issue after I have tried it.

Thanks!

ElegantLin commented 10 months ago

BTW, I understand that training on the image difference description datasets will greatly hurt benchmark performance. Is my understanding correct?

If so, which benchmarks do you mean, e.g. image captioning or VQA tasks?

Thanks!

Luodian commented 10 months ago

It will not greatly hurt performance if SD and GSD are used for joint training with general image-text instruction-tuning datasets.

The performance here includes COCO Caption, MMBench... Sorry, I cannot reveal too much since we will have a code/paper release soon lol. We will also propose ways to remedy such deterioration.