HaozheZhao / UltraEdit


About the Emu Edit Benchmark metrics. #18

Open hanzhn opened 2 months ago

hanzhn commented 2 months ago

I appreciate your excellent work on instruction-based editing. Thanks for your efforts!

I have some questions for you about the Emu Edit Benchmark metrics.

  1. Which specific versions of CLIP and DINO are used to calculate these metrics? I can't find any clue in the Emu Edit paper or in your report.
  2. Have you noticed that the dataset Emu Edit provides on HuggingFace, emu_edit_test_set_generations, may have mistakenly swapped the test and validation splits? Its record counts do not match the paper or their other HuggingFace repo, emu_edit_test_set. If this is the case, which split should I use to calculate the metrics, and which split did you use for your reported numbers?
hanzhn commented 2 months ago

Or could you point me to authoritative code for these calculations? That would be helpful.

HaozheZhao commented 2 months ago

Hi,

Thank you for bringing these issues to our attention.

Versioning for Metrics Calculation

We've noticed that the original Emu Edit paper and dataset do not specify the versions of CLIP and DINO used. To align with other benchmarks, we adopted the settings used by MagicBrush (GitHub Repository): "ViT-B/32" for CLIP and "dino_vits16" for the DINO embeddings. For consistency, we reran all results reported in our paper on the Emu Edit benchmark with these settings.
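For concreteness, here is a minimal sketch of loading those two model versions and computing image-image cosine similarities. This is not our exact evaluation code; it assumes the openai `clip` package, `torch`, and `torchvision` are installed, and the helper names are illustrative:

```python
import torch
import clip  # openai/CLIP package, as used in the MagicBrush evaluation setup
from PIL import Image
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# CLIP ViT-B/32 for the CLIP-based image similarity.
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

# DINO ViT-S/16 ("dino_vits16") from torch.hub for the DINO image similarity.
dino_model = torch.hub.load("facebookresearch/dino:main", "dino_vits16").to(device).eval()

# Standard ImageNet preprocessing for the DINO encoder.
dino_preprocess = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def clip_image_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between CLIP ViT-B/32 image embeddings."""
    with torch.no_grad():
        feats = clip_model.encode_image(
            torch.stack([clip_preprocess(img_a), clip_preprocess(img_b)]).to(device)
        )
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

def dino_image_similarity(img_a: Image.Image, img_b: Image.Image) -> float:
    """Cosine similarity between DINO ViT-S/16 embeddings."""
    with torch.no_grad():
        feats = dino_model(
            torch.stack([dino_preprocess(img_a), dino_preprocess(img_b)]).to(device)
        )
        feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()
```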

Dataset Splits and Inconsistencies

Regarding the dataset split issue: we used the test split of emu_edit_test_set for our evaluations. Because of the swapped splits, our reported results were based on the validation split of emu_edit_test_set_generations.
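If you want to verify the swap yourself, checking the record counts per split is enough; something like the snippet below works, assuming the `datasets` library is installed (the `facebook/` prefix is an assumption, so substitute the exact repo paths you are using):

```python
from datasets import load_dataset

# Dataset IDs shown with an assumed "facebook/" org prefix; adjust to the actual repos.
for repo in ["facebook/emu_edit_test_set", "facebook/emu_edit_test_set_generations"]:
    ds = load_dataset(repo)
    # Print the number of records in each available split to compare against the paper.
    print(repo, {split: len(ds[split]) for split in ds})
```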

Also, there are known quality issues with the benchmark, as discussed in this discussion thread: some image-caption pairs appear incorrect, with placeholder captions (e.g., 'a train station in city') or identical source and target captions.

Evaluation Code

For the metrics evaluation, we followed the MagicBrush evaluation script (GitHub Link) closely for both benchmarks, with no major modifications. We plan to share our refined evaluation code soon; in the meantime, you can refer to the MagicBrush script directly.
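For reference, the text-alignment side of that evaluation boils down to CLIP similarity between the edited image and the target caption. The sketch below is not the MagicBrush code itself, just an illustrative helper (`clip_t`) using the same "ViT-B/32" model:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_t(edited_image: Image.Image, target_caption: str) -> float:
    """Cosine similarity between the edited image and the target caption (CLIP-T style)."""
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(edited_image).unsqueeze(0).to(device))
        txt_feat = model.encode_text(
            clip.tokenize([target_caption], truncate=True).to(device)
        )
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()
```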