TIGER-AI-Lab / ImagenHub

A one-stop library to standardize the inference and evaluation of all the conditional image generation models. (ICLR 2024)
https://tiger-ai-lab.github.io/ImagenHub/
MIT License

Different CLIPScore #16

Open geknow opened 4 months ago

geknow commented 4 months ago

Describe the bug
During the evaluation of Text-Guided Image Generation, the CLIPScore I obtain is not consistent with the results documented in the paper. I have compared 197 images; they are identical to the ones shown on Visualize ImagenHub_Text-Guided_IG (chromaica.github.io). However, the CLIPScore computed for these same images is different.

To Reproduce
My metrics code is below:

from imagen_hub.metrics import MetricCLIPScore
from imagen_hub.utils import load_image

import pandas as pd
import json
import os

base_path = 'ImagenHub_Text-Guided_IG'
with open(f'{base_path}/dataset_lookup.json') as f:
    dataset = json.load(f)

rows = []

for indicator in ['MetricCLIPScore']:
    model = MetricCLIPScore()

    for model_name in os.listdir(base_path):
        # Only score model output folders; skip files such as dataset_lookup.json.
        if not os.path.isdir(f'{base_path}/{model_name}'):
            continue

        list_generated_images = []
        list_prompts = []
        for key, value in dataset.items():
            generated_image = load_image(f"{base_path}/{model_name}/{key}")
            list_generated_images.append(generated_image)
            list_prompts.append(value['prompt'])

        all_score = [model.evaluate(x, y) for (x, y) in zip(list_generated_images, list_prompts)]
        avg_score = sum(all_score) / len(dataset)
        rows.append({'model': model_name, 'avg_score': avg_score, 'indicator': indicator})
        print(model_name, " ====> avg Score : ", avg_score)

df = pd.DataFrame(rows, columns=['model', 'avg_score', 'indicator'])

Expected behavior
The same scores as presented in the arXiv paper.

(screenshot attached)
vinesmsuic commented 3 months ago

Hi, can you report your torchmetrics version?

pip list | grep torchmetrics
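
(If grep is not available, e.g. on Windows, the installed version can also be printed directly from Python; an equivalent one-liner:)

python -c "import torchmetrics; print(torchmetrics.__version__)"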
geknow commented 3 months ago

Hi, can you report your torchmetrics version?

pip list | grep torchmetrics

torchmetrics 1.3.1

geknow commented 3 months ago

Here is my metric result; the second column is avg_score.

(screenshot attached)
vinesmsuic commented 3 months ago

Just realized the code implementation in the dev branch (where we ran the experiments) is different. It is now aligned with our paper. Thanks for catching this bug!

You can do

pip install -e .

to update the ImagenHub library.
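
(Note: if ImagenHub was installed from a local clone, you presumably need to pull the latest commits first before re-running the editable install, e.g.:)

git pull
pip install -e .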

Our results with the latest ImagenHub code:

dict_keys(['PixArtAlpha', 'SDXL', 'SDXLTurbo', 'DALLE3', 'SD', 'Wuerstchen', 'SDXLLightning', 'DeepFloydIF', 'OpenJourney', 'StableCascade', 'LCM', 'StableUnCLIP', 'UniDiffuser', 'DALLE2', 'Kandinsky', 'PlayGroundV2', 'SSD', 'Midjourney'])
DALLE2 | Inferencing
====> CLIPScore | Avg:  0.2712
DALLE3 | Inferencing
====> CLIPScore | Avg:  0.2697
DeepFloydIF | Inferencing
====> CLIPScore | Avg:  0.2814
Kandinsky | Inferencing
====> CLIPScore | Avg:  0.2838
LCM | Inferencing
====> CLIPScore | Avg:  0.2681
Midjourney | Inferencing
====> CLIPScore | Avg:  0.2839
OpenJourney | Inferencing
====> CLIPScore | Avg:  0.2791
PixArtAlpha | Inferencing
====> CLIPScore | Avg:  0.2822
PlayGroundV2 | Inferencing
====> CLIPScore | Avg:  0.2883
SD | Inferencing
====> CLIPScore | Avg:  0.2899
SDXL | Inferencing
====> CLIPScore | Avg:  0.2886
SDXLLightning | Inferencing
====> CLIPScore | Avg:  0.2863
SDXLTurbo | Inferencing
====> CLIPScore | Avg:  0.2889
SSD | Inferencing
====> CLIPScore | Avg:  0.2911
StableCascade | Inferencing
====> CLIPScore | Avg:  0.291
StableUnCLIP | Inferencing
====> CLIPScore | Avg:  0.2657
UniDiffuser | Inferencing
====> CLIPScore | Avg:  0.2674
Wuerstchen | Inferencing
====> CLIPScore | Avg:  0.2868
geknow commented 3 months ago

(screenshot of metric results attached)

Could you please also check the ImageReward indicator? It seems odd that UniDiffuser's score is a negative number.

geknow commented 3 months ago

The CLIPScore values are still slightly different, though the difference is very small.

(screenshot attached)
geknow commented 3 months ago

Maybe change the metric code to the following?

# generated_image = (generated_image * 255).astype("uint8")
generated_image = generated_image.astype("uint8")

or change your original code to:

transform = transforms.Compose([
    transforms.ToTensor(),
])
generated_image = (transform(generated_image) * 255).unsqueeze(0).float().to(self.device)
clip_score = self.model(generated_image, prompt).detach()
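
For reference, a minimal standalone sketch (not ImagenHub's code; the image path, prompt, and CLIP checkpoint below are placeholders) of why the pixel range matters: torchmetrics' CLIPScore passes images to the Hugging Face CLIP processor, which rescales pixel values by 1/255, so it expects inputs in [0, 255], and ToTensor() output that is already in [0, 1] produces a different score.

import torch
from PIL import Image
from torchvision import transforms
from torchmetrics.multimodal.clip_score import CLIPScore

# Placeholder model choice; ImagenHub may use a different CLIP checkpoint.
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")   # placeholder image path
prompt = "a photo of a dog"                        # placeholder prompt

float01 = transforms.ToTensor()(image)             # float32, values in [0, 1]
uint8 = (float01 * 255).to(torch.uint8)            # uint8, values in [0, 255]

print(metric(uint8, prompt).item())                # expected input range
print(metric(float01, prompt).item())              # [0, 1] input gives a different score
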
vinesmsuic commented 3 months ago

Could you please also check the ImageReward indicator? It seems odd that UniDiffuser's score is a negative number.

Hi, can you open another issue to discuss the ImageReward issue?