geknow opened this issue 8 months ago
Hi, can you report your torchmetrics version?
pip list | grep torchmetrics
torchmetrics 1.3.1
Here is my metric result; the second column is avg_score.
Just realized the code implementation in the dev branch (where we ran the experiments) was different. It is now aligned with our paper. Thanks for catching this bug!
You can run
pip install -e .
to update the ImagenHub library.
Our results with the latest ImagenHub code:
dict_keys(['PixArtAlpha', 'SDXL', 'SDXLTurbo', 'DALLE3', 'SD', 'Wuerstchen', 'SDXLLightning', 'DeepFloydIF', 'OpenJourney', 'StableCascade', 'LCM', 'StableUnCLIP', 'UniDiffuser', 'DALLE2', 'Kandinsky', 'PlayGroundV2', 'SSD', 'Midjourney'])
DALLE2 | Inferencing
====> CLIPScore | Avg: 0.2712
DALLE3 | Inferencing
====> CLIPScore | Avg: 0.2697
DeepFloydIF | Inferencing
====> CLIPScore | Avg: 0.2814
Kandinsky | Inferencing
====> CLIPScore | Avg: 0.2838
LCM | Inferencing
====> CLIPScore | Avg: 0.2681
Midjourney | Inferencing
====> CLIPScore | Avg: 0.2839
OpenJourney | Inferencing
====> CLIPScore | Avg: 0.2791
PixArtAlpha | Inferencing
====> CLIPScore | Avg: 0.2822
PlayGroundV2 | Inferencing
====> CLIPScore | Avg: 0.2883
SD | Inferencing
====> CLIPScore | Avg: 0.2899
SDXL | Inferencing
====> CLIPScore | Avg: 0.2886
SDXLLightning | Inferencing
====> CLIPScore | Avg: 0.2863
SDXLTurbo | Inferencing
====> CLIPScore | Avg: 0.2889
SSD | Inferencing
====> CLIPScore | Avg: 0.2911
StableCascade | Inferencing
====> CLIPScore | Avg: 0.291
StableUnCLIP | Inferencing
====> CLIPScore | Avg: 0.2657
UniDiffuser | Inferencing
====> CLIPScore | Avg: 0.2674
Wuerstchen | Inferencing
====> CLIPScore | Avg: 0.2868
Could you please also check the ImageReward metric? It seems odd that UniDiffuser gets a negative score.
The CLIPScore numbers are still slightly different, though the gap is very small.
Maybe change the code to the following?
# generated_image = (generated_image * 255).astype("uint8")
generated_image = generated_image.astype("uint8")
or change your original code to
transform = transforms.Compose([
    transforms.ToTensor(),
])
generated_image = (transform(generated_image) * 255).unsqueeze(0).float().to(self.device)
clip_score = self.model(generated_image, prompt).detach()
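For reference, here is a minimal standalone sketch of why the scaling matters (the image file and prompt are placeholders, and the behavior is as I understand torchmetrics 1.3.1): transforms.ToTensor() rescales a uint8 image to floats in [0, 1], while CLIPScore's processor rescales pixel values by 1/255 again internally, so the tensor needs to stay in the [0, 255] range or the score changes.

import torch
from PIL import Image
from torchvision import transforms
from torchmetrics.multimodal.clip_score import CLIPScore

# Placeholder inputs for illustration only.
image = Image.open("sample.jpg").convert("RGB")
prompt = "a photo of a cat"

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")

to_tensor = transforms.ToTensor()   # uint8 [0, 255] -> float [0.0, 1.0]
img_01 = to_tensor(image)           # values in [0, 1]
img_255 = img_01 * 255              # values back in [0, 255]

# Feeding the [0, 1] tensor gets rescaled by 1/255 a second time inside
# the metric's processor, so the two calls give different scores.
print("scaled to [0, 255]:", metric(img_255, prompt).item())
print("left in [0, 1]:", metric(img_01, prompt).item())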
Hi, can you open another issue to discuss the ImageReward issue?
Describe the bug
During the evaluation of Text-guided Image Generation, the generated CLIPScore is not consistent with the results documented in the paper. I have compared 197 images; they are identical to the ones shown on Visualize ImagenHub_Text-Guided_IG (chromaica.github.io). However, the CLIPScore computed for these clones is different.
To Reproduce
My metrics code is as below.
Expected behavior
The same score as reported in the arXiv paper.
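The metrics code itself is not reproduced above; as a hypothetical sketch of this kind of per-model evaluation loop (the directory layout, prompt file, and variable names are made up, not the reporter's actual code, and it assumes torchmetrics' CLIPScore API):

import json
from pathlib import Path
from PIL import Image
from torchvision import transforms
from torchmetrics.multimodal.clip_score import CLIPScore

# Hypothetical layout: results/<ModelName>/<uid>.jpg plus a prompts.json
# mapping each uid to its text prompt.
root = Path("results")
prompts = json.loads(Path("prompts.json").read_text())

metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")
to_tensor = transforms.ToTensor()

for model_dir in sorted(root.iterdir()):
    metric.reset()
    for image_path in sorted(model_dir.glob("*.jpg")):
        # Keep pixel values in [0, 255] as the metric expects.
        image = to_tensor(Image.open(image_path).convert("RGB")) * 255
        metric.update(image, prompts[image_path.stem])
    print(f"{model_dir.name} ====> CLIPScore | Avg: {metric.compute().item():.4f}")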