adobe-research / custom-diffusion

Custom Diffusion: Multi-Concept Customization of Text-to-Image Diffusion (CVPR 2023)
https://www.cs.cmu.edu/~custom-diffusion

Questions about Dataset and Evaluation #30

Open atjeong opened 1 year ago

atjeong commented 1 year ago

Hello, I am trying to reproduce the results of your paper, but I found that some of the training images are not available at the link.

Specifically, there are no images for the flower, table, and chair concepts. Also, there are eight images for the dog concept and 136 images for the moongate concept, which differs from the paper (according to Table 5 in the Appendix, there should be ten images for the dog and 135 for the moongate).

If possible, could I get the images for the flower, table, and chair concepts, and the full image set for the dog?

Also, I want to ask about some details of the evaluation, for clarity.

1) I think the same set of images generated with prompts containing the modifier token (20 prompts x 50 samples = 1000 images) is used for measuring both the image-alignment and text-alignment metrics, and the set generated with prompts without the modifier token (20 prompts x 50 samples = 1000 images) is used for KID. Is that right?

2) For measuring KID, how are the validation images retrieved from LAION-400M? In the documentation, you mention that "Our results in the paper are not based on the clip-retrieval for retrieving real images as the regularization samples." What method, then, do you use for the retrieval (for both training and evaluation)?

3) For measuring KID, we have 20 prompts per concept. When you retrieved the images from LAION-400M, did you retrieve 25 images per prompt to make the 500 validation images? And when retrieving the real images, I assume the modifier token is removed from the prompts. Is that right?

nupurkmr9 commented 1 year ago

Hi,

Regarding the dog dataset, Table 5 in the Appendix has a typo; we used only 8 images. I will correct that in the next version of our paper. For the moongate images, one of the files in the folder has a .gif extension, which is filtered out by the file-extension check during training. For the flower, table, and chair datasets, we cannot release them because of copyright issues.

For the questions related to evaluation setup:

  1. We used 20 prompts x 50 samples = 1000 images to evaluate text-alignment and image-alignment in the single-concept case. In the multi-concept case, we used 8 prompts x 50 samples = 400 images. (A rough sketch of these two metrics is at the end of this comment.)

  2. To evaluate overfitting with the KID metric, we collected 500 (image, caption) pairs from LAION-400M. We then generated 1000 images using the 500 captions (2 images per caption). KID is calculated between the 500 real images and 1000 generated images.

  3. To collect the 500 (image, caption) pairs from LAION-400M, I used an internally downloaded copy of LAION-400M. The criterion is similar to the one used to collect regularization images: e.g., for the dog model, we collect images whose captions have >0.85 similarity to "dog" (a different set from the regularization set), as sketched right below this list. But clip-retrieval should serve a similar purpose and can also be used to collect the validation (image, caption) set.
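
A minimal sketch of that caption-similarity filtering, assuming a locally downloaded LAION-400M caption dump (the CLIP checkpoint and the stand-in caption list are illustrative, not our exact internal script):

```python
# Illustrative sketch: keep LAION-400M (image, caption) pairs whose caption has
# CLIP text similarity > 0.85 to the concept name. Not the exact internal script;
# the checkpoint choice and caption list are placeholders.
import torch
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval().to(device)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed(texts):
    tokens = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    feats = model.get_text_features(**tokens)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit norm, so dot product = cosine

concept = embed(["dog"])  # target concept name
captions = ["photo of a cute dog", "mid-century wooden table"]  # stand-in for the local dump
sims = (embed(captions) @ concept.T).squeeze(-1)
selected = [c for c, s in zip(captions, sims.tolist()) if s > 0.85]
```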

I hope this clarifies your doubts. Let me know if you need any more details. Thanks!
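
For completeness, both alignment metrics in point 1 are CLIP-space cosine similarities; here is a rough sketch (the checkpoint and averaging details are assumptions, not our exact evaluation code):

```python
# Rough sketch of text-alignment / image-alignment as mean CLIP cosine
# similarities; checkpoint choice and averaging details are assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval().to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def image_feats(pil_images):
    inputs = processor(images=pil_images, return_tensors="pt").to(device)
    f = model.get_image_features(**inputs)
    return f / f.norm(dim=-1, keepdim=True)

@torch.no_grad()
def text_feats(prompts):
    inputs = processor(text=prompts, padding=True, return_tensors="pt").to(device)
    f = model.get_text_features(**inputs)
    return f / f.norm(dim=-1, keepdim=True)

def text_alignment(generated, prompts):
    # each generated image is paired with the prompt it was sampled from
    return (image_feats(generated) * text_feats(prompts)).sum(-1).mean().item()

def image_alignment(generated, targets):
    # mean pairwise similarity between generated and target-concept images
    return (image_feats(generated) @ image_feats(targets).T).mean().item()
```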

atjeong commented 1 year ago

Thank you for your kind reply :) It helps a lot.

Then, for measuring the KID metric, do you manually calculate the similarity between the CLIP representations of the texts (the caption from LAION-400M and "dog")? And in that case, what is the difference from clip-retrieval?

I saw your code for the retrieval, and it uses LAION5B. Is there any reason to use LAION5B instead of LAION-400M? The code: https://github.com/adobe-research/custom-diffusion/blob/main/src/retrieve.py#L13

Thank you!

nupurkmr9 commented 1 year ago

For measuring the KID metric, we use the clean-fid library, which computes KID given two folders of images: (1) the 1000 generated images and (2) the real validation set of 500 images.
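
With clean-fid this is essentially a one-liner (the folder paths below are placeholders for the two sets just described):

```python
# KID between the two image folders via clean-fid (pip install clean-fid).
# Folder names are placeholders.
from cleanfid import fid

# (1) 1000 images generated from the 500 LAION-400M captions (2 per caption)
# (2) the 500 real validation images from LAION-400M
kid_score = fid.compute_kid("generated_1000/", "real_500/")
print(f"KID: {kid_score:.5f}")
```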

I added LAION5B since it is also an option in clip-retrieval. One advantage of LAION-400M is that it yields English-language captions, which may be better for some target concepts. To replicate our settings, the 500 validation images for KID computation should be collected from LAION-400M.
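
For reference, the dataset choice is just the indice_name passed to clip-retrieval's ClipClient; a sketch follows (the backend URL and index names depend on what the hosted service currently exposes, so treat them as assumptions):

```python
# Sketch of a clip-retrieval query; the backend URL and index names are
# assumptions that depend on what the hosted knn service currently exposes.
from clip_retrieval.clip_client import ClipClient, Modality

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # hosted backend (may change over time)
    indice_name="laion5B-L-14",              # swap in a LAION-400M index to match our setup
    modality=Modality.IMAGE,
    num_images=500,                          # enough hits to build the validation set
)
# Each result is a dict with fields such as "url", "caption", and "similarity".
results = client.query(text="photo of a dog")
```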

Thanks.

atjeong commented 1 year ago

That clarifies my questions and doubts. Thank you very much. :)

nupurkmr9 commented 1 year ago

Hi, I have also updated the data.zip folder with text files that contain links to the images we used in our paper for the flower, table, and chair concepts.
Thanks.

atjeong commented 1 year ago

I am happy to hear that. Thanks a lot!