AlonzoLeeeooo / LCDG

The official code implementation of "LaCon: Late-Constraint Diffusion for Steerable Guided Image Synthesis".
https://arxiv.org/abs/2305.11520

Which features are extracted from the UNet? At which conference will this work be published? #3

Closed · lyf1212 closed this issue 4 months ago

lyf1212 commented 7 months ago

Thank you for your amazing work on controlling the diffusion generation process! However, I could not find any ablation experiments on your choice of features in the UNet, which, to me, matters a lot for both the controllability and the quality of the results. Are the features in the encoder too noisy to be used for training the Condition Adaptor? Could you provide some more experiments? Thanks a lot!

AlonzoLeeeooo commented 7 months ago

Hi @lyf1212,

Thank you for your attention to our work! Unfortunately, our paper was rejected by CVPR 2024. We are currently working on re-submitting this work to other conferences with more complete experiments. The re-submitted version of our paper will include ablation studies on the extracted features.

We will update the arXiv paper and this codebase as soon as everything is ready. Please feel free to contact me here if you have any other questions about our work.

Best, Chang

lyf1212 commented 7 months ago

I think this idea is novel and sheds some light on the interpretability of the diffusion UNet. Good luck!

AlonzoLeeeooo commented 7 months ago

Thank you for your kind words!

AlonzoLeeeooo commented 4 months ago

Hi @lyf1212 ,

Hope this message finds you well! We have conducted the ablation studies on the features extracted from the U-Net; the qualitative results are shown in the figure below.

[figure: qualitative ablation of extracted U-Net features]

In this figure, "Enc." and "Dec." denote that we use only the features extracted from the encoder and the decoder parts of the U-Net, respectively. For the edge condition, we observe that "Enc." produces results that follow the text prompts but contain inconsistent artifacts (e.g., the tiger is standing in the river), whereas "Dec." generates results that are more consistent but fail to follow the text prompts. Similar trends are observed in the results conditioned on color strokes. Furthermore, one can see that training with incomplete features results in inferior condition-image alignment in the produced images, e.g., color discrepancies in the results guided by color strokes.

We draw two possible insights about the diffusion model: (1) the encoder part mainly integrates high-level semantics from the text prompts, and (2) the decoder part mainly processes low-level features to maintain the overall consistency of the generated images. You can find more details in the newest version of our arXiv paper at https://arxiv.org/pdf/2305.11520. Pre-trained model weights are available in our Hugging Face and ModelScope repos. Please feel free to have a try.
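
For concreteness, here is a minimal sketch of how such intermediate features could be collected with forward hooks, assuming a diffusers-style `UNet2DConditionModel` that exposes `down_blocks`, `mid_block`, and `up_blocks`. The helper name `collect_unet_features` and its flags are hypothetical; this is an illustration of the idea under those assumptions, not the official LaCon training code. The "Enc." and "Dec." settings above then simply correspond to which block groups are hooked.

```python
# Illustrative sketch (not the official LaCon code): collect intermediate
# U-Net features with forward hooks from a diffusers-style UNet2DConditionModel.
import torch
from diffusers import UNet2DConditionModel  # e.g. from_pretrained(..., subfolder="unet")


def collect_unet_features(unet, latents, timestep, text_emb,
                          use_encoder=True, use_mid=True, use_decoder=True):
    """Run one denoising forward pass and return the hooked block outputs.

    In the ablation described above, "Enc." would correspond to
    use_decoder=False and "Dec." to use_encoder=False.
    """
    features, handles = [], []

    def hook(_module, _inputs, output):
        # Down blocks return (hidden_states, residuals); keep the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        features.append(hidden.detach())

    blocks = []
    if use_encoder:
        blocks += list(unet.down_blocks)
    if use_mid:
        blocks += [unet.mid_block]
    if use_decoder:
        blocks += list(unet.up_blocks)

    for block in blocks:
        handles.append(block.register_forward_hook(hook))
    try:
        with torch.no_grad():
            unet(latents, timestep, encoder_hidden_states=text_emb)
    finally:
        for handle in handles:
            handle.remove()
    return features
```

The collected feature maps have different spatial resolutions, so in practice they would typically be resized to a common size and fused (e.g., concatenated) before being fed to the Condition Adaptor; the exact fusion used in the paper may differ from this sketch.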

If you have any further questions about our paper, please feel free to contact me. This issue will be closed.

Best regards

lyf1212 commented 4 months ago

Thank you for your kind reply~ The qualitative results demonstrate the effectiveness of your design, i.e., using all of the UNet features to train the Aligner works better. However, some of your statements confuse me further. You argue that "encoder features hold more high-level information while decoder features maintain low-level details", but the ablation above does not fully convince me of this. In the tiger example, the Aligner trained on encoder features generates a green background, which probably indicates that the encoder features also carry low-level information. In addition, I cannot find any low-level cues in the two cases above that indicate "decoder features maintain more low-level information".

Also, the definitions of "Enc." and "Dec." may be ambiguous: do you include or exclude the middle features of the UNet in both ablation settings? My original idea was to discover how "feature depth" influences the training of the Aligner, as Fig. 7 of "P+: Extended Textual Conditioning in Text-to-Image Generation" (https://arxiv.org/pdf/2303.09522) shows.

Besides, as an immature idea, different conditions may suit different feature levels. For example, Canny edge conditions may be suitable for low-level features, color strokes or palettes for mid-level information, and HED or binary mask conditions for high-level features.

By the way, it seems that your updated arXiv paper is a submission to NeurIPS 2024. The idea is elegant and the experiments are sufficient to me. Good luck!
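
As an aside, the conjecture above can be written down as a small, purely hypothetical configuration. Everything here is an assumption for illustration (the condition names, the level assignments, and the mapping from levels to U-Net block groups); none of it comes from the LaCon codebase, and the thread itself debates which block groups are "high" or "low" level.

```python
# Purely hypothetical sketch of the conjecture above: pick which feature depth
# to train the Aligner on, depending on the condition type.
CONDITION_TO_FEATURE_LEVEL = {
    "canny_edge":   "low",    # fine spatial detail
    "color_stroke": "mid",
    "palette":      "mid",
    "hed_edge":     "high",
    "binary_mask":  "high",
}

# One possible (assumed) realisation of the levels in terms of U-Net block
# groups; other assignments are equally plausible.
LEVEL_TO_BLOCK_GROUPS = {
    "low":  ["down_blocks"],
    "mid":  ["mid_block"],
    "high": ["up_blocks"],
}


def block_groups_for_condition(condition_type):
    """Return the U-Net block-group names to extract features from."""
    return LEVEL_TO_BLOCK_GROUPS[CONDITION_TO_FEATURE_LEVEL[condition_type]]


print(block_groups_for_condition("canny_edge"))  # ['down_blocks']
```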

AlonzoLeeeooo commented 4 months ago

Hi @lyf1212 ,

Thank you for your kind words! I understand your motivation. In both ablation settings we include the middle features from the U-Net by default. In fact, we have ablated this and found that the model performs similarly with or without the middle features, so we keep the default setting.

Thanks for sharing the paper "P+: Extended Textual Conditioning in Text-to-Image Generation". It does have a similar idea of investigating the impact of features from different layers of the diffusion model. Here are my two cents, which might not be right: P+ mainly focuses on the personalized image generation task, so the notion of "different levels" there might differ from the one in conditional image generation. In that setting the model mainly interacts with the text prompt, which corresponds to the semantic meaning of vision features, so we simply regard those as high-level features.

For conditional image generation, the process focuses more on spatial information, e.g., edges, shapes, and colors, which do not carry significant semantic meaning. Therefore, we define these kinds of image features as low-level vision features.

I am not sure whether this definitional misalignment is what caused the confusion, but I hope the above explanation addresses your concern. If you have any further questions about our work, please do not hesitate to contact us again.

Best regards