boschresearch / unetgan

Official Implementation of the paper "A U-Net Based Discriminator for Generative Adversarial Networks" (CVPR 2020)
https://openaccess.thecvf.com/content_CVPR_2020/papers/Schonfeld_A_U-Net_Based_Discriminator_for_Generative_Adversarial_Networks_CVPR_2020_paper.pdf
GNU Affero General Public License v3.0

Hi~ I'm confused about why the model needs Consistency Regularization. #2

Closed cs-xiao closed 3 years ago

cs-xiao commented 3 years ago

Hi! Thanks for your great work; I want to try to use it in my own work! However, I do not understand why the model needs Consistency Regularization, even after reading the paper carefully. For example, the paper says: "The per-pixel decision of the well-trained D discriminator should be equivariant under any class-domain-altering transformations of images." What is the meaning of "any class-domain-altering transformations of images"? In other words, I do not know what problems arise, and what causes them, if the model is trained without Consistency Regularization.

edgarschnfld commented 3 years ago

Hi, Thanks for the questions. Let me try to answer them hereafter:

1. What is the meaning of "any class-domain-altering transformations of images"? The word "class" refers to the real or fake label in this context (together, these two labels constitute two classes). The CutMix transformation is a class-domain-altering transformation of images because it changes the class of an image if we view real and fake as two classes. In other words, it changes the class of the discriminator input from either real or fake to something in-between (a minimal sketch of the CutMix operation follows after this list).

2. Why does this matter? Note that under the CutMix transformation, the class domain is only altered for the encoder part of the discriminator loss (since the encoder performs binary classification of the whole image into the real or fake class). The decoder of the U-Net discriminator, on the other hand, classifies real and fake locally. For this reason, the real/fake classification of the decoder of a well-trained discriminator should not be affected by the CutMix transformation, since the local class domains remain unaltered. To achieve this, we can enforce the inductive bias that the discriminator should give the same output for a given region regardless of whether the surrounding image regions are changed by the CutMix transformation. For this, the following expression should hold:

   T( D(R), D(F) ) = D( T(R,F) )   [Eq. 1]

   where D is the decoder output of the discriminator, T is the CutMix transformation, R is a real image, and F is a fake image. Note the general form T(D(X)) = D(T(X)): in plain words, we should get the same result no matter in which order we apply D and T to the pair of images (R, F). When applying the two transformations in either order yields the same outcome, we call this equivariance. This explains the sentence "The per-pixel decision of the well-trained D discriminator should be equivariant under any class-domain-altering transformations of images."
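For concreteness, here is a minimal PyTorch sketch of the CutMix transformation T from point 1 above. The mask-sampling details and function names are illustrative assumptions, not the repository's exact implementation:

```python
import torch

def sample_cutmix_mask(h, w):
    # Binary mask with one random rectangle set to 0:
    # 1 = take the pixel from the real image, 0 = take it from the fake image.
    # (This simplified uniform sampling is purely illustrative.)
    mask = torch.ones(1, 1, h, w)
    cut_h = torch.randint(1, h, (1,)).item()
    cut_w = torch.randint(1, w, (1,)).item()
    top = torch.randint(0, h - cut_h + 1, (1,)).item()
    left = torch.randint(0, w - cut_w + 1, (1,)).item()
    mask[:, :, top:top + cut_h, left:left + cut_w] = 0.0
    return mask

def cutmix(real, fake, mask):
    # T(R, F): paste a rectangular region of the fake image into the real one.
    return mask * real + (1.0 - mask) * fake
```

The mixed image changes the global (encoder) class to something in-between real and fake, while every pixel still has a well-defined local label given by the mask, which is exactly why the decoder's per-pixel decisions should be unaffected.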

Finally, to enforce Eq. 1, we minimize

||T( D(R), D(F) ) - D( T(R,F) )||^2   [Eq. 2]

This is the consistency loss. The result is that the decoder pays more attention to the local image structures that distinguish real from fake, at all locations, and is not distracted by the global real/fake class domain. Without the consistency regularization, it would be easier for the decoder to take a shortcut and simply imitate the encoder predictions, or to rely on other shortcut features. In effect, you have two discriminators (encoder and decoder) that share information but specialize in different aspects of "realness". Thus, the consistency loss helps to improve the following two aspects: (1) the generator receives more detailed feedback through location-specific real/fake evaluation, and (2) the discriminator loss is very nuanced and does not saturate easily due to inherent disagreement: after all, it is the average of the encoder loss and multiple local losses that all disagree with each other to some extent. Therefore, the discriminator loss provides informative gradients (they do not vanish) for longer than in BigGAN. This is nicely shown in Figure 8 of the arXiv version (https://arxiv.org/pdf/2002.12655.pdf).
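As a concrete illustration, a minimal PyTorch sketch of Eq. 2 could look as follows, assuming `decoder` maps an image to a per-pixel real/fake score map of the same spatial size (the function names here are illustrative, not the repository's API):

```python
def mix(a, b, mask):
    # T(., ., M): blend two tensors with a binary CutMix mask.
    return mask * a + (1.0 - mask) * b

def consistency_loss(decoder, real, fake, mask):
    # || T(D(R), D(F)) - D(T(R, F)) ||^2   [Eq. 2]
    # Left term:  mix the two per-pixel decoder outputs with the mask.
    # Right term: run the decoder on the already-mixed image.
    target = mix(decoder(real), decoder(fake), mask)  # T(D(R), D(F))
    pred = decoder(mix(real, fake, mask))             # D(T(R, F))
    return ((pred - target) ** 2).mean()              # squared error, averaged over pixels
```

Minimizing this loss pushes the decoder toward the equivariance of Eq. 1: its prediction for each pixel should depend only on the local content, not on which class the surrounding regions belong to.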

Hope that helps :)

cs-xiao commented 3 years ago

Good job, man! Now I understand it. Thank you~