eladrich / pixel2style2pixel

Official Implementation for "Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation" (CVPR 2021) presenting the pixel2style2pixel (pSp) framework
https://eladrich.github.io/pixel2style2pixel/
MIT License

Why are the coarse details determined from the larger blocks? #299

Closed Richienb closed 1 year ago

Richienb commented 2 years ago

In a ResNet, the coarser blocks come first:

https://github.com/eladrich/pixel2style2pixel/blob/361117156fc4eb90f463a1ca71eaf7f80d573e67/models/encoders/helpers.py#L32-L35

So why do the coarse style blocks use the fine ResNet blocks?

https://github.com/eladrich/pixel2style2pixel/blob/361117156fc4eb90f463a1ca71eaf7f80d573e67/models/encoders/psp_encoders.py#L95-L105

In the attached video, each sample is randomized by replacing the fine StyleGAN input latents with random noise, so the only thing that differs between the images is the fine layers. Skin tone appears to come from the fine style layers, while the facial features come from the coarse style layers. Is that meant to happen?

https://user-images.githubusercontent.com/29491356/203987089-62e51315-85b4-44f3-8ea6-77e293e9ea2c.mp4
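For readers following along, the experiment described above can be sketched roughly as below. This is a simplified illustration, not the repository's code: the 18-style layout, the split at index 7, and the helper names `random_latent` and `mix_fine_styles` are all assumptions, and the random vector merely stands in for a latent that would normally be sampled through StyleGAN's mapping network.

```python
import random

NUM_STYLES = 18   # assumed: 18 style vectors for a 1024x1024 generator
FINE_START = 7    # assumed: styles 7..17 form the "fine" group
LATENT_DIM = 512  # assumed latent width


def random_latent(rng, dim=LATENT_DIM):
    """Hypothetical stand-in for a randomly sampled style vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def mix_fine_styles(encoded, rng, fine_start=FINE_START):
    """Keep the styles below fine_start and overwrite the rest with one random latent."""
    noise = random_latent(rng, len(encoded[0]))
    return [list(w) if i < fine_start else list(noise)
            for i, w in enumerate(encoded)]


rng = random.Random(0)
# Dummy "encoded" code: style i is a constant vector filled with the value i.
encoded = [[float(i)] * LATENT_DIM for i in range(NUM_STYLES)]
mixed = mix_fine_styles(encoded, rng)
```

Because only the entries from `FINE_START` onward change, any visible difference between samples generated this way has to come from the fine styles.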

Richienb commented 2 years ago

I believe this is because larger blocks have more space to store the same information that is stored in the smaller blocks. That must mean larger blocks end up storing coarser details and smaller blocks end up storing finer details.

Richienb commented 2 years ago

Perhaps this is also because the smaller blocks have had fewer convolutions applied and thus retain much more detail.

yuval-alaluf commented 2 years ago

What @Richienb said is correct. Basically, the coarse ResNet layers (i.e., the early layers) have gone through less processing and store finer details such as colors and texture. Hence these layers are related to the fine StyleGAN layers. In contrast, the fine ResNet layers (i.e., those at the end of the network) have gone through a lot of processing and store more semantic information. As such, they are related to the coarse and medium StyleGAN layers. Hope this helps.
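A small sketch of the correspondence described above, assuming a 256x256 input, three feature maps tapped at strides 4, 8, and 16, and an 18-style split of 0-2 (coarse), 3-6 (medium), and 7-17 (fine). These numbers and the names `FEATURE_MAPS` and `describe` are illustrative assumptions, so check them against `psp_encoders.py` before relying on them.

```python
# Assumed layout, for illustration only:
# (position in the encoder, downsampling stride vs. the input, style indices fed)
FEATURE_MAPS = [
    ("early", 4, range(7, 18)),   # little processing: colors, texture
    ("middle", 8, range(3, 7)),
    ("late", 16, range(0, 3)),    # heavy processing: semantic content
]


def describe(input_size=256):
    """Return (position, spatial size, style group, style indices) per feature map."""
    rows = []
    for name, stride, styles in FEATURE_MAPS:
        size = input_size // stride
        if styles.start >= 7:
            group = "fine"
        elif styles.start >= 3:
            group = "medium"
        else:
            group = "coarse"
        rows.append((name, size, group, list(styles)))
    return rows


for name, size, group, styles in describe():
    print(f"{name:>6} features ({size}x{size}) -> {group} styles {styles[0]}..{styles[-1]}")
```

Under these assumptions, the spatially largest feature map is the earliest one and feeds the fine styles, while the smallest and most processed map feeds the coarse styles.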