A Style-Based Generator Architecture for Generative Adversarial Networks

Abstract

This review draws on the paper and the following post:
- Style-based GANs – Generating and Tuning Realistic Artificial Faces
The authors also wrote Progressive Growing of GANs for Improved Quality, Stability, and Variation ([paper][review]), so it becomes clear while reading that the base paper is Progressive GAN (ProGAN).
ProGAN generates high-quality images at 1024x1024 resolution. It does this
- through progressive training:
  - G and D start training on very low-resolution images (e.g. 4x4), and layers that handle higher and higher resolutions are added step by step.
  - As layers are added, the network goes on to learn progressively finer, more detailed attributes.
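A tiny sketch of the resolution schedule that progressive training walks through. It assumes each new stage doubles the height and width, from 4x4 up to 1024x1024; the function name is made up for illustration.

```python
def progressive_resolutions(start=4, final=1024):
    """Return the square resolutions visited by progressive training."""
    resolutions = []
    res = start
    while res <= final:
        resolutions.append(res)
        res *= 2  # assumption: each new stage doubles height and width
    return resolutions

schedule = progressive_resolutions()
print(schedule)       # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
print(len(schedule))  # 9
```

So going from 4x4 to 1024x1024 takes nine resolution levels.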

ProGAN produces high-quality images, but controlling a specific attribute of the generated image is very hard (true of most models, in fact). Changing one attribute does not affect that attribute alone (a 1:1 mapping from an input to an attribute would be ideal), because the attributes are tied together, i.e. entangled.

StyleGAN proposes a solution to this problem.

StyleGAN

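An upgraded version of ProGAN.
Each layer of ProGAN has its own distinct characteristics, and this turns out to be a potential benefit: different visual features can be controlled at different layers.
The lower the layer (and its resolution), the coarser the features it affects.
This work divides the layers into three groups: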
- Coarse – resolution of up to 8^2 – affects pose, general hair style, face shape, etc.
- Middle – resolution of 16^2 to 32^2 – affects finer facial features, hair style, eyes open/closed, etc.
- Fine – resolution of 64^2 to 1024^2 – affects color scheme (eye, hair and skin) and micro features.
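The three groups above can be written down as a simple lookup from a layer's resolution to its group. This is only an illustrative sketch; the function name and the exact boundary checks are my own assumptions.

```python
def style_group(resolution):
    """Map a layer resolution to the coarse/middle/fine group listed above."""
    if resolution <= 8:
        return "coarse"  # pose, general hair style, face shape
    if resolution <= 32:
        return "middle"  # finer facial features, eyes open/closed
    return "fine"        # color scheme and micro features

for res in [4, 8, 16, 32, 64, 128, 256, 512, 1024]:
    print(res, style_group(res))
```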

Mapping Network

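The goal of this stage is to encode the input vector into an intermediate vector whose elements control different visual attributes.
- Normally a random latent vector is fed directly into the generator; the extra network placed in front of it here is what the paper calls the "Mapping Network".
- As noted under ProGAN's drawbacks, it would be ideal if a single input value controlled a single attribute, but in practice each input is tangled up with several attributes (feature entanglement).
- With a separate network (the mapping network), however, the model can produce a vector that does not have to follow the training data distribution, which reduces the correlation between features.
- The output W is a 512-dimensional vector (512x1).
- It consists of an 8-layer MLP.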
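A minimal NumPy sketch of such a mapping network: 8 fully-connected layers from a 512-dimensional z to a 512-dimensional w. The weights are random rather than trained, and the input normalization, the leaky-ReLU slope of 0.2 and the initialization scale are assumptions made for this sketch, not details taken from these notes.

```python
import numpy as np

def init_mapping_network(dim=512, num_layers=8, seed=0):
    """Create random (untrained) weights for an MLP with num_layers layers."""
    rng = np.random.default_rng(seed)
    # He-style scaling keeps activations in a reasonable range across layers.
    return [(rng.standard_normal((dim, dim)) * np.sqrt(2.0 / dim), np.zeros(dim))
            for _ in range(num_layers)]

def mapping_network(z, layers, slope=0.2):
    """Map latent codes z of shape (N, dim) to intermediate vectors w of shape (N, dim)."""
    # Assumption: normalize each latent code to unit RMS before the MLP.
    x = z / np.sqrt(np.mean(z ** 2, axis=-1, keepdims=True) + 1e-8)
    for weight, bias in layers:
        x = x @ weight + bias
        x = np.where(x > 0, x, slope * x)  # leaky ReLU
    return x

layers = init_mapping_network()
z = np.random.default_rng(1).standard_normal((4, 512))
w = mapping_network(z, layers)
print(w.shape)  # (4, 512)
```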

Style-based generator == Style Modules (AdaIN)

Inspired by the paper "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization".
Takes the W produced by the Mapping Network as input.
An AdaIN module is inserted at each layer (each resolution level) of the synthesis network.
It defines the visual expression of the features at that level:
- Each channel of the convolution layer output is first normalized, so that the scaling and shifting in the last step below have the expected effect.
- The intermediate vector w (the mapping network's output) is transformed by another fully-connected layer (marked as A in the figure) into a scale and a bias for each channel.
- The scale and bias vectors then scale and shift each channel of the convolution output, thereby defining the importance of each filter in the convolution. This tuning translates the information in w into a visual representation.
The following figure makes this step clearer (right-hand side).
- The generator’s Adaptive Instance Normalization (AdaIN)
(figure from the paper)
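The three AdaIN steps can be sketched in NumPy roughly as follows. The function and argument names are hypothetical, the affine layer "A" is represented by a plain weight matrix and bias, and starting the scale biases at one is an assumption of this sketch.

```python
import numpy as np

def adain(x, w, affine_weight, affine_bias, eps=1e-8):
    """x: (N, C, H, W) convolution output; w: (N, D) intermediate vectors."""
    channels = x.shape[1]
    # Step 1: normalize each channel of each sample separately.
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = x.std(axis=(2, 3), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    # Step 2: the affine layer "A" turns w into a per-channel scale and bias.
    style = w @ affine_weight + affine_bias  # shape (N, 2C)
    scale = style[:, :channels, None, None]
    bias = style[:, channels:, None, None]
    # Step 3: scale and shift each normalized channel.
    return scale * x_norm + bias

rng = np.random.default_rng(0)
N, C, H, D = 2, 8, 4, 512
x = rng.standard_normal((N, C, H, H))
w = rng.standard_normal((N, D))
A_weight = rng.standard_normal((D, 2 * C)) * 0.01
A_bias = np.concatenate([np.ones(C), np.zeros(C)])  # start near scale=1, bias=0
y = adain(x, w, A_weight, A_bias)
print(y.shape)  # (2, 8, 4, 4)
```

Because of the normalization in step 1, the statistics of each output channel are set entirely by the scale and bias derived from w, not by the incoming features.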

Removing traditional input

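The input to the synthesis network is described below.
Most GAN models, including ProGAN, feed a random latent vector into the generator as its initial input:
- the input of the 4×4 level
In StyleGAN the image can be controlled through W and AdaIN, so that random input is dropped and replaced by a constant (4x4x512).
- The paper says little about why, or by how much, this improves performance.
  - Couldn't a random input work just as well? Constant or random..? (Note that the constant is a learned parameter, not a hand-defined or pre-computed value.)
- One can only guess that it is a way of reducing entanglement.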
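A small sketch of what "replace the input with a constant" means in practice. The class name is hypothetical, and in a real model the constant would be updated by gradient descent; here it is simply initialized to ones and left untrained.

```python
import numpy as np

class ConstantInput:
    """Hypothetical stand-in for the input of the first (4x4) synthesis level."""

    def __init__(self, channels=512, size=4):
        # One tensor shared by every sample; assumed to be a trainable parameter.
        self.const = np.ones((1, channels, size, size))

    def __call__(self, batch_size):
        # Every sample starts from the same constant, so per-image variation
        # has to come from the styles (w) and the noise injected later.
        return np.repeat(self.const, batch_size, axis=0)

start = ConstantInput()
x = start(batch_size=3)
print(x.shape)                     # (3, 512, 4, 4)
print(np.array_equal(x[0], x[1]))  # True
```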

Stochastic variation

Many small aspects of a face can be seen as stochastic, such as freckles, wrinkles and the exact placement of hairs; these features make the image more realistic and increase the variety of outputs.
The common way to insert such small features into a GAN image is to add noise to the input vector.
In many cases, however, the feature entanglement described above makes the effect of the noise hard to control, so the noise ends up affecting other features as well.
So StyleGAN injects noise in a way that resembles the AdaIN mechanism:
- A scaled noise is added to each channel before the AdaIN module.
- It slightly changes the visual expression of the features at the resolution level it operates on.
- The figure below makes this clearer, though I need to look at the source code more closely.
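A minimal sketch of the noise injection step, assuming one single-channel Gaussian noise image per sample that is broadcast to all channels with a per-channel scaling factor. The names are hypothetical and the scaling factors are fixed here instead of learned.

```python
import numpy as np

def add_noise(x, noise_scale, rng):
    """x: (N, C, H, W) feature maps; noise_scale: (C,) per-channel factors."""
    n, _, h, w = x.shape
    # One single-channel noise image per sample, shared by all channels.
    noise = rng.standard_normal((n, 1, h, w))
    return x + noise_scale[None, :, None, None] * noise

rng = np.random.default_rng(0)
x = np.zeros((2, 8, 16, 16))
scale = np.full(8, 0.1)
y = add_noise(x, scale, rng)
print(y.shape)  # (2, 8, 16, 16)
```

The output of this step would then go into the AdaIN module of the same resolution level, so the noise only perturbs features at that level.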

Style mixing (this part needs more study)

To further encourage the styles to localize > "mixing regularization"
StyleGAN's G uses the intermediate vector w as an input at every level, as the figures above show.
As a result, the network can learn that the levels are correlated.
To reduce this correlation,
- the model randomly picks two input vectors (two random latent codes),
- and generates an intermediate vector w for each of them.
- In other words, two random vectors z1, z2 yield intermediate vectors w1, w2; despite the involved description, it just seems to mean running the "Mapping Network" twice.
It then trains some of the levels with the first vector and switches (at a random point) to the other to train the remaining levels.
The random switch keeps the network from learning to rely on a correlation between levels.
- paper : When generating such an image, we simply switch from one latent code to another — an operation we refer to as style mixing— at a randomly selected point in the synthesis network.
In other words, mixing regularization keeps the network from assuming that adjacent styles are correlated.
Video link: https://www.youtube.com/watch?time_continue=65&v=kSLJriaOumA
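A rough sketch of the switch between two intermediate vectors at a random point in the synthesis network. The function name is made up, and 18 style inputs (two per resolution level from 4x4 to 1024x1024) is an assumption for the example.

```python
import numpy as np

def mix_styles(w1, w2, num_layers, rng):
    """Return per-layer styles: w1 before a random crossover point, w2 from it on."""
    crossover = int(rng.integers(1, num_layers))  # in 1 .. num_layers - 1
    layer_idx = np.arange(num_layers)[:, None]
    styles = np.where(layer_idx < crossover, w1[None, :], w2[None, :])
    return styles, crossover

rng = np.random.default_rng(0)
w1 = np.zeros(512)  # stand-in for the mapping network output of z1
w2 = np.ones(512)   # stand-in for the mapping network output of z2
styles, crossover = mix_styles(w1, w2, num_layers=18, rng=rng)
print(styles.shape)  # (18, 512)
```

Each row of `styles` would feed the AdaIN module of one layer, so the layers before the crossover see w1 and the rest see w2.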

Truncation trick in W

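Another challenge for G is dealing with regions that are poorly represented in the training set.
- The generator cannot learn such regions well, so it produces bad-looking images there.
- Shrinking or truncating the sampling space is an already known way of dealing with this.
  - It improves average image quality, but reduces variation.
StyleGAN handles this by truncating W.
- This seems to mean that the intermediate vector w is kept close to the average vector.
- That is, the deviation of w from the average of W is scaled down.
Video link: https://www.youtube.com/watch?time_continue=243&v=kSLJriaOumA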
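The truncation can be sketched as scaling the deviation of w from the average vector by a factor psi below 1. The random samples below are only a stand-in for real mapping-network outputs, and psi=0.7 is an arbitrary value picked for the example.

```python
import numpy as np

def truncate(w, w_avg, psi=0.7):
    """Pull w toward the average intermediate vector; psi=1 leaves w unchanged."""
    return w_avg + psi * (w - w_avg)

rng = np.random.default_rng(0)
w_samples = rng.standard_normal((10000, 512))  # stand-in for mapping-network outputs
w_avg = w_samples.mean(axis=0)
w = w_samples[0]
w_trunc = truncate(w, w_avg, psi=0.7)
ratio = np.linalg.norm(w_trunc - w_avg) / np.linalg.norm(w - w_avg)
print(ratio)  # approximately 0.7
```

A smaller psi trades variation for average quality: psi=0 always returns the average vector, while psi=1 disables truncation.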

Fine-tuning

StyleGAN's improvements over the ProGAN baseline include:
- replacing nearest-neighbor up/downsampling with bilinear sampling
- updating several hyperparameters
See Appendix C of the paper.