StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

Abstract

Goal: high-quality image generation.
Stacked Generative Adversarial Networks (StackGANs): two proposals.
1. StackGAN-v1, a two-stage GAN for text-to-image synthesis
  - Stage-I: given the text description, sketches the primitive shape and colors of the scene. > generates a low-resolution image
  - Stage-II: takes the Stage-I result together with the text description again. > generates a high-resolution image
2. StackGAN-v2, proposed for both conditional and unconditional generative tasks
  - consists of multiple generators and discriminators arranged in a tree-like structure
    - images of the same scene at multiple scales are generated from different branches of the tree.

PRELIMINARIES

Basic GAN
Conditional GAN
- an extension of GAN
- the case where the basic GAN additionally receives conditioning variables c
  - the generator and discriminator then become G(z, c) and D(x, c)
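As a sketch of what these two cases look like as formulas, the standard GAN min-max objective and its conditional extension are usually written as below. The symbols p_data and p_z (data and noise distributions) follow the common convention and are assumed here.

```latex
% Basic GAN
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}}[\log D(x)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]

% Conditional GAN: both networks also receive c
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}}[\log D(x, c)]
  + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z, c), c))]
```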

STACKGAN-V1: TWO-STAGE GENERATIVE ADVERSARIAL NETWORK

To generate photo-realistic high-resolution images, a two-stage GAN architecture is proposed. > StackGAN-v1
The figure (Fig. 1) shows this architecture.
- It illustrates the text-to-image generative process.
- Stage-I GAN
  - Sketches the primitive shape and basic colors of the object, conditioned on the given text description.
  - Draws the background from a random noise vector.
  - As a result, this stage produces a low-resolution image.
- Stage-II GAN
  - Takes the low-resolution image generated by Stage-I GAN and, once again, the text description as input.
  - The final target, a photo-realistic high-resolution image, is generated in this stage.

Conditioning Augmentation

As shown in Fig. 1, the text description t is first encoded by an encoder, yielding a text embedding φ_t.
- The text embedding is nonlinearly transformed to produce the conditioning latent variables that are fed to G.
- However, the latent space of the text embedding is high-dimensional (> 100), and with a limited amount of data this causes discontinuity in the latent data manifold, which makes it hard to train G.
- "Conditioning Augmentation" is introduced to mitigate this problem.
Conditioning Augmentation
- It produces additional conditioning variables ĉ.
- ĉ is sampled randomly from an independent Gaussian distribution N(μ(φ_t), Σ(φ_t))
  - μ(φ_t): mean
  - Σ(φ_t): diagonal covariance matrix
- To further enforce smoothness over the conditioning manifold and avoid overfitting > add a regularization term
  - KL divergence between the standard Gaussian distribution and the conditioning Gaussian distribution: D_KL(N(μ(φ_t), Σ(φ_t)) || N(0, I))
- The benefit of this randomness is that it models text-to-image translation where the same sentence can correspond to objects with various poses and appearances.
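The sampling step and the KL regularizer of Conditioning Augmentation can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the function names are hypothetical, and parameterizing the diagonal covariance by its log-variance is an assumption made here for numerical convenience.

```python
import math
import random


def sample_condition(mu, log_var, rng=random):
    """Draw c_hat = mu + sigma * eps with eps ~ N(0, I).

    mu and log_var are equal-length lists; log_var holds the log of the
    diagonal entries of the covariance (an assumption of this sketch).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))
```

Because the noise eps is drawn separately from mu and sigma, the sample stays differentiable with respect to both in an autodiff framework, which is what lets the two be learned with the rest of the network.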

Stage-I GAN

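Instead of directly generating a high-resolution image conditioned on the text description, a low-resolution image is generated first.
This stage therefore focuses on drawing only the rough shape and colors of the object.
Stage-I GAN with conditioning variables ĉ_0
- real image: I_0
- text description: t
- I_0 and t are drawn from the true data distribution p_data
- z is a noise vector randomly sampled from a given distribution p_z
- λ is a regularization parameter that balances the two terms in Eq. (4)
  - set to 1 in the experiments
- The reparameterization trick is used > ref. "Auto-Encoding Variational Bayes"
  - so μ_0(φ_t) and Σ_0(φ_t) are learned jointly with the rest of the network.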
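The two Stage-I objectives that the list above describes can be sketched as follows, with D_0 maximizing the first and G_0 minimizing the second. This is a reconstruction under assumptions: φ_t denotes the text embedding, ĉ_0 the sampled conditioning vector, and the adversarial term of the generator is written in the non-saturating form, which may differ from the exact form in the paper.

```latex
\mathcal{L}_{D_0} =
  \mathbb{E}_{(I_0, t) \sim p_{data}}[\log D_0(I_0, \varphi_t)]
  + \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}
    [\log(1 - D_0(G_0(z, \hat{c}_0), \varphi_t))]

\mathcal{L}_{G_0} =
  \mathbb{E}_{z \sim p_z,\, t \sim p_{data}}
    [-\log D_0(G_0(z, \hat{c}_0), \varphi_t)]
  + \lambda\, D_{KL}\big(\mathcal{N}(\mu_0(\varphi_t), \Sigma_0(\varphi_t))
    \,\|\, \mathcal{N}(0, I)\big)
```

The second term of the generator loss is the Conditioning Augmentation regularizer, and λ is the weight that balances it against the adversarial term.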
To extract a visually-discriminative text embedding of the given description, we follow the approach of Reed et al. [34] to pre-train a text encoder.
- a character-level CNN-RNN model
- it maps text descriptions into the common feature space of images by learning a correspondence function between texts and images.
Model Architecture
- A fully connected layer is applied to the text embedding φ_t to generate μ_0 and σ_0 (the diagonal of Σ_0)
  - ĉ_0 is then obtained by combining the two outputs, ĉ_0 = μ_0 + σ_0 ⊙ ε with ε ~ N(0, I); see "Conditioning Augmentation (CA)" in Fig. 1.
    - ĉ_0 is an N_g-dimensional conditioning vector (Fig. 1)
- ĉ_0 is then concatenated with an N_z-dimensional noise vector z
  - this is the input to the "up-sampling blocks", which generate a W_0 x H_0 image > i.e., the Stage-I generator
- The discriminator (D_0) distinguishes images produced by the generator from real images
  - it has a down-sampling structure that reduces the image to an M_0 x N_0 (spatial dimension) image filter map
  - the text embedding φ_t is first compressed to N_d dimensions by an fc layer.
  - the image filter map is concatenated with the (spatially replicated) text embedding, followed by a 1×1 convolutional layer
- Finally, an fc layer produces the decision score as the final output.
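The concat step in the discriminator only makes sense once the N_d-dimensional text vector has the same spatial size as the image filter map. A minimal sketch of that replication, using nested lists in channel-first layout; the function name and the layout are assumptions of this example, not taken from the paper's code.

```python
def fuse_text_with_feature_map(feature_map, text_vec):
    """Stack a text vector onto an image feature map along the channel axis.

    feature_map: nested lists shaped [C][H][W].
    text_vec:    list of N_d floats (the compressed text embedding).
    Returns nested lists shaped [C + N_d][H][W]; every text value is copied
    to all H x W spatial positions so a 1x1 convolution can mix image and
    text features at each location.
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    text_planes = [[[value] * width for _ in range(height)]
                   for value in text_vec]
    image_planes = [[row[:] for row in plane] for plane in feature_map]
    return image_planes + text_planes


# Toy sizes: 2 image channels on a 4x4 grid, N_d = 3.
toy_map = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
fused = fuse_text_with_feature_map(toy_map, [0.5, -1.0, 2.0])
```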

Stage-II GAN

The low-resolution image generated by the Stage-I generator lacks realism and detail.
Stage-II GAN is therefore conditioned on this low-resolution image and, once again, on the text embedding.
That is, it picks up the text information that was ignored in Stage-I (for lack of detail) and generates a more photo-realistic high-resolution image.
Conditioned on the low-resolution result s_0 = G_0(z, ĉ_0) and the Gaussian latent variables ĉ, D and G of Stage-II GAN are trained with Eq. (5) and Eq. (6).
This differs from the original GAN formulation:
the random noise z is not used in this stage, because the randomness is already preserved in s_0.
Model Architecture
- Stage-II generator > encoder-decoder network with residual blocks
D is almost the same as in Stage-I, except that the input images are larger.
- Stage-I: 64x64 > Stage-II: 256x256
- a matching-aware discriminator is used, to learn better alignment between the image and the conditioning text

Implementation details

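The up-sampling blocks consist of nearest-neighbor upsampling followed by a 3×3 stride 1 convolution.
- i.e., each upsampling step is followed by a 3x3, stride 1 conv
- "nearest-neighbor" means that each output pixel copies the value of the closest input pixel, so a 2x upsampling repeats every pixel into a 2x2 block; nothing is learned in this step
- followed by batch norm and ReLU
- In summary, the residual blocks consist of 3×3 stride 1 convolutions, batch normalization, and ReLU.
The down-sampling blocks consist of 4×4 stride 2 convolutions
- followed by batch norm and LeakyReLU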
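Nearest-neighbor upsampling, as used in the up-sampling blocks, is easiest to see on a tiny example. The sketch below works on a single-channel image stored as a list of rows; the function name and the default scale of 2 are choices made for this illustration.

```python
def upsample_nearest(image, scale=2):
    """Enlarge a 2-D image by repeating every pixel scale x scale times."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    out = []
    for row in image:
        # Repeat each pixel horizontally ...
        wide_row = [pixel for pixel in row for _ in range(scale)]
        # ... then repeat the widened row vertically.
        out.extend(list(wide_row) for _ in range(scale))
    return out


small = [[1, 2],
         [3, 4]]
large = upsample_nearest(small)
# large == [[1, 1, 2, 2],
#           [1, 1, 2, 2],
#           [3, 3, 4, 4],
#           [3, 3, 4, 4]]
```

The blocky result is why a 3×3 convolution follows the upsampling: the convolution smooths and refines the repeated pixels.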

STACKGAN-V2: MULTI-DISTRIBUTION GENERATIVE ADVERSARIAL NETWORK

StackGAN-v1 consists of two separate networks = Stage-I GAN & Stage-II GAN
- they model low- to high-resolution image distributions.
Here an end-to-end framework is proposed > StackGAN-v2,
- to model a series of multi-scale image distributions.
The figure (Fig. 2) shows StackGAN-v2
- Its main feature is that it consists of multiple generators (G) and discriminators (D)
- arranged in a tree-like structure.
- Images from low resolution to high resolution are generated from different branches of the tree.
- At each branch,
  - G captures the image distribution at that scale.
  - D estimates the probability that a sample came from the training images of that scale rather than from G.
The generators are trained jointly to approximate the multiple distributions.
This section covers two types of multi-distributions:
- multi-scale image distributions
- joint conditional & unconditional image distributions

Multi-scale image distributions approximation

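The StackGAN-v2 framework has a tree-like structure.
Input: noise vector z ~ p_noise
- prior distribution p_noise: usually the standard normal distribution
Images s_i are generated by the generators at each scale.
The latent variable z is transformed into hidden features layer by layer.
The hidden features h_i for each generator G_i are computed by a non-linear transformation: h_0 = F_0(z); h_i = F_i(h_(i-1), z), i = 1, ..., m-1
- h_i is the hidden feature of the i-th branch
- m is the total number of branches
- F_i is modeled as a neural network > Fig. 2
The noise vector z is concatenated with the previous branch's hidden feature h_(i-1) as the input to F_i for computing h_i.
- Why? To capture information that may have been omitted in the preceding branches
Based on the hidden features of the different layers, the generators produce samples (i.e., the generated images) from small to large scales (s_0, s_1, ..., s_(m-1)): s_i = G_i(h_i)
- G_i is the generator of the i-th branch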
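The branch recursion described above (h_0 from z, each later h_i from h_(i-1) and z, one sample per branch) can be sketched with plain callables. Everything here is a stand-in: `run_branches` is a hypothetical helper, and the toy transforms and generators only mimic the data flow, not real networks.

```python
def run_branches(z, transforms, generators):
    """Compute one sample per branch of the tree.

    transforms[0] takes z alone; transforms[i] (i >= 1) takes (h_prev, z).
    generators[i] maps the hidden feature h_i to the sample s_i.
    """
    if len(transforms) != len(generators):
        raise ValueError("need one transform and one generator per branch")
    hidden = transforms[0](z)
    samples = [generators[0](hidden)]
    for transform, generator in zip(transforms[1:], generators[1:]):
        hidden = transform(hidden, z)  # z is re-injected at every branch
        samples.append(generator(hidden))
    return samples


# Toy stand-ins: list concatenation plays the role of "concat",
# and summing the hidden feature plays the role of a generator.
toy_transforms = [
    lambda z: [2.0 * v for v in z],
    lambda h, z: h + z,
    lambda h, z: h + z,
]
toy_generators = [sum, sum, sum]
samples = run_branches([1.0, 2.0], toy_transforms, toy_generators)
# samples == [6.0, 9.0, 12.0]
```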
D and G > trained by minimizing cross-entropy losses
- x_i is a real image at the i-th scale, s_i is the image generated by G_i.
- L_(G_i) is the loss function for approximating the image distribution at the i-th scale.
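A numeric sketch of these cross-entropy losses, assuming each discriminator D_i outputs a probability in (0, 1) and that a batch mean stands in for the expectation. The function names are hypothetical, and the generator loss is written in the non-saturating form -log D_i(s_i).

```python
import math


def discriminator_loss(real_probs, fake_probs):
    """-E[log D_i(x_i)] - E[log(1 - D_i(s_i))] for one scale."""
    real_term = -sum(math.log(p) for p in real_probs) / len(real_probs)
    fake_term = -sum(math.log(1.0 - p) for p in fake_probs) / len(fake_probs)
    return real_term + fake_term


def generator_loss(fake_probs):
    """-E[log D_i(s_i)] for one scale."""
    return -sum(math.log(p) for p in fake_probs) / len(fake_probs)


def total_generator_loss(fake_probs_per_scale):
    """Sum of the per-scale generator losses, optimized jointly."""
    return sum(generator_loss(probs) for probs in fake_probs_per_scale)
```

A discriminator that cannot tell real from fake outputs 0.5 everywhere, which gives a discriminator loss of 2 log 2 and a per-scale generator loss of log 2.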
The motivation for StackGAN-v2 is that modeling the data distributions at multiple scales helps training:
- if any one of those model distributions shares support with the real data distribution at that scale, the overlap could provide good gradient signal to expedite or stabilize training of the whole network at multiple scales.
  - i.e., the benefit is a good gradient signal that speeds up or stabilizes training.
  - moving to higher scales, progressively more detailed shapes, colors, and structures are added during training. > the final result is a high-resolution image.

Joint conditional and unconditional distribution approximation

To handle conditional image generation,
- the discriminator judges whether an (image, conditioning variables) pair matches or not.
  - this guides the generator to produce fake images that follow the conditional image distribution.
In StackGAN-v2,
- for G: F_0 and F_i are converted to take the conditioning vector c as input: h_0 = F_0(c, z), h_i = F_i(h_(i-1), c).
- for F_i: the conditioning vector c replaces the noise vector z, which encourages the generators to draw images with more details according to the conditioning variables.
Consequently, multi-scale samples are generated by s_i = G_i(h_i). > each G_i generates a sample at its own scale.
Objective function of each D_i = unconditional loss + conditional loss
Loss function for G
- The generator G_i at each scale therefore jointly approximates unconditional and conditional image distributions
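Written out, the "unconditional loss + conditional loss" split looks as follows. This is a sketch reconstructed from the description above, with x_i a real image at scale i, s_i the generated sample, and c the conditioning vector; the notation p_{data_i} and p_{G_i} for the real and model distributions at scale i is assumed.

```latex
\mathcal{L}_{D_i} =
  \underbrace{-\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)]
              -\mathbb{E}_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i))]}_{\text{unconditional loss}}
  \;+\;
  \underbrace{-\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i, c)]
              -\mathbb{E}_{s_i \sim p_{G_i}}[\log(1 - D_i(s_i, c))]}_{\text{conditional loss}}

\mathcal{L}_{G_i} =
  \underbrace{-\mathbb{E}_{s_i \sim p_{G_i}}[\log D_i(s_i)]}_{\text{unconditional loss}}
  \;+\;
  \underbrace{-\mathbb{E}_{s_i \sim p_{G_i}}[\log D_i(s_i, c)]}_{\text{conditional loss}}
```

The unconditional terms ask whether an image looks real at all; the conditional terms ask whether the image and the condition match.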

Color-consistency regularization

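Color-consistency regularization improves the quality of the generated images.
- It is introduced so that samples generated from the same input by different generators keep more consistent colors.
The generated image is represented by its pixels x_k = (R, G, B)^T.
- mean μ = (1/N) sum_k x_k and covariance Σ = (1/N) sum_k (x_k - μ)(x_k - μ)^T of the pixels
  - N is the total number of pixels
The color-consistency regularization minimizes the differences in mean and covariance between different scales, to keep the colors consistent across scales.
- μ_(s_i^j) and Σ_(s_i^j) are the mean / covariance of the j-th sample generated by the i-th generator
- the weights balancing the mean term and the covariance term are set empirically
- the samples at each scale come from the m generators of StackGAN-v2
- L_(C_i) is added to the loss function of the i-th generator in Eq. (10) / Eq. (12).
Final loss for training the i-th generator: L'_(G_i) = L_(G_i) + α * L_(C_i)
- α = 0 means the color-consistency term is switched off; it is not needed for the text-to-image synthesis task, which already has a stronger constraint.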
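The pixel statistics and the regularization term can be sketched for a single pair of samples from adjacent scales. This is an illustration under assumptions: the function names are hypothetical, the weights `lam_mean` and `lam_cov` are left as required arguments because their values are not given here, and the average over the batch is omitted.

```python
def color_stats(pixels):
    """Mean (length 3) and covariance (3x3) of a list of (R, G, B) pixels.

    Both are normalized by N, the total number of pixels.
    """
    n = len(pixels)
    mean = [sum(p[c] for p in pixels) / n for c in range(3)]
    cov = [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in pixels) / n
            for b in range(3)]
           for a in range(3)]
    return mean, cov


def color_consistency(pixels_hi, pixels_lo, lam_mean, lam_cov):
    """Weighted squared differences of color mean and covariance
    between a sample and the same sample at the previous scale."""
    mean_hi, cov_hi = color_stats(pixels_hi)
    mean_lo, cov_lo = color_stats(pixels_lo)
    mean_term = sum((a - b) ** 2 for a, b in zip(mean_hi, mean_lo))
    cov_term = sum((cov_hi[a][b] - cov_lo[a][b]) ** 2
                   for a in range(3) for b in range(3))
    return lam_mean * mean_term + lam_cov * cov_term
```

Because only the mean and covariance are compared, the two images may have different pixel counts, which is what allows the term to link samples of different resolutions.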
