The paper uses PGGAN and StyleGAN to compare latent spaces from several angles. These notes cover only StyleGAN.

InterFaceGAN - Interpreting the Latent Space of GANs for Semantic Face Editing

Abstract

GANs are established image generators, but their latent representations are still poorly understood. The paper therefore introduces the InterFaceGAN framework, which interprets the disentangled face representation learned by GANs and explains how facial semantics are encoded in the latent space.

A GAN's latent space encodes multiple semantic attributes, and it contains several linear subspaces.
Each subspace is identified and used to adjust facial attributes (without retraining the model).
The paper examines in detail how to adjust one semantic without affecting the others. (attribute types: gender, age, expression, eyeglasses)

1. Introduction

2. Related Work

Generative Adversarial Networks

Study on Latent Space of GANs

Semantic Face Editing with GANs

GAN Inversion

Image-to-Image Translation

3. Framework of InterFaceGAN

Investigate the semantic properties of the latent representation.
Build a face-editing pipeline from the identified semantics.

3.1. Semantics in Latent Space

An image is generated from a latent vector, and a scoring function measures how strongly each semantic appears in that image.

A GAN generator maps the latent space Z to the image space X, that is, it turns a vector into an image: g : Z -> X
Z : the d-dimensional latent space, usually sampled from the Gaussian distribution N(0, I_{d}).
X : the image space; each image x carries semantic information (gender, age, ... for a face model).

f_{S} : X -> S is the scoring function. It embeds an image in X into the m-dimensional semantic space S.

The semantic scores s are obtained as s = f_{S}(g(z))
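
The composition s = f_S(g(z)) can be sketched as below. This is only a toy: `g` and `f_S` are hypothetical random linear stand-ins (a real g would be a trained GAN generator and a real f_S a trained attribute predictor), and the sizes `n_pixels` and `m` are assumptions chosen to keep the sketch small.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pixels, m = 512, 64, 4      # latent dim, toy "image" size, number of semantics

# Hypothetical fixed weights standing in for trained networks.
G = rng.standard_normal((n_pixels, d)) / np.sqrt(d)
F = rng.standard_normal((m, n_pixels)) / np.sqrt(n_pixels)

def g(z):
    """Toy generator: latent code (d,) -> flattened 'image' (n_pixels,)."""
    return np.tanh(G @ z)

def f_S(x):
    """Toy scoring function: 'image' (n_pixels,) -> m semantic scores."""
    return F @ x

z = rng.standard_normal(d)       # z ~ N(0, I_d)
x = g(z)                         # image space X
s = f_S(x)                       # semantic space S: s = f_S(g(z))
```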

Single Semantic

For a single attribute, the strength of the semantic is measured by the distance between the latent vector and a hyperplane.

Hyperplane: the surface in the latent space that acts as the decision boundary for a binary semantic. (For example, a latent code that has not crossed the hyperplane is male, and one that has crossed it is female.)

Assume two latent codes z1 and z2 and a linear interpolation between them. Moving along this interpolation changes an attribute continuously. e.g., going from z1 to z2 changes only the eyeglasses attribute (no glasses -> glasses).

The distance from the hyperplane is used to express how strongly a vector carries the semantic.
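
A minimal sketch of this distance, assuming the hyperplane passes through the origin and is described by a unit normal n, so that the signed distance is n^T z. The vectors `n`, `z1`, and `z2` are made-up 3-d values for illustration only.

```python
import numpy as np

def signed_distance(n, z):
    """Signed distance from z to the hyperplane {z : n^T z = 0}."""
    n = n / np.linalg.norm(n)        # make sure the normal has unit length
    return float(n @ z)

n = np.array([1.0, 0.0, 0.0])        # hypothetical boundary normal
z1 = np.array([-2.0, 0.5, 1.0])      # on the negative side of the boundary
z2 = np.array([3.0, 0.5, 1.0])       # on the positive side of the boundary

# Walking linearly from z1 to z2 changes the distance steadily and
# crosses the boundary (distance 0) exactly once.
for t in np.linspace(0.0, 1.0, 5):
    z = (1 - t) * z1 + t * z2
    print(round(float(t), 2), signed_distance(n, z))
```

The sign tells which side of the boundary the code is on, and the magnitude tells how far it is from the boundary.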

Multiple Semantics

How the semantic score changes when there are several attributes.

With m different semantics, s becomes a vector with m components. Even though the score grows from 1 dimension to m dimensions, it can still be computed in one step with a matrix.

The goal in the paper is to adjust one semantic without affecting the values of the other semantics. For this, the semantic hyperplanes must be disentangled from one another. (The hyperplanes need to be orthogonal.)
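
The sketch below illustrates both points under stated assumptions: the m boundary normals are stacked as columns of a matrix `N` (hypothetical random unit vectors here), so all m distances come from one product N^T z, and moving along one normal leaks into another score by an amount proportional to the cosine between the two normals. The leak is zero only when the normals are orthogonal.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 512, 3

# Hypothetical boundary normals, one column per semantic, each of unit length.
N = rng.standard_normal((d, m))
N /= np.linalg.norm(N, axis=0, keepdims=True)

z = rng.standard_normal(d)
distances = N.T @ z                       # all m distances in one matrix product

# Move z along the first normal and see how every score reacts.
alpha = 2.0
shifted = N.T @ (z + alpha * N[:, 0])
change = shifted - distances              # equals alpha * cos(n_1, n_j) for each j
```

Here `change[0]` is alpha, while `change[1]` and `change[2]` show how much the other two semantics move as a side effect.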

3.2. Manipulation in Latent Space

How to find the semantics in the latent space and use them for image editing.

Single Attribute Manipulation

The case with a single hyperplane. The semantic is adjusted by changing the latent code's distance to the hyperplane.
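
A minimal sketch of this edit, assuming the boundary is given by a unit normal n and the edit is the linear move z_edit = z + alpha * n. The function name `edit`, the step size `alpha`, and the random `n` are illustrative assumptions.

```python
import numpy as np

def edit(z, n, alpha):
    """Move z by alpha along the unit normal n: z_edit = z + alpha * n."""
    n = n / np.linalg.norm(n)
    return z + alpha * n

rng = np.random.default_rng(0)
z = rng.standard_normal(512)
n = rng.standard_normal(512)         # hypothetical boundary normal
n /= np.linalg.norm(n)

z_edit = edit(z, n, alpha=3.0)
shift = n @ z_edit - n @ z           # the distance to the boundary grows by alpha
```

A positive alpha pushes the code toward the positive side of the boundary, and a negative alpha pushes it toward the other side.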

Conditional Manipulation

The case with several attributes. One semantic must be adjusted without affecting the others.

The method is as follows. Move the latent vector so that its distance to the plane with normal vector n2 stays constant while only its distance to the plane with normal vector n1 changes.
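
One way to realise this move is to project n2 out of n1 and walk along the result, i.e. along n1 - (n1^T n2) n2. The sketch below assumes unit normals; renormalising the projected direction is a choice made here for convenience, and the random `n1`, `n2` are hypothetical.

```python
import numpy as np

def conditional_direction(n1, n2):
    """Remove from n1 its component along n2, then renormalise."""
    n1 = n1 / np.linalg.norm(n1)
    n2 = n2 / np.linalg.norm(n2)
    direction = n1 - (n1 @ n2) * n2
    return direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
d = 512
n1 = rng.standard_normal(d)
n1 /= np.linalg.norm(n1)             # attribute to change
n2 = rng.standard_normal(d)
n2 /= np.linalg.norm(n2)             # attribute to preserve

z = rng.standard_normal(d)
z_edit = z + 3.0 * conditional_direction(n1, n2)

kept = n2 @ z_edit - n2 @ z          # about 0: distance to plane 2 is unchanged
moved = n1 @ z_edit - n1 @ z         # positive: distance to plane 1 increases
```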

Real Image Manipulation

InterFaceGAN enables semantic editing in the latent space of a GAN model. For a real image, GAN inversion is used to recover a latent vector that reconstructs the image, and that vector is then edited to produce the target image.
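
The invert, edit, regenerate loop can be sketched with optimisation-based inversion, which is one common form of GAN inversion. Everything here is a toy under loud assumptions: the "generator" is a hypothetical linear map so the code runs without a trained model, the "real image" is produced by that same map, and the loss is plain pixel-wise squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pixels = 16, 64

# Hypothetical linear "generator" standing in for a trained GAN.
G = rng.standard_normal((n_pixels, d)) / np.sqrt(n_pixels)

def g(z):
    return G @ z

def invert(x, steps=500):
    """Gradient descent on 0.5 * ||g(z) - x||^2, starting from z = 0."""
    lr = 1.0 / np.linalg.norm(G, 2) ** 2     # step size from the largest singular value
    z = np.zeros(d)
    for _ in range(steps):
        z -= lr * (G.T @ (g(z) - x))
    return z

z_true = rng.standard_normal(d)
x_real = g(z_true)                   # stands in for a real photo

z_hat = invert(x_real)               # 1) inversion: recover a latent code
n = np.zeros(d)
n[0] = 1.0                           # hypothetical attribute normal
x_edit = g(z_hat + 2.0 * n)          # 2) edit in latent space, 3) regenerate
```

With a real generator the inversion is much harder, and the recovered code only approximates the photo.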

3.3. Implementation Details

The paper selects five attributes: pose, smile (expression), age, gender, and eyeglasses. This part describes the overall pipeline for face editing.

1) Predict the semantics of synthesized images

To predict the attributes of synthesized images more reliably, a ResNet-50 network was trained on the CelebA dataset. This ResNet-50 network predicts smile, age, gender, and eyeglasses, and also extracts 5 landmarks (i.e., left eye, right eye, nose, left corner of mouth, right corner of mouth). The landmarks are used to derive pose, and each attribute is treated as a binary value learned with softmax cross-entropy loss.

2) Sample random images

500K images were sampled at random from the latent spaces of PGGAN and StyleGAN. The reasons are as follows.

To check that the samples are evenly distributed.
To obtain enough wearing-glasses samples (the CelebA-HQ dataset contains few).
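
A small sketch of the sampling step, assuming codes are drawn from N(0, I_d) with d = 512. The sample count is reduced from 500K to keep the sketch light, and the two summary statistics are just one simple way to check that the sample looks as expected.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, d = 5_000, 512          # the notes use 500K; fewer here for speed
codes = rng.standard_normal((num_samples, d))

# Sanity checks that the sample looks like N(0, I_d):
max_abs_mean = np.abs(codes.mean(axis=0)).max()   # should be close to 0
std_per_dim = codes.std(axis=0)                   # should be close to 1 everywhere
```

Each row of `codes` would then be passed through the generator to obtain one synthesized image.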

3) Find the semantic boundaries in the latent space

A pre-trained attribute prediction model was used to obtain attribute scores for the 500K synthesized images. The 500K images were then filtered by condition, and 30% of the images were kept.

A linear SVM was trained on the 512-d latent codes to predict the attribute scores.
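
The boundary-fitting step can be sketched as follows. This is a stand-in, not the paper's solver: a linear SVM is fitted with plain full-batch subgradient descent on the regularised hinge loss, the labels come from a hypothetical ground-truth direction `n_true` instead of a real attribute predictor, and the ambiguity filter and the 70/30 split are assumptions made for the example.

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-3, lr=0.1, epochs=300):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    X: (num, d) latent codes; y: (num,) labels in {-1, +1}.
    Returns w and b of the boundary w^T z + b = 0.
    """
    num, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = y * (X @ w + b) < 1             # samples violating the margin
        grad_w = lam * w - X[active].T @ y[active] / num
        grad_b = -y[active].sum() / num
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
d = 32                                           # toy size; the notes use 512-d codes
n_true = rng.standard_normal(d)
n_true /= np.linalg.norm(n_true)                 # hypothetical true direction

Z = rng.standard_normal((3000, d))
scores = Z @ n_true                              # stand-in for predictor scores
keep = np.abs(scores) > 0.5                      # hypothetical filter: drop ambiguous samples
Z, y = Z[keep], np.sign(scores[keep])

split = int(0.7 * len(Z))                        # hypothetical train/validation split
w, b = train_linear_svm(Z[:split], y[:split])

normal = w / np.linalg.norm(w)                   # unit normal of the learned boundary
val_acc = np.mean(np.sign(Z[split:] @ w + b) == y[split:])
cosine = normal @ n_true                         # agreement with the true direction
```

The unit vector `normal` is what the editing steps use as n.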

4. Interpreting Face Representation

This part applies InterFaceGAN to state-of-the-art GAN models to interpret their face representation.

4.1. Separability of Latent Space

Section 3.1 considered the case of a single hyperplane in the latent space. This part builds on that to find out which latent space is best suited for extracting attributes, through an experiment on how well an SVM classifier separates the semantics in each latent space.

4.1.1. PGGAN

4.1.2. StyleGAN

On the validation set, both the Z and W spaces work well.
The SVM separates the W space better than the Z space, so a generator built on the W space is expected to learn more varied semantics.

In conclusion, among the PGGAN latent space and the StyleGAN W and Z latent spaces, the StyleGAN W latent space appears to be the most suitable.

4.2. Semantics in Latent Space for Face Manipulation

This part examines the semantics found with InterFaceGAN.

4.2.1. PGGAN

4.2.2. StyleGAN

InterFaceGAN was applied to StyleGAN.

InterFaceGAN also works well on StyleGAN. Attributes were edited in both the W space and the Z space.
Trained on the FFHQ dataset, StyleGAN learned more varied semantics, and so it performed better than PGGAN trained on the CelebA-HQ dataset.
Some attributes were still found to affect one another. (In the second example of Fig. 8, eyeglasses appear as the person gets older.)
The W space behaves better than the Z space. The difference is especially clear under large manipulations.

4.3. Disentanglement Analysis and Conditional Manipulation

This part examines the latent representation in terms of disentanglement between semantics and evaluates the conditional manipulation approach.

4.3.1. Disentanglement Measurement

How well the semantics are encoded into the latent space.
The correlations that exist among multiple semantics.
How to test the decoupling of correlated attributes.

4.3.2. PGGAN

4.3.3. StyleGAN

Disentanglement Analysis

The StyleGAN model was trained on the FFHQ dataset. Boundary correlations were examined in StyleGAN's latent spaces W and Z.

Table 2 (a): in the FFHQ dataset, smile and gender are entangled.
The W space (Table 2 (c)) is more disentangled than the Z space (Table 2 (b)). (In the W space, most boundaries are orthogonal to one another.)
The semantic distribution of the W space does not match the real data. (This is because the W space follows a more complex distribution than a Gaussian.)
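
A correlation table of this kind can be sketched in two ways: cosine similarity between boundary normals, and Pearson correlation between attribute scores over many samples. The sketch below is an illustration under assumptions, not a reproduction of Table 2: the normals are hypothetical random unit vectors, the scores are plain linear projections of Gaussian codes, and the latent size is reduced to 64.

```python
import numpy as np

attributes = ["pose", "smile", "age", "gender", "eyeglasses"]
rng = np.random.default_rng(0)
d = 64                                            # toy size to keep the sketch light

# Hypothetical unit normals, one row per attribute.
normals = rng.standard_normal((len(attributes), d))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

# (1) Cosine similarity between boundaries: entry (i, j) = cos(n_i, n_j).
cosine = normals @ normals.T

# (2) Pearson correlation between attribute scores over sampled codes.
codes = rng.standard_normal((20_000, d))
scores = codes @ normals.T                        # (num_samples, num_attributes)
pearson = np.corrcoef(scores, rowvar=False)

for name, row in zip(attributes, cosine):
    print(f"{name:>10}", " ".join(f"{v:+.2f}" for v in row))
```

Off-diagonal entries near 0 indicate a disentangled pair, and entries far from 0 indicate an entangled pair. For Gaussian codes and linear scores the two matrices nearly agree.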
