Abstract

StyleGAN can generate photorealistic images, but generating images by controlling semantic attributes still gives poor quality.
This paper introduces a method for conditional exploration of the entangled latent space.
It studies two sub-problems, attribute-conditioned sampling and attribute-controlled editing. StyleFlow solves both with conditional continuous normalizing flows in the GAN latent space, conditioned on attribute features.
The method is evaluated on the face and car latent spaces of StyleGAN.

1. Introduction

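Giving users free control over high-quality generated images has long been a goal of computer graphics.
Traditionally this meant building detailed 3D models and decorating them by hand with objects or textures. Controlling the attributes of a photorealistic image, however, is still a hard problem.
GANs offer an alternative.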

2. Related work

Generative Adversarial Network Architecture

Generative adversarial network (GAN)
A generator and a discriminator are trained in competition with each other.

Conditional GANs

A GAN that takes additional conditional information as input
The extra input can be a class label or data from another modality.
G : x, z -> y (input image : x / randomly sampled vector : z / output image : y)
Here the cGAN uses an image as the conditioning information.
Examples include pix2pix, BicycleGAN, pix2pixHD, SPADE, and MaskGAN.

Applications of Conditional GANs

A great tool for semantic image manipulation
They can change features such as a person's hair or eyes, or turn a sketch into a color image.
FaceShop and SC-FEGAN are examples (interestingly, these papers have the region to edit specified in advance).
PSGAN (makeup transfer) and GANs with hair editing also exist.

Image Editing by Manipulating Latent Codes

The goal is to edit images by modifying the latent code.
Early work computed the difference between the latent vector of a person with a mustache and that of a person without one, then applied it to another person.
StyleRig transfers face rigging information to perform face manipulation in the StyleGAN latent space. It reports better results than other papers.
Several works perform linear manipulation in the StyleGAN latent space, searching for latent-space vectors that give meaningful edits (InterFaceGAN, GANSpace).
This paper uses a non-linear model in the latent space and obtains better disentangled results than those works.
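
The linear editing idea mentioned above (difference of latent vectors, as in the mustache example) can be sketched as follows. This is a minimal illustration, not code from any of the cited papers; the function names and the use of group means are assumptions made here.

```python
import numpy as np

def attribute_direction(latents_with, latents_without):
    """Linear edit direction: mean latent with the attribute minus mean latent without it."""
    return latents_with.mean(axis=0) - latents_without.mean(axis=0)

def apply_linear_edit(latent, direction, strength=1.0):
    """Move one latent vector along the direction; strength scales the edit."""
    return latent + strength * direction

# Toy data: 3 latents "with" the attribute, 3 "without" (4-dimensional for brevity).
with_attr = np.full((3, 4), 2.0)
without_attr = np.zeros((3, 4))
direction = attribute_direction(with_attr, without_attr)
edited = apply_linear_edit(np.zeros(4), direction, strength=0.5)
```

Because the same direction is added everywhere in the latent space, such an edit can change unrelated attributes too, which is the limitation the non-linear approach targets.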

Embedding Images into the GAN Latent Space

image to vector
There are three main techniques.
1. Train an encoder network that embeds an image into the latent space.
2. Use an optimization algorithm that refines a latent code until it reproduces the output image well.
3. Combine the two: the encoder gives an approximate embedding vector, and the optimization algorithm then refines it. (Presumably this counts as a third technique because the two parts are trained separately.)
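
The hybrid scheme in item 3 can be illustrated with a toy problem. The sketch below assumes a linear "generator" (a random matrix) and an imperfect stand-in "encoder"; both are hypothetical and only show the encode-then-refine pattern, not StyleGAN itself.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(32, 8))          # toy generator weights (assumption): image = M @ w

def generate(w):
    return M @ w

def encode(image):
    """Stand-in for a trained encoder: a deliberately imperfect inverse."""
    return 0.8 * (np.linalg.pinv(M) @ image)

def refine(w, image, steps=200):
    """Gradient descent on the reconstruction error ||generate(w) - image||^2."""
    lr = 0.5 / np.linalg.norm(M, 2) ** 2   # step size chosen so the iteration is stable
    for _ in range(steps):
        grad = 2.0 * M.T @ (generate(w) - image)
        w = w - lr * grad
    return w

target = generate(rng.normal(size=8))
w_init = encode(target)                    # technique 1: approximate embedding
w_refined = refine(w_init, target)         # technique 2: optimization, started from w_init
err_init = np.linalg.norm(generate(w_init) - target)
err_refined = np.linalg.norm(generate(w_refined) - target)
```

Starting the optimization from the encoder output rather than from a random code is what makes the combination cheaper than pure optimization.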

Neural Rendering

Neural rendering refers to generating an image from a scene description using a neural network.

3. Overview

Sub tasks

attribute-conditioned sampling : obtain high-quality, realistic images that have the target attributes
attribute-controlled editing : obtain an edited version of a given image

Generating realistic image

StyleGAN is used to generate realistic images, so that high-quality images with the target attributes can be produced.
Sampling is done from StyleGAN and StyleGAN2.
1. Sample z (in R^{512}) from the StyleGAN latent space and embed it into W space through a non-linear mapping.
2. Generate an image (R^{3x1024x1024}) from the embedded w vector. The z vector is sampled from a multivariate normal distribution, and the w vector is used to control normalization at 18 different places in the StyleGAN2 generator network.
  - W uses the same vector at all 18 places, while W+ uses a different vector at each of the 18 places.
  - StyleFlow is trained in the W latent space, and W+ is used when editing real images.
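
The difference between W and W+ is mostly a difference in shape. A small sketch, assuming the 18 x 512 layout described above (the helper name is hypothetical):

```python
import numpy as np

NUM_LAYERS, DIM = 18, 512

def w_to_wplus(w):
    """Lift a W code (512,) to W+ (18, 512) by repeating it for every layer."""
    return np.tile(w, (NUM_LAYERS, 1))

w = np.random.default_rng(0).standard_normal(DIM)
wplus = w_to_wplus(w)        # all 18 rows identical: still a point of W
wplus_edit = wplus.copy()
wplus_edit[4] += 0.1         # rows now differ: a point of W+ that is not in W
```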

Measure attribute of any image

A class-specific attribute function A is used; it is a typical classifier network.
Its output is A(I) = a (a is a vector with one entry per attribute class).
For human faces, the paper uses 17 attribute values.

Solving the first task(attribute-conditioned sampling)

A z vector is sampled and passed through a learned mapping function ϕ(z, a) (a : target attribute).
The output is an intermediate weight vector w computed from z and a.
When StyleGAN decodes them, these weights produce image samples that match the target attributes.
An image I_0 is projected into StyleGAN space to obtain w_0.
The projection methods of (Abdal et al. 2019; Karras et al. 2019) are used so that I(w_0) is close to I_0.
Section 5 explains this mapping in detail, in particular how the network is trained as a conditional continuous normalizing flow (CNF).

Solving the second task(attribute-controlled editing)

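Given an image I_0, project it into StyleGAN space to obtain w_0, such that the image regenerated from w_0 is close to the original.
Recall the goal: edit the current image so that it has the attributes the user wants (a_t). For this, a latent w_t must be found whose image I_t has attributes a_t.
To find the z_0 that produces w_0, an inverse lookup z_0 = ϕ^{-1}(w_0, a_0) is used, where a_0 are the attributes of the original image. It is implemented by running the CNF network in reverse order.
Finally, the same CNF network, run forward with the target attributes, gives the edited image I_t.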

Full expression
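
Putting the steps above together, the editing pipeline can be written compactly as below. This is a reconstruction from the surrounding description rather than a copy of the paper's equation; a_0 (the attributes measured on the original image) is notation assumed here.

```latex
\begin{align*}
a_0 &= A\bigl(I(w_0)\bigr) && \text{attributes of the projected image} \\
z_0 &= \phi^{-1}(w_0, a_0) && \text{inverse lookup (CNF run in reverse)} \\
w_t &= \phi(z_0, a_t)      && \text{forward pass with the target attributes} \\
I_t &= I(w_t)              && \text{edited image decoded by StyleGAN}
\end{align*}
```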

4. Normalizing Flows

A normalizing flow is usually realized as a sequence of invertible transformations. It maps an unknown distribution to a known distribution.
The inverse mapping is obtained simply by running the transformations in reverse.

4.1. Discrete Normalizing Flows
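
For reference, the standard form of a discrete normalizing flow is sketched below: K invertible maps are composed, and the density follows from the change-of-variables rule. The symbols f_k, z_k and K are generic notation assumed here, not taken from these notes.

```latex
\begin{align*}
z_K &= f_K \circ f_{K-1} \circ \dots \circ f_1(z_0), \qquad z_0 \sim p(z_0) \\
\log p(z_K) &= \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|
\end{align*}
```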

4.2. Continuous Normalizing Flows(CNF)

Normalizing flows can be generalized to a continuous formulation using neural ODEs.
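
In the continuous formulation the state follows dz/dt = f(z, t) and the log-density changes at the rate -Tr(∂f/∂z). The sketch below integrates both with plain Euler steps for a toy linear field f(z) = A z; the matrix A and the fixed-step solver are assumptions for illustration (practical CNFs use a neural network for f and an adaptive ODE solver).

```python
import numpy as np

A = np.array([[0.3, -0.2],
              [0.1,  0.5]])    # toy linear vector field f(z) = A @ z (assumption)

def integrate(z0, steps=1000, reverse=False):
    """Euler-integrate the state and the change in log-density over t in [0, 1]."""
    dt = 1.0 / steps
    sign = -1.0 if reverse else 1.0
    z = np.array(z0, dtype=float)
    delta_logp = 0.0
    for _ in range(steps):
        z = z + sign * dt * (A @ z)
        # For a linear field the Jacobian is A, so its trace is constant.
        delta_logp += -sign * dt * np.trace(A)
    return z, delta_logp

z0 = np.array([1.0, -0.5])
z1, dlogp_fwd = integrate(z0)                   # base sample -> data
z_back, dlogp_bwd = integrate(z1, reverse=True) # data -> base sample
```

Running the same field backwards in time recovers the starting point up to discretization error, which is what makes the inverse lookup used for editing possible.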

5. Method

The paper works with w latent vectors (512 dimensions) in the W space of StyleGAN1/2.
The focus is on a conditional mapping between two latent domains.
A semantic mapping between the two domains is learned and then used for realistic editing applications.

5.1 Dataset preparation

The workflow needs the following to be prepared.

10K samples drawn from the Gaussian Z space of StyleGAN 1/2
A model with a disentangled W space (used with truncation = 0.7) (the paper recommends StyleGAN; otherwise low-quality images result).
The w latent vectors are given to the StyleGAN 1/2 generator as input (generating images from w latent vectors gives better results).
To condition on image attributes, a face classifier network 'A' is used (input : image, output : a_t / presumably one score per attribute).
The attributes a_t are obtained with the Microsoft Face API, which returns a range of attributes for a given face image.
main attribute : gender, pitch, yaw, eyeglasses, age, facial hair, expression, baldness
For the lighting attributes the DPR model is used. It takes one image as input and outputs a 9-dimensional vector.
In total, a_t has 17 dimensions (main attributes : 8 / lighting attributes : 9).
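
Assembling the 17-dimensional condition vector amounts to concatenating the 8 main attribute scores with the 9 lighting coefficients. A minimal sketch; the ordering of the attributes and the helper name are assumptions, and the scores here are made-up placeholders rather than real classifier outputs.

```python
import numpy as np

MAIN_ATTRIBUTES = ("gender", "pitch", "yaw", "eyeglasses",
                   "age", "facial_hair", "expression", "baldness")
NUM_LIGHTING = 9

def build_attribute_vector(main_scores, lighting):
    """Concatenate 8 main attribute scores and 9 lighting values into a_t (17,)."""
    main = np.array([main_scores[name] for name in MAIN_ATTRIBUTES], dtype=float)
    lighting = np.asarray(lighting, dtype=float)
    if lighting.shape != (NUM_LIGHTING,):
        raise ValueError("expected a 9-dimensional lighting vector")
    return np.concatenate([main, lighting])

scores = {name: 0.0 for name in MAIN_ATTRIBUTES}
scores["age"] = 35.0                      # placeholder value
a_t = build_attribute_vector(scores, np.zeros(NUM_LIGHTING))
```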

5.2 Attribute-translation Model

Base model

The conditional continuous normalizing flow function ϕ is implemented with a gate-bias modulation network.
This model applies a per-dimension scaling factor (gate) and bias to the input tensor of the network (the network learns to make the desired edit to the input code).
The model builds on FFJORD and can also handle image attributes given as 2D/3D tensors, as in image-to-image translation tasks.

CNF(Conditional continuous Normalizing Flow) block

Used to feed condition information to the network.
1. Broadcast the time variable t so that it matches the dimensions of the attribute space (t -> B).
2. Concatenate B with the attribute variable a_t channel-wise.
3. The result is a new variable a_t+, which is fed to the network as the conditional attribute variable.
4. At inference time, linear interpolation in the attribute domain gives a smooth transition toward the target image.
5. For stable training, 4 stacked CNF functions and 2 moving batch norm functions are used (with 2-3 CNF functions, overfitting to the data was observed).
6. Convolutional / linear layers map a_t to the same shape as the input.
7. These are used to apply gate-bias modulation to the input tensor.
8. The final output tensor passes through a Tanh non-linearity and moves on to the next normalizing flow.
A key observation is that conditioning the flow on one attribute at a time yields entangled vector fields, so other, unwanted attributes change together during an edit. Instead, the flow network is trained with joint attribute learning.
Section 7 covers joint attribute learning for better editing quality.
Joint attribute training learns a stable conditional vector field for each attribute.
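
Steps 1-3 and 6-8 can be sketched as a single layer. The code below is an illustrative NumPy version under several assumptions: fully connected layers only, random untrained weights, and hypothetical names (`make_layer`, `gate_bias_layer`). It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(dim_in, dim_out, dim_ctx):
    """Random (untrained) parameters for one gate-bias modulated layer."""
    return {
        "W": rng.normal(scale=0.1, size=(dim_out, dim_in)),
        "b": np.zeros(dim_out),
        "W_gate": rng.normal(scale=0.1, size=(dim_out, dim_ctx)),
        "b_gate": np.zeros(dim_out),
        "W_bias": rng.normal(scale=0.1, size=(dim_out, dim_ctx)),
    }

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_bias_layer(p, x, t, a_t):
    # Steps 1-3: broadcast t over the batch and concatenate with a_t to form a_t+.
    ctx = np.concatenate([np.full((x.shape[0], 1), t), a_t], axis=1)
    h = x @ p["W"].T + p["b"]
    # Steps 6-7: linear layers map a_t+ to the shape of h, giving a gate and a bias.
    gate = sigmoid(ctx @ p["W_gate"].T + p["b_gate"])
    bias = ctx @ p["W_bias"].T
    # Step 8: Tanh non-linearity on the modulated tensor.
    return np.tanh(h * gate + bias)

layer = make_layer(dim_in=6, dim_out=6, dim_ctx=1 + 17)
x = rng.normal(size=(5, 6))
a = rng.normal(size=(5, 17))
out = gate_bias_layer(layer, x, 0.5, a)
out_other = gate_bias_layer(layer, x, 0.5, a + 1.0)
```

The same input x gives a different output when the attributes change, which is how the condition steers the vector field.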

5.3 Training Dynamics

Training aims to maximize the likelihood of the data w given a set of attributes a_t.
As an objective, this is the conditional log-likelihood log p(w | a_t).
Here N denotes the Gaussian probability density function.
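
In the usual continuous-normalizing-flow form, this objective can be written as below. This is the generic CNF likelihood written with assumed notation (f_θ for the conditional vector field, v(t) for the state along the flow), not a transcription of the paper's equation.

```latex
\begin{align*}
\frac{dv(t)}{dt} &= f_\theta\bigl(v(t), t, a_t\bigr), \qquad v(t_0) = z, \quad v(t_1) = w \\
\log p(w \mid a_t) &= \log \mathcal{N}(z;\, 0, I) - \int_{t_0}^{t_1} \mathrm{Tr}\!\left( \frac{\partial f_\theta}{\partial v(t)} \right) dt
\end{align*}
```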

Algorithm for training the joint conditional continuous normalizing flow. CNF settings: epochs : 10 / batch size : 5 / training speed : 1.07 ~ 2.5 iter/sec (see Table 4) / GPU : Nvidia Titan XP / parameters : 1128449 / final log-likelihood : -4327872 / inference time : 0.61 sec / tolerances : 1e-5 / Adam optimizer (learning rate : 1e-3)

6. Attribute-conditioned Sampling and Editing

The benefit of how the framework is formulated shows up in sampling.
The learned mapping between the two domains is used mainly to go between z and w. It is also used to manipulate vectors and to change them to different semantics.

6.1 Conditional Sampling

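Once the network is trained, samples are drawn from the w latent space that the continuous normalizing flow models from a Gaussian prior.
Set the desired attribute variable a_t and draw several z latent vectors. These vectors are passed through the trained conditional CNF network.
The learned vector field transforms them into latent vectors w (Section 7 shows results that reflect the given attributes).
The results look good, and the attributes change strongly with the condition.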
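
The sampling procedure (fix a_t, draw many z, push them through the flow) can be sketched with a stand-in flow. The conditional affine map below is an assumption chosen only because it is trivially invertible; its weights are random, not trained, and it is far simpler than the CNF in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_W, DIM_A = 8, 17
shift = rng.normal(size=(DIM_W, DIM_A))            # hypothetical weights
log_scale = 0.1 * rng.normal(size=(DIM_W, DIM_A))  # hypothetical weights

def toy_flow(z, a_t):
    """Stand-in for the trained flow: w = z * exp(s(a_t)) + b(a_t), invertible in z."""
    return z * np.exp(log_scale @ a_t) + shift @ a_t

def sample_conditioned(a_t, n):
    """Draw n Gaussian z vectors and map them to w under the fixed condition a_t."""
    z = rng.standard_normal((n, DIM_W))
    return toy_flow(z, a_t)

a_t = np.zeros(DIM_A)
a_t[3] = 1.0                                       # switch one attribute on
samples = sample_conditioned(a_t, n=5000)
```

All samples share the same condition, so they differ only through z; in the real system each resulting w would then be decoded by StyleGAN.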

6.2 Semantic Editing

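This section shows images that are semantically edited through the framework.
Vector manipulation is adaptive and is done by solving an ODE on the learned vector field.
Unlike previous methods (see the footnote in the paper), the semantic edit of a latent vector w stays inside W space, which gives stable editing results.
Results are shown in Section 7; the following subsections introduce the components of the editing framework.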

6.2.1 Joint Reverse Encoding(JRE)

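The first step of the StyleFlow framework for a semantic editing operation is JRE.
This step jointly encodes the variables w and a_t, running the flow in reverse to recover z_0.
The attributes come from the face classifier API described above.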

6.2.2 Conditional Forward Editing(CFE)

The second step is CFE.
z_0 is held fixed while the image I is semantically manipulated.
To change the semantics, the vector z_0 and the modified attributes a_t' are passed forward through the flow.
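
JRE followed by CFE can be sketched with a toy conditional vector field integrated by Euler steps. Everything below (the linear field, its random weights, the step count, the function names) is an assumption for illustration; the real system uses the trained CNF and an ODE solver.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_W, DIM_A = 4, 3
M = 0.1 * rng.normal(size=(DIM_W, DIM_W))   # toy dynamics (assumption)
C = rng.normal(size=(DIM_W, DIM_A))         # how attributes push the flow (assumption)

def field(v, a):
    """Toy conditional vector field dv/dt = M v + C a."""
    return M @ v + C @ a

def flow(v, a, reverse=False, steps=1000):
    """Euler integration over t in [0, 1]; reverse=True runs the flow backwards."""
    dt = (-1.0 if reverse else 1.0) / steps
    v = np.array(v, dtype=float)
    for _ in range(steps):
        v = v + dt * field(v, a)
    return v

def edit(w0, a0, a_target):
    z0 = flow(w0, a0, reverse=True)   # JRE: reverse pass with the source attributes
    return flow(z0, a_target)         # CFE: forward pass with the edited attributes

w0 = rng.normal(size=DIM_W)
a0 = np.array([1.0, 0.0, 0.5])
a_new = np.array([1.0, 1.0, 0.5])     # change only the second attribute
w_same = edit(w0, a0, a0)
w_edit = edit(w0, a0, a_new)
```

Editing with unchanged attributes returns (approximately) the original code, while changing one attribute moves w; z_0 stays fixed in both cases.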

6.2.3 Edit Specific Subset Selection

This is the third step of the StyleFlow editing framework.
The edited vector w' is applied in W+ space, at a subset of layers only, to keep the edit natural.

The following indices worked best: Light (7-11), Expression (4-5), Yaw (0-3), Pitch (0-3), Age (4-7), Gender (0-7), Remove Glasses (0-2), Add Glasses (0-5), Baldness (0-5) and Facial hair (5-7 and 10)
Two versions of editing are introduced.
Fast, approximate version : does not reproject at every step.
Slow but accurate version : every vector is reprojected.
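
Subset selection can be sketched as overwriting only the listed W+ rows with the edited vector. The sketch assumes the index ranges above are inclusive and zero-based; the dictionary keys and the function name are hypothetical.

```python
import numpy as np

# Layer indices per edit, taken from the list above (ranges assumed inclusive).
EDIT_LAYERS = {
    "light": list(range(7, 12)),
    "expression": [4, 5],
    "yaw": list(range(0, 4)),
    "pitch": list(range(0, 4)),
    "age": list(range(4, 8)),
    "gender": list(range(0, 8)),
    "remove_glasses": list(range(0, 3)),
    "add_glasses": list(range(0, 6)),
    "baldness": list(range(0, 6)),
    "facial_hair": [5, 6, 7, 10],
}

def apply_edit(wplus, w_edited, attribute):
    """Copy the edited w' into the selected layers only; other layers are untouched."""
    out = wplus.copy()
    out[EDIT_LAYERS[attribute]] = w_edited
    return out

wplus = np.zeros((18, 512))
w_edited = np.ones(512)
result = apply_edit(wplus, w_edited, "expression")
```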

7. Results

7.1 Datasets

FFHQ (1024x1024, high-quality face images)
LSUN-Car (512x384, car images)

7.2 Evaluation metrics

FID
face identity
edit consistency scores

7.2.1 FID

Takes both the quality and the diversity of the output samples into account. Test images are compared with edited images.
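
The distance behind FID is the Fréchet distance between two Gaussians fitted to feature vectors. The sketch below computes it for arbitrary feature arrays; real FID uses Inception network features, whereas the random features here are placeholders, and the function name is an assumption.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(cov_a + cov_b - 2 (cov_a cov_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

feats = np.random.default_rng(0).normal(size=(500, 4))   # placeholder features
same = frechet_distance(feats, feats)
shifted = frechet_distance(feats, feats + 3.0)           # same spread, shifted mean
```

Identical feature sets give a distance of zero, and a pure shift of the mean contributes the squared length of the shift.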

7.2.2 Face identity score

A face identity score is used to evaluate the quality of an edit and to quantify how much the face changes.
Given a pair of images (before edit, after edit), the Euclidean distance and cosine similarity between them are computed.
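
Both measures are straightforward once each image is represented as a vector. The sketch assumes the two images have already been turned into embedding vectors by some face recognition model; that model and the function name are assumptions.

```python
import numpy as np

def identity_scores(emb_before, emb_after):
    """Return (Euclidean distance, cosine similarity) between two embeddings."""
    euclidean = float(np.linalg.norm(emb_before - emb_after))
    cosine = float(emb_before @ emb_after
                   / (np.linalg.norm(emb_before) * np.linalg.norm(emb_after)))
    return euclidean, cosine

e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
same_pair = identity_scores(e1, e1)
different_pair = identity_scores(e1, e2)
```

A lower Euclidean distance and a cosine similarity closer to 1 both indicate that the edit kept the identity.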

7.2.3 Edit consistency score

7.3 Compared Methods

Several methods are used for comparison: Image2StyleGAN, InterFaceGAN, GANSpace, StyleRig
InterFaceGAN & Image2StyleGAN : retrained using StyleGAN2
GANSpace, StyleFlow : StyleGAN2
StyleRig : StyleGAN1

StyleFlow: Attribute-conditioned Exploration of StyleGAN-Generated Images using Conditional Continuous Normalizing Flows #14