1st

Title / Abstract / Figure


2nd

Introduction / Conclusion / Figure

Introduction

GANs have revolutionized image synthesis, and recent style-based generative models produce some of the most realistic synthetic images to date. The learned intermediate latent spaces of StyleGAN have also been shown to have disentanglement properties, which means a pretrained model can be used to perform a wide variety of image manipulations.

Harnessing StyleGAN's expressive power requires simple, intuitive interfaces for users. Existing methods for discovering semantic controls rely on manual examination, large amounts of annotated data, or pretrained classifiers. Other manipulation methods move along a direction in latent space using a parametric model, as in StyleRig and StyleFlow.

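Moreover, existing controls only offer semantic directions for attributes that were mapped in advance. Controlling a direction that has not been mapped requires a large amount of data.
The paper leverages the recently introduced Contrastive Language-Image Pre-training (CLIP) model. CLIP makes intuitive, text-based semantic image manipulation possible, which addresses the limitations above. The CLIP model was trained on 400 million image-text pairs collected from the internet. Because natural language can describe images across a very wide range of concepts, combining CLIP with the StyleGAN generator should be a good approach to image manipulation.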

The paper describes the following three ways of combining CLIP with StyleGAN.

1. Text-guided latent optimization, where the CLIP model is used as a loss network. This is the most versatile approach, but it requires a few minutes of optimization to apply a manipulation to an image.
2. A latent residual mapper, trained for a specific text prompt. Given a starting point in latent space (the latent vector of the input image), the mapper yields a local step in latent space.
3. A method that maps a text prompt into a direction in StyleGAN's style space. This gives control over both the manipulation strength and the degree of disentanglement.
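
The latent-optimization idea (the first approach) can be illustrated with a toy sketch. It is not the paper's implementation: the "generator" is the identity, the "CLIP loss" is a plain cosine distance to a hand-written text embedding, and the names `optimize_latent`, `lam`, and `lr` are hypothetical. It only shows the shape of the procedure, which is gradient descent on a latent code with a text-similarity term plus an L2 penalty that keeps the result close to the source latent.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def loss(w, w_src, text_emb, lam):
    # Toy stand-in for a CLIP loss: the "image embedding" is w itself
    # (identity generator and encoder, an assumption for illustration).
    clip_dist = 1.0 - cosine(w, text_emb)
    # L2 penalty keeps the edited latent near the source latent.
    l2 = sum((x - s) ** 2 for x, s in zip(w, w_src))
    return clip_dist + lam * l2

def numeric_grad(f, w, eps=1e-5):
    """Central-difference gradient, used here instead of autograd."""
    grad = []
    for i in range(len(w)):
        up = list(w)
        up[i] += eps
        dn = list(w)
        dn[i] -= eps
        grad.append((f(up) - f(dn)) / (2 * eps))
    return grad

def optimize_latent(w_src, text_emb, lam=0.1, lr=0.1, steps=200):
    """Plain gradient descent on the latent code, starting from w_src."""
    w = list(w_src)
    for _ in range(steps):
        g = numeric_grad(lambda v: loss(v, w_src, text_emb, lam), w)
        w = [x - lr * gi for x, gi in zip(w, g)]
    return w

w_src = [1.0, 0.0, 0.0]      # hypothetical source latent
text_emb = [0.0, 1.0, 0.0]   # hypothetical text embedding
w_edit = optimize_latent(w_src, text_emb)
```

In a real setup the loop would backpropagate through StyleGAN and the CLIP image encoder, which is why this approach is slow per image compared with the other two.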

The results demonstrate a wide range of semantic manipulations on human faces, animals, cars, and churches. The manipulations range from abstract to specific and from extensive to fine-grained.

Conclusion

The authors introduced three image manipulation methods, built by combining StyleGAN's powerful generator with CLIP's ability to encode visual concepts. They showed that these techniques enable a wide range of unusual image manipulations. They also demonstrated that CLIP provides fine-grained control over attributes, for example specifying a particular hairstyle, and that the proposed methods can manipulate images in a disentangled way.

3rd

Introduction / Related Work

Introduction

StyleGAN is widely used to obtain photorealistic images, but manipulating images using only StyleGAN's latent space is difficult.
One family of manipulation methods moves along a direction in latent space using a parametric model (e.g. StyleRig, StyleFlow).
However, these methods are limited to semantic directions for attributes that were mapped in advance.
To overcome this limitation, the paper combines CLIP with StyleGAN. CLIP enables intuitive, text-based semantic image manipulation, which opens up a wide range of image manipulations.

The paper describes the following three ways of combining CLIP with StyleGAN.

1. Text-guided latent optimization, where the CLIP model is used as a loss network. This is the most versatile approach, but it requires a few minutes of optimization to apply a manipulation to an image.
2. A latent residual mapper, trained for a specific text prompt. Given a starting point in latent space (the latent vector of the input image), the mapper yields a local step in latent space.
3. A method that maps a text prompt into a direction in StyleGAN's style space. This gives control over both the manipulation strength and the degree of disentanglement.
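
The second and third approaches can be sketched in a few lines. This is a hedged illustration, not the paper's code: `mapper_step`, `apply_global_direction`, `alpha`, and `beta` are hypothetical names, the mapper is a toy function rather than a trained network, and the thresholding rule (zeroing direction channels whose magnitude falls below `beta`) is an assumption about how a disentanglement control could work.

```python
def mapper_step(w, mapper):
    """Residual update: the mapper predicts a local step that is added to w."""
    residual = mapper(w)
    return [x + r for x, r in zip(w, residual)]

def apply_global_direction(s, delta_s, alpha, beta):
    """Move style code s along delta_s.

    alpha scales the manipulation strength; channels of delta_s with
    magnitude below beta are zeroed so fewer channels change
    (an assumed disentanglement control, for illustration only).
    """
    masked = [d if abs(d) >= beta else 0.0 for d in delta_s]
    return [x + alpha * d for x, d in zip(s, masked)]

# Toy mapper standing in for a network trained for one text prompt.
toy_mapper = lambda w: [0.1 * x for x in w]
w_new = mapper_step([1.0, 2.0], toy_mapper)

s = [0.0, 0.0, 0.0]
delta_s = [0.5, -0.05, 0.2]
s_loose = apply_global_direction(s, delta_s, alpha=2.0, beta=0.1)
s_strict = apply_global_direction(s, delta_s, alpha=2.0, beta=0.3)
```

With the higher threshold only the strongest channel moves, so the edit touches fewer style channels at the cost of a weaker overall change.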

Related Work

Vision and Language

Joint representations

Many studies learn cross-modal Vision and Language (VL) representations. These are widely used in tasks such as text-based image retrieval, image captioning, and visual question answering.

CLIP (Contrastive Language-Image Pre-training), introduced recently, can estimate the semantic similarity between a given text and an image. CLIP was trained on roughly 400 million text-image pairs collected from the internet, which allows it to classify a broad range of images.
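
As a minimal sketch of how a joint embedding space can score text against an image, the snippet below ranks captions by cosine similarity to an image embedding. The embeddings are hand-written 3-dimensional vectors and `rank_captions` is a hypothetical helper; a real CLIP model would produce the embeddings with its image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def rank_captions(image_emb, caption_embs):
    """Return captions sorted from most to least similar to the image."""
    scores = {c: cosine_similarity(image_emb, e) for c, e in caption_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical embeddings, made up for illustration.
image_emb = [0.9, 0.1, 0.2]
caption_embs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a car": [0.0, 1.0, 0.0],
    "a photo of a church": [0.1, 0.2, 1.0],
}
ranking = rank_captions(image_emb, caption_embs)
```

Picking the top-ranked caption is the basic idea behind using such a model as a classifier over arbitrary text labels.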

Text-guided image generation and manipulation

Text-guided image generation has been done by training a conditional GAN. Such a GAN is conditioned on text embeddings obtained from a pretrained encoder.

Latent Space Image Manipulation

StyleGAN's intermediate latent spaces have been shown to be disentangled and meaningful for image manipulation. Some methods learn the manipulation end-to-end, typically by encoding a given image into the latent representation of the manipulated image.
Other methods look for latent paths.

Image manipulation is usually performed in the W or W+ space. Wu et al. showed that StyleSpace S is more disentangled than the W and W+ spaces.
In this paper, the latent optimizer and the mapper operate in W+, while the global directions are found in S.
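
To make the difference between W and W+ concrete, here is a small sketch under stated assumptions: a W code is one vector shared by all synthesis layers, while a W+ code holds one vector per layer, so an edit can be restricted to a subset of layers. The layer count of 18 matches a 1024x1024 StyleGAN configuration, the latent dimension is shrunk to 4 for readability (512 in practice), and `to_w_plus` and `edit_layers` are hypothetical helpers.

```python
NUM_LAYERS = 18   # assumed layer count (1024x1024 StyleGAN configuration)
LATENT_DIM = 4    # shrunk for illustration; real latents use 512

def to_w_plus(w, num_layers=NUM_LAYERS):
    """Broadcast a single W vector into a W+ code (one copy per layer)."""
    return [list(w) for _ in range(num_layers)]

def edit_layers(w_plus, direction, layers, strength):
    """Shift only the chosen layers along a direction; others are untouched."""
    edited = [list(row) for row in w_plus]
    for i in layers:
        edited[i] = [x + strength * d for x, d in zip(edited[i], direction)]
    return edited

w = [0.1, 0.2, 0.3, 0.4]
w_plus = to_w_plus(w)
direction = [1.0, 0.0, 0.0, 0.0]   # hypothetical semantic direction
edited = edit_layers(w_plus, direction, layers=range(4, 8), strength=2.0)
```

Restricting an edit to a band of layers is one reason W+ is often preferred over W: coarse and fine layers can be changed independently.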


doublejy715 / Paper_review

[Skimming] StyleClip : Text-Driven Manipulation of StyleGAN Imagery #25
