High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Abstract

Note) The author's slides (linked PPT) are used as a reference.
conditional generative adversarial networks (conditional GANs)
semantic label (map) to realistic photo
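- Takes a semantic segmentation image and synthesizes a realistic image in which each labeled region is filled with matching real-world content.
Earlier work produced results at much lower resolution.
This work addresses that problem: 2048 × 1024 results = pix2pixHD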
- novel adversarial loss
- new multi-scale generator
- new discriminator architecture
- In addition, two extra features make interactive visual manipulation possible.
  - Object instance segmentation information is incorporated, so objects can be manipulated (removed / added / category changed).
  - A method is proposed for generating diverse results from the same input, which allows interactive editing.

Instance-Level Image Synthesis

The pix2pix Baseline

"Image-to-image translation with conditional adversarial networks", known as the "pix2pix" paper, is the baseline; the paper summarized here is pix2pixHD.
- That work proposed a conditional GAN for image-to-image translation.
  - training pairs (s_i, x_i) : s_i is a semantic label map and x_i is the real photo corresponding to s_i
  - G : U-Net network based
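  - D : patch-based fully convolutional network > from the FCN paper?
    - The input to D is the semantic label map concatenated channel-wise with the corresponding image
      - The corresponding image is either the matching real image or the output of G. This is clear from the accompanying figure.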
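
The discriminator input described above can be sketched in NumPy as follows. This is only an illustration: the one-hot encoding of the label map, the helper names `one_hot` and `discriminator_input`, and the toy sizes are assumptions of this sketch, not details taken from the notes.

```python
import numpy as np

def one_hot(label_map, num_classes):
    """Turn an (H, W) integer label map into a (num_classes, H, W) one-hot array."""
    classes = np.arange(num_classes)[:, None, None]
    return (classes == label_map[None]).astype(np.float32)

def discriminator_input(label_map, image, num_classes):
    """Channel-wise concatenation of the label map s and an image (real x or G(s))."""
    return np.concatenate([one_hot(label_map, num_classes), image], axis=0)

# Toy example: a 4 x 6 label map with 3 classes and RGB images of the same size.
rng = np.random.default_rng(0)
s = rng.integers(0, 3, size=(4, 6))
x_real = rng.random((3, 4, 6)).astype(np.float32)
x_fake = rng.random((3, 4, 6)).astype(np.float32)  # stand-in for G(s)

real_pair = discriminator_input(s, x_real, num_classes=3)
fake_pair = discriminator_input(s, x_fake, num_classes=3)
print(real_pair.shape)  # (6, 4, 6): 3 label channels + 3 image channels
```

Both pairs share the same label channels; only the image channels differ, which is what D has to judge.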
The problem with pix2pix is that it falls short when generating high-resolution images.
Improving Photorealism and Resolution
Coarse-to-fine generator
- The term coarse-to-fine was used a lot around Viola-Jones face detection, and it is common in tracking as well.
- In general it means locating the target object roughly first and then refining the details. Presumably the same applies here: translate roughly first, then do a more detailed translation afterwards.
- G consists of two sub-networks, G1 and G2
  - G1 : global network
    - based on "Perceptual losses for real-time style transfer and super-resolution"
    - low resolution (half size) : 1024 × 512
  - G2 : local enhancer network
    - high resolution : 2048 × 1024
    - G2 has two parts: the first part receives the same label map as G1 (at full resolution), and the second part receives the element-wise sum of the G1 output and the first part's output
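
To make the data flow concrete, here is a shape-level sketch of the coarse-to-fine idea. Every "network" below is a placeholder (pooling, tiling, upsampling) chosen only so that the tensor shapes line up; the function names, channel count, and toy resolution are assumptions, and nothing here is a trained or faithful pix2pixHD layer.

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def g1_features(s_half, channels):
    """Placeholder for G1: its output feature map at half resolution."""
    return np.tile(s_half.mean(axis=0, keepdims=True), (channels, 1, 1))

def g2_first_part(s_full, channels):
    """Placeholder for the first part of G2: full-resolution input to half-resolution features."""
    return np.tile(avg_pool2(s_full).mean(axis=0, keepdims=True), (channels, 1, 1))

def g2_second_part(features):
    """Placeholder for the second part of G2: back to full resolution, 3 output channels."""
    return upsample2(features[:3])

def coarse_to_fine(s_full, channels=8):
    s_half = avg_pool2(s_full)            # G1 works on the half-size input
    f1 = g1_features(s_half, channels)    # G1 features, half resolution
    f2 = g2_first_part(s_full, channels)  # first part of G2, also half resolution
    fused = f1 + f2                       # element-wise sum fed to the second part of G2
    return g2_second_part(fused)

s = np.random.default_rng(0).random((5, 8, 16))  # toy 5-channel label input, 8 x 16
out = coarse_to_fine(s)
print(out.shape)  # (3, 8, 16)
```

The element-wise sum only works because the first part of G2 brings its features down to the same spatial size as G1's features.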
Multi-scale discriminators
- To distinguish high-resolution real images from synthesized ones, the discriminator needs a large receptive field.
  - That normally requires a deeper network or a CNN with larger filters. > which needs a lot of memory
- So this work uses 3 multi-scale discriminators, D1, D2, D3
  - trained to differentiate real and synthesized images at the 3 different scales
  - coarse-to-fine generator
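
A rough sketch of the multi-scale idea: the same discriminator structure is applied to an image pyramid, so the copy that sees the coarsest scale effectively covers the largest area of the original image. The 2x2 average pooling used for downsampling and the `patch_discriminator` placeholder (one score per 4 × 4 patch) are assumptions made for illustration, not the paper's layers.

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on a (C, H, W) array; H and W must be even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def image_pyramid(x, num_scales=3):
    """Return [x, x downsampled by 2, x downsampled by 4, ...]."""
    scales = [x]
    for _ in range(num_scales - 1):
        scales.append(avg_pool2(scales[-1]))
    return scales

def patch_discriminator(x):
    """Placeholder D_k: one score per non-overlapping 4x4 patch (not a trained network)."""
    c, h, w = x.shape
    return x.reshape(c, h // 4, 4, w // 4, 4).mean(axis=(0, 2, 4))

# Toy (label map + image) pair with 6 channels at 32 x 64.
pair = np.random.default_rng(0).random((6, 32, 64))
pyramid = image_pyramid(pair)
for k, scaled in enumerate(pyramid, start=1):
    scores = patch_discriminator(scaled)
    print(f"D{k}: input {scaled.shape[1:]}, score map {scores.shape}")
```

In this toy setup a single patch score of D3 summarizes a 16 × 16 region of the original input, while a patch score of D1 covers only 4 × 4 pixels.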

Improved adversarial loss

Improves on Eq. 2 of the paper.
feature matching loss
- D_k^(i) : the feature extracted from the i-th layer of D_k
- T is the total number of layers
- N_i is the number of elements in each layer
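
For reference, the feature matching loss that these symbols belong to can be written as below. The formula is reconstructed from the definitions above, with D_k^{(i)} denoting the i-th layer features of D_k, T the number of layers, and N_i the number of elements in layer i; treat it as a reading aid and check it against the paper.

```latex
\mathcal{L}_{FM}(G, D_k) =
  \mathbb{E}_{(s, x)} \sum_{i=1}^{T} \frac{1}{N_i}
  \left[ \left\| D_k^{(i)}(s, x) - D_k^{(i)}(s, G(s)) \right\|_1 \right]
```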
GAN loss
- The feature matching loss is similar to the loss used in VAE-GANs
- the discriminator feature matching loss is related to the perceptual loss > useful for image super-resolution and style transfer
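
A minimal NumPy sketch of how the feature matching term could be computed from discriminator features and combined with the GAN term. The function names, the toy feature shapes, and the default weight `lam=10.0` are assumptions of this sketch; the GAN losses are passed in as plain numbers rather than computed.

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Feature matching loss for one discriminator D_k.

    real_feats / fake_feats: lists with one array per layer i = 1..T holding
    D_k's intermediate features for (s, x) and (s, G(s)) respectively.
    """
    loss = 0.0
    for real, fake in zip(real_feats, fake_feats):
        # (1 / N_i) * L1 norm is the mean absolute difference of layer i.
        loss += np.abs(real - fake).mean()
    return loss

def generator_objective(gan_losses, fm_losses, lam=10.0):
    """GAN terms summed over the discriminators plus lam times the feature matching terms."""
    return sum(gan_losses) + lam * sum(fm_losses)

# Toy features for a discriminator with T = 2 layers.
real = [np.ones((2, 4, 4)), np.zeros((3, 2, 2))]
fake = [np.zeros((2, 4, 4)), np.full((3, 2, 2), 0.5)]
fm = feature_matching_loss(real, fake)
print(fm)  # 1.5 (layer 1 contributes 1.0, layer 2 contributes 0.5)

total = generator_objective(gan_losses=[0.7, 0.7, 0.7], fm_losses=[fm, fm, fm])
```

Because each layer is averaged over its own element count, a large early layer does not dominate a small late one.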

Using Instance Maps

Uses a semantic label map dataset, labeled at the pixel level.
- plus Boundary improvement > using the boundary map of Fig 4 b) as well gives better results.
  - input to D = the channel-wise concatenation = instance boundary map + semantic label map + real/synthesized image
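
The instance boundary map can be derived from an instance ID map with a simple neighbour comparison. The sketch below marks a pixel as 1 when its object ID differs from any of its 4-neighbours; the function name and the toy ID map are made up for illustration.

```python
import numpy as np

def instance_boundary_map(instance_ids):
    """1 where a pixel's object ID differs from any of its 4-neighbours, else 0."""
    ids = np.asarray(instance_ids)
    edge = np.zeros(ids.shape, dtype=bool)
    diff_h = ids[:, 1:] != ids[:, :-1]  # compare left/right neighbours
    diff_v = ids[1:, :] != ids[:-1, :]  # compare up/down neighbours
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h
    edge[1:, :] |= diff_v
    edge[:-1, :] |= diff_v
    return edge.astype(np.uint8)

# Two objects of the same class side by side: a semantic label map alone
# cannot separate them, but their instance IDs (1 and 2) differ.
ids = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [1, 1, 2, 2],
])
boundary = instance_boundary_map(ids)
print(boundary)
# [[0 1 1 0]
#  [0 1 1 0]
#  [0 1 1 0]]
```

The resulting single-channel map is what gets concatenated with the label map and the image.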

Multi-modal results using feature embedding

To generate diverse images and allow instance-level control,
- Additional low-dimensional feature channels are used as input to G.
- An encoder produces the low-dimensional features.
- instance-wise average pooling layer : average pooling applied to the encoder output, per instance
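
Instance-wise average pooling can be sketched as follows: for every object instance, the encoder features are averaged over that instance's pixels and the average is written back to all of those pixels. The function name and the toy arrays are assumptions for illustration; the encoder itself is not modeled.

```python
import numpy as np

def instance_wise_average_pool(features, instance_ids):
    """Replace each pixel's feature by the mean feature of the instance it belongs to.

    features: (C, H, W) encoder output; instance_ids: (H, W) integer object IDs.
    """
    pooled = np.zeros_like(features, dtype=np.float64)
    for obj_id in np.unique(instance_ids):
        mask = instance_ids == obj_id          # (H, W) pixels of this instance
        mean = features[:, mask].mean(axis=1)  # (C,) average feature
        pooled[:, mask] = mean[:, None]        # broadcast back over the instance
    return pooled

features = np.arange(8, dtype=np.float64).reshape(1, 2, 4)
ids = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
])
pooled = instance_wise_average_pool(features, ids)
print(pooled[0])
# [[2.5 2.5 4.5 4.5]
#  [2.5 2.5 4.5 4.5]]
```

Since every instance ends up with one feature vector, changing that vector for a single instance is enough to change how that object is rendered.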
