Everybody Dance Now - Githubissues

Abstract

A motion-transfer study built on the idea of "do as I do".
The problem is posed as per-frame image-to-image translation with spatio-temporal smoothing.
Pose detection serves as the intermediate representation between source and target; the model learns a mapping from pose images to the target subject's appearance.
This setup is adapted for temporally coherent video generation, including realistic face synthesis.
Demo video for this work: https://www.youtube.com/watch?v=PCBTZh41Ris&feature=youtu.be

METHOD OVERVIEW

Given = {source person's video, target person's video}
The goal is to generate a video of the target person performing the same motion as in the source person's video, as in Fig. 1.
To achieve this, the method uses three stages:
- pose detection: produce pose stick figures.
- global pose normalization: account for differences in body shape and position between the source and the target.
- mapping from normalized pose stick figures to the target subject: adversarial training learns to map normalized pose stick figures (at transfer time these come from the source's pose) to images of the target person.
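Overall training process
- pose detector P
- frame y from the original target video
- generate the pose stick figure x, x = P(y)
- During training, the (x, y) pairs are used to learn a generator G that maps a given pose stick figure x to a synthesized image of the target person.
- Two losses are used: the adversarial loss from the discriminator D, and a reconstruction loss dist computed with a pre-trained VGG network.
  - Together they push G(x) toward the ground truth target frame y.
  - D tries to tell real pairs (pose stick figure x, ground truth image y) from fake pairs (pose stick figure x, model output G(x)).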
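
The training setup above can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_pose_detector` stands in for P, `l1_dist` stands in for the VGG-based dist, and the log-based adversarial term and the `weight` parameter are choices made for this sketch, not details taken from the paper.

```python
import numpy as np

def toy_pose_detector(y):
    """Hypothetical stand-in for P: collapses a frame to one channel."""
    return y.mean(axis=-1, keepdims=True)

def l1_dist(a, b):
    """Hypothetical stand-in for the VGG-based reconstruction loss dist."""
    return float(np.mean(np.abs(a - b)))

def make_training_pairs(target_frames, pose_detector):
    """Build (x, y) pairs with x = P(y) for every target frame y."""
    return [(pose_detector(y), y) for y in target_frames]

def discriminator_inputs(x, y, g_x):
    """Return the real and fake inputs for D, concatenated along channels."""
    real = np.concatenate([x, y], axis=-1)    # (pose x, ground truth y)
    fake = np.concatenate([x, g_x], axis=-1)  # (pose x, model output G(x))
    return real, fake

def generator_loss(d_fake_score, g_x, y, dist, weight=1.0, eps=1e-8):
    """Adversarial term plus weighted reconstruction term (assumed form)."""
    adversarial = -np.log(d_fake_score + eps)  # small when D is fooled
    return float(adversarial + weight * dist(g_x, y))
```

Here `d_fake_score` is D's probability that the fake pair is real, so the loss shrinks both when D is fooled and when G(x) gets closer to y.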
Transfer setup: bottom of Fig. 3
- As in training, the pose detector P extracts the pose from a source frame y′ to produce a pose stick figure x′.
- However, the source subject may appear at a different position or scale than the subject in the target video (standing elsewhere in the frame, appearing smaller or larger, and so on).
- To resolve this:
  - The source pose has to be aligned with the target: the source's original pose x′ is transformed so that it is consistent with the poses x seen in the target video. The paper applies a "global pose normalization" Norm for this.
- The normalized pose stick figure x is then fed to the trained model G to obtain G(x), an image of the target person that corresponds to the original source image y′.
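
The transfer steps above amount to a simple chain of stages per frame. A minimal sketch, where the four stage functions (`pose_detector`, `normalize`, `render`, `generator`) are hypothetical callables supplied by the caller, not names from the paper's code:

```python
def transfer_frame(source_frame, pose_detector, normalize, render, generator):
    """Produce the target-person frame that matches one source frame y'."""
    source_pose = pose_detector(source_frame)  # P(y'): keypoints behind x'
    aligned_pose = normalize(source_pose)      # Norm: match the target's scale and position
    x = render(aligned_pose)                   # normalized pose stick figure x
    return generator(x)                        # G(x)

def transfer_video(source_frames, **stages):
    """Apply the per-frame transfer to every frame of the source video."""
    return [transfer_frame(frame, **stages) for frame in source_frames]
```

Keeping the stages as separate callables mirrors the notes: only `normalize` differs between training (not needed) and transfer.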

POSE ESTIMATION AND NORMALIZATION

Pose estimation
- A CPM-style pose estimator; OpenPose is used.
- A pre-trained pose detector P is required; it estimates the (x, y) coordinates of the joints.
- Its output is used to create pose stick figures like the one on the left of Fig. 2,
- by plotting the keypoints and drawing lines between connected joints.
- These stick figures are the input to G during training.
- During transfer, P estimates the pose of the source subject, which is then normalized as described in the next section (Section 4.2).
  - The normalized pose coordinates are used to create the pose stick figures that are fed to G.
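
Plotting keypoints and joining connected joints can be sketched with plain NumPy as below. The 5-joint `LIMBS` skeleton is made up for illustration (OpenPose uses a larger keypoint set), and the output is a single-channel binary image, which is a simplification chosen for this sketch.

```python
import numpy as np

# Hypothetical 5-joint skeleton: pairs of keypoint indices that are connected.
LIMBS = [(0, 1), (1, 2), (1, 3), (1, 4)]

def draw_stick_figure(keypoints, height, width, limbs=LIMBS):
    """Rasterize (x, y) keypoints and limb segments into a binary image."""
    canvas = np.zeros((height, width), dtype=np.uint8)
    pts = np.asarray(keypoints, dtype=float)
    for a, b in limbs:
        (x0, y0), (x1, y1) = pts[a], pts[b]
        # Roughly one sample per pixel along the longer axis of the segment.
        n = int(np.ceil(max(abs(x1 - x0), abs(y1 - y0)))) + 1
        xs = np.floor(np.linspace(x0, x1, n) + 0.5).astype(int)
        ys = np.floor(np.linspace(y0, y1, n) + 0.5).astype(int)
        # Drop samples that fall outside the canvas.
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        canvas[ys[inside], xs[inside]] = 1
    return canvas
```

For example, `draw_stick_figure([(2, 0), (2, 2), (0, 4), (4, 4), (2, 4)], 5, 5)` draws a short vertical segment with three segments fanning out below it.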
Global pose normalization
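- Subjects in different videos may have different limb proportions, or may stand closer to or farther from the camera.
- So the poses must be matched to each other, as in the Transfer section of Figure 3.
  - When transferring motion between two subjects, the source person's pose keypoints must be transformed so that they match the target's body shape and proportions.
- To find this transformation:
  - The heights and ankle positions of the subjects are analyzed, and a linear mapping between the closest and farthest ankle positions in the two videos is used.
  - In other words, reference points (ankle positions) are found in both videos and mapped onto each other to define the transformation.
  - After gathering these statistics, the scale and translation are computed for each frame from its corresponding pose detection.
- See Section 9 (Appendix) for details.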
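
The exact formulas are in the paper's appendix and are not reproduced in these notes, so the following is only one plausible reading of the description above: interpolate linearly between the far and close ankle positions, and likewise between the far and close height ratios. The dictionary keys (`far_y`, `close_y`, `far_h`, `close_h`), taking the lowest keypoint as the ankle, and scaling x about the mean are all assumptions of this sketch. Image coordinates are assumed to have y growing downward.

```python
import numpy as np

def ankle_fraction(ankle_y, far_y, close_y):
    """Position of the ankle between the farthest (0) and closest (1) extremes."""
    return (ankle_y - far_y) / (close_y - far_y)

def normalize_pose(keypoints, src, tgt):
    """Scale and translate source keypoints (N, 2) toward the target's statistics.

    src and tgt are dicts with ankle extremes far_y, close_y and the subject
    heights far_h, close_h measured at those extremes (hypothetical layout).
    """
    pts = np.asarray(keypoints, dtype=float)
    ankle_y = pts[:, 1].max()  # assumption: lowest keypoint in the image is the ankle
    f = ankle_fraction(ankle_y, src["far_y"], src["close_y"])
    # Translation: linear mapping between the two videos' ankle ranges.
    new_ankle_y = tgt["far_y"] + f * (tgt["close_y"] - tgt["far_y"])
    # Scale: interpolate the height ratio between the far and close extremes.
    far_scale = tgt["far_h"] / src["far_h"]
    close_scale = tgt["close_h"] / src["close_h"]
    scale = far_scale + f * (close_scale - far_scale)
    # Scale about the ankle vertically and about the mean x horizontally.
    out = pts.copy()
    out[:, 1] = (pts[:, 1] - ankle_y) * scale + new_ankle_y
    mean_x = pts[:, 0].mean()
    out[:, 0] = (pts[:, 0] - mean_x) * scale + mean_x
    return out
```

With a target that is twice as large in both range and height, a source ankle halfway between its extremes lands halfway between the target's extremes and the figure doubles in size.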

ADVERSARIAL TRAINING OF IMAGE TO IMAGE TRANSLATION

Based on pix2pixHD, modified to handle two additional requirements:
- temporally coherent video frames (temporal smoothing) and realistic face image synthesis.
Why add these two?
- Temporal smoothing: the output is a video rather than independent images, so the current frame has to connect smoothly to the previous one.
- Realistic face image: a Face GAN is added to make the face look more realistic. The face is one of the most important features, and a dance video looks far better when the expression from the original video carries over well onto the generated face.

pix2pixHD framework

multi-scale discriminators D = (D1, D2, D3)
L_FM(G, D) is the feature-matching loss computed on the discriminator D's features.
L_VGG(G(x), y) is the perceptual reconstruction loss.
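
For orientation, the feature-matching and perceptual reconstruction terms sit next to the adversarial term in an objective of the following shape. This follows the usual pix2pixHD formulation; the weights \lambda_{FM} and \lambda_{VGG} do not appear in these notes and are written here as assumed symbols.

```latex
\min_{G} \left( \left( \max_{D_1, D_2, D_3} \sum_{k=1}^{3} \mathcal{L}_{GAN}(G, D_k) \right)
  + \lambda_{FM} \sum_{k=1}^{3} \mathcal{L}_{FM}(G, D_k)
  + \lambda_{VGG} \, \mathcal{L}_{VGG}(G(x), y) \right)
```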
Temporal smoothing
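From G's side: the pose stick figure x_t for the current time t is concatenated with the previously generated frame G(x_{t-1}) and fed to G.
From D's side: (x_{t-1}, x_t, G(x_{t-1}), G(x_t)) > fake, (x_{t-1}, x_t, y_{t-1}, y_t) > real
- The four items in each tuple appear to be concatenated.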
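
The concatenations described above can be sketched as follows. Channel-wise concatenation, the `image_channels` parameter, and the zero image used before the first frame are assumptions of this sketch; `generator` is any callable standing in for G.

```python
import numpy as np

def generator_input(x_t, prev_output):
    """Concatenate the current pose x_t with the previous output G(x_{t-1})."""
    return np.concatenate([x_t, prev_output], axis=-1)

def generate_sequence(poses, generator, image_channels=3):
    """Run G frame by frame, feeding each output back in for the next frame."""
    h, w = poses[0].shape[:2]
    prev = np.zeros((h, w, image_channels))  # assumption: zero image before frame 0
    outputs = []
    for x_t in poses:
        prev = generator(generator_input(x_t, prev))
        outputs.append(prev)
    return outputs

def discriminator_input(x_prev, x_t, img_prev, img_t):
    """Stack (x_{t-1}, x_t, image_{t-1}, image_t) along channels for D.

    Pass the ground truth frames for the real case and G's outputs for the fake case.
    """
    return np.concatenate([x_prev, x_t, img_prev, img_t], axis=-1)
```

Because each output depends on the previous one, frames must be generated in order at transfer time.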
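Face GAN
A second GAN dedicated to making the face more realistic.
It takes only the region around the face as input, for both the image and the pose stick figure.
The face region of the pose stick figure, x_F, is concatenated with the face region of G(x), written G(x)_F, and passed to the face generator G_f, which outputs a residual r = G_f(x_F, G(x)_F). Adding this residual back to the face region of G(x) gives r + G(x)_F.
The face GAN loss therefore uses:
- (x_F, y_F) > real
  - face region of the input pose stick figure, face region of the ground truth target person image
- (x_F, r + G(x)_F) > fake
G_f is based on pix2pixHD but uses only its global generator network.
The face discriminator is a 70x70 PatchGAN discriminator.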
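
The residual step of the Face GAN can be sketched as below. The fixed square crop, the `face_center` argument (for example, taken from a head keypoint) and the lack of boundary handling are assumptions of this sketch; `face_generator` is any callable standing in for G_f.

```python
import numpy as np

def crop_face(image, center, half_size):
    """Return the square face region and the slices needed to paste it back."""
    cy, cx = center
    rows = slice(cy - half_size, cy + half_size)
    cols = slice(cx - half_size, cx + half_size)
    return image[rows, cols], (rows, cols)

def refine_face(g_x, x, face_center, face_generator, half_size=2):
    """Add the residual r = G_f(x_F, G(x)_F) to the face region of G(x)."""
    # Assumes the crop lies fully inside the image.
    g_face, (rows, cols) = crop_face(g_x, face_center, half_size)
    x_face, _ = crop_face(x, face_center, half_size)
    r = face_generator(np.concatenate([x_face, g_face], axis=-1))
    out = g_x.copy()             # leave the full-body output untouched elsewhere
    out[rows, cols] = g_face + r
    return out
```

Only the face crop changes; the rest of G(x) passes through unmodified, which is what makes the face generator a residual correction rather than a second full synthesis.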
The next figure shows experimental results when temporal smoothing and the Face GAN are combined.