MDM: Human Motion Diffusion Model

Paper : https://arxiv.org/pdf/2209.14916.pdf
GitHub : https://github.com/GuyTevet/motion-diffusion-model
Project page : https://guytevet.github.io/mdm-page/
YouTube : https://youtu.be/9MqPxlwx2CQ

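Introduction

The paper introduced here is MDM (Motion Diffusion Model), a classifier-free, diffusion-based generative model for human motion published in 2022. It is one of the many recent models built on diffusion: given a text prompt, it generates human motion automatically. A few days ago Kakao Brain announced FLAME, a similar line of work that is reported to outperform MDM, so I decided to read the MDM paper first. The notable design choice in this paper is that the model predicts the sample, rather than the noise, at each diffusion step. (For reference, diffusion models are commonly trained by adding noise to the data and learning to predict that noise.) Predicting the sample makes it easier to apply geometric losses defined on the locations and velocities of the motion, such as a foot contact loss. MDM is a generic approach that supports different conditioning modes and different generation tasks. Although it is trained with lightweight resources, it achieves SOTA on both the text-to-motion and action-to-motion tasks.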
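To make the point about geometric losses concrete, here is a minimal sketch of a velocity loss and a foot contact loss computed directly on a predicted clean motion. It is not the authors' code. It assumes the motion is already given as joint positions (no forward kinematics step), stored as plain nested lists, and that the per-frame contact flags are supplied by the caller; the function names are made up for illustration.

```python
# Sketch only: a motion is a list of frames, a frame is a list of joints,
# and a joint is [x, y, z]. All names here are illustrative assumptions.

def frame_diff(next_frame, frame):
    """Per-joint displacement between two consecutive frames."""
    return [[n - p for n, p in zip(jn, jp)] for jn, jp in zip(next_frame, frame)]

def sq_norm(frame_a, frame_b):
    """Sum of squared coordinate differences between two frames."""
    return sum((a - b) ** 2
               for ja, jb in zip(frame_a, frame_b)
               for a, b in zip(ja, jb))

def velocity_loss(x0, x0_hat):
    """Mean squared error between true and predicted frame-to-frame velocities."""
    n = len(x0)
    total = 0.0
    for i in range(n - 1):
        v_true = frame_diff(x0[i + 1], x0[i])
        v_pred = frame_diff(x0_hat[i + 1], x0_hat[i])
        total += sq_norm(v_true, v_pred)
    return total / (n - 1)

def foot_contact_loss(x0_hat, contact):
    """Penalize predicted movement of joints flagged as touching the ground.

    contact[i][j] is 1 if joint j is in contact at frame i, else 0 (assumed given).
    """
    n = len(x0_hat)
    total = 0.0
    for i in range(n - 1):
        vel = frame_diff(x0_hat[i + 1], x0_hat[i])
        total += sum(contact[i][j] * sum(c * c for c in vel[j])
                     for j in range(len(vel)))
    return total / (n - 1)

# One joint moving along x; the prediction overshoots in the last frame.
x0 = [[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]]]
x0_hat = [[[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]], [[3.0, 0.0, 0.0]]]
print(velocity_loss(x0, x0_hat))                       # 0.5
print(foot_contact_loss(x0_hat, [[1], [0], [0]]))      # 0.5
```

Both losses need an explicit motion to differentiate through, which is why they are awkward to apply when the network outputs noise instead of the sample.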

First, MDM is transformer-based instead of using a U-net backbone; the model is lightweight and fits the temporal, non-spatial nature of motion data (a collection of joints). It also supports several forms of conditioning and can perform three tasks: text-to-motion, action-to-motion, and unconditioned generation. On text-to-motion, MDM achieves SOTA on HumanML3D and KIT. According to the paper, training takes only about three days on a single mid-range GPU.

Motion Diffusion Model

The goal of the model is to synthesize human motion under an arbitrary condition. The condition can be any real-world signal that dictates the synthesis, such as audio, natural language (text-to-motion), or a discrete class (action-to-motion). Unconditioned motion generation is also possible and is denoted by the null condition $c = \emptyset$. The generated motion is a sequence of human poses, each represented by joint rotations or positions $x^i \in \mathbb{R}^{J \times D}$, where $J$ is the number of joints and $D$ is the dimension of the joint representation. MDM can work with motion represented by locations, rotations, or both.

Model : The model G is implemented as a straightforward transformer (Vaswani et al., 2017) with an encoder-only architecture. The transformer architecture is temporally aware, which allows it to learn motions of arbitrary length, and it is well proven for the motion domain. The noise time-step $t$ and the condition code $c$ are each projected to the transformer dimension by separate feed-forward networks and then summed to yield the token $z_{tk}$. Each frame of the noised input $x_t$ is linearly projected to the transformer dimension and summed with a standard positional embedding. $z_{tk}$ and the projected frames are fed to the encoder. Excluding the first output token (the one corresponding to $z_{tk}$), the encoder output is projected back to the original motion dimension and serves as the prediction $\hat{x}_0$. Text-to-motion is implemented by encoding the text prompt with CLIP, and action-to-motion by using a learned embedding per class.
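The data flow around the encoder can be sketched in plain Python as follows. This is only a shape-level illustration under several assumptions: the two feed-forward projections are reduced to a single linear layer each, the encoder is passed in as a callable (an identity function stands in for the transformer here), frames are flattened to vectors of length J*D, and all names (`mdm_forward`, `params`, `w_in`, ...) are hypothetical rather than taken from the official repository.

```python
import math
import random

def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def rand_matrix(rows, cols, rng):
    """Small random weights, a stand-in for learned parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def positional_embedding(pos, dim):
    """Sinusoidal embedding in the style of Vaswani et al. (2017)."""
    return [math.sin(pos / 10000 ** (k / dim)) if k % 2 == 0
            else math.cos(pos / 10000 ** ((k - 1) / dim))
            for k in range(dim)]

def mdm_forward(x_t, t_feat, c_feat, params, encoder):
    """x_t: N noised frames, each a flat list of J*D values. Returns N predicted frames."""
    dim = len(params["w_in"])
    # Time-step and condition are projected separately, then summed into z_tk.
    z_tk = add(linear(params["w_t"], t_feat), linear(params["w_c"], c_feat))
    # Each noised frame is projected linearly and gets a positional embedding.
    frames = [add(linear(params["w_in"], f), positional_embedding(i, dim))
              for i, f in enumerate(x_t)]
    out = encoder([z_tk] + frames)
    # Drop the first output token and project the rest back to motion space.
    return [linear(params["w_out"], tok) for tok in out[1:]]

rng = random.Random(0)
n_frames, motion_dim, model_dim, feat_dim = 4, 6, 8, 3
params = {
    "w_t": rand_matrix(model_dim, feat_dim, rng),
    "w_c": rand_matrix(model_dim, feat_dim, rng),
    "w_in": rand_matrix(model_dim, motion_dim, rng),
    "w_out": rand_matrix(motion_dim, model_dim, rng),
}
x_t = [[rng.gauss(0.0, 1.0) for _ in range(motion_dim)] for _ in range(n_frames)]
x0_hat = mdm_forward(x_t, [0.5] * feat_dim, [0.1] * feat_dim, params,
                     encoder=lambda tokens: tokens)  # identity stand-in encoder
print(len(x0_hat), len(x0_hat[0]))  # 4 6
```

The encoder sees N + 1 tokens, and the output has the same number of frames and the same per-frame size as the input, which is what lets the prediction be compared directly with the clean motion.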

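Sampling : Sampling from $p(x_0 | c)$ is done iteratively, following Ho et al. **At every time step $t$ the model predicts the clean sample $\hat{x}_0 = G(x_t, t, c)$ and noises it back to $x_{t-1}$. This is repeated from $t = T$ until $x_0$ is reached. The model G is trained with classifier-free guidance: in practice, G learns both the conditioned and the unconditioned distributions by randomly setting $c = \emptyset$ for 10% of the samples, so that $G(x_t, t, \emptyset)$ approximates $p(x_0)$.** When sampling G, the two variants can then be interpolated or even extrapolated to trade off diversity against fidelity, using $G_s(x_t, t, c) = G(x_t, t, \emptyset) + s \cdot (G(x_t, t, c) - G(x_t, t, \emptyset))$, where $s$ is the guidance scale.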
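The sampling procedure can be sketched roughly like this. The sketch assumes the standard classifier-free guidance rule (the unconditioned prediction plus a scale `s` times the difference between the conditioned and unconditioned predictions), represents the null condition as `None`, flattens the motion to a 1-D list, and uses a simplified re-noising step that draws $x_{t-1}$ directly from the predicted clean sample. The noise schedule `alphas_bar` and all function names are illustrative assumptions, not the official implementation.

```python
import math
import random

def drop_condition(c, rng, p_uncond=0.1):
    """Training-time step: replace c by None (null condition) with probability p_uncond."""
    return None if rng.random() < p_uncond else c

def guided_prediction(model, x_t, t, c, s):
    """Classifier-free guidance: uncond + s * (cond - uncond), elementwise."""
    uncond = model(x_t, t, None)
    cond = model(x_t, t, c)
    return [u + s * (k - u) for u, k in zip(uncond, cond)]

def sample(model, c, alphas_bar, dim, s, rng):
    """alphas_bar[t] is the cumulative signal level at step t, with alphas_bar[0] == 1."""
    T = len(alphas_bar) - 1
    x_t = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # pure noise at t = T
    for t in range(T, 0, -1):
        x0_hat = guided_prediction(model, x_t, t, c, s)  # predict the clean sample
        a_prev = alphas_bar[t - 1]
        # Noise the prediction back to level t - 1 (simplified assumption).
        x_t = [math.sqrt(a_prev) * h + math.sqrt(1.0 - a_prev) * rng.gauss(0.0, 1.0)
               for h in x0_hat]
    return x_t

def toy_model(x_t, t, c):
    """Stand-in denoiser: predicts all ones when conditioned, all zeros otherwise."""
    return [1.0 if c is not None else 0.0 for _ in x_t]

rng = random.Random(0)
alphas_bar = [1.0, 0.9, 0.6, 0.3, 0.05]  # hypothetical schedule, T = 4
print(sample(toy_model, "a person waves", alphas_bar, dim=3, s=2.5, rng=rng))
# [2.5, 2.5, 2.5]
```

With `s = 0` the result follows the unconditioned prediction, with `s = 1` the conditioned one, and `s > 1` extrapolates past the conditioned prediction, which is the regime that trades diversity for fidelity.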

Editing : Applying diffusion inpainting to motion data enables motion in-betweening in the temporal domain and body part editing in the spatial domain. Editing happens only during sampling, with no additional training. Given a subset of the motion sequence as input, each sampling iteration overwrites $\hat{x}_0$ with the input part of the motion. This encourages the generation to stay coherent with the original input while completing the missing parts. In the temporal setting, the prefix and suffix frames of the motion sequence are the input, and the model solves the motion in-betweening problem. Editing can be done conditionally or unconditionally (by setting $c = \emptyset$). In the spatial setting, the paper shows that the same completion technique can re-synthesize some body parts according to a condition $c$ while keeping the rest of the body intact.
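A minimal sketch of this overwrite step inside a sampling loop is shown below. It assumes a flattened 1-D motion, a boolean mask `known` marking the entries that come from the input (prefix and suffix frames for in-betweening, or the fixed joints for body part editing), and the same simplified re-noising step and hypothetical `alphas_bar` schedule as a plain sampler would use. The names are illustrative, not from the official code.

```python
import math
import random

def overwrite_known(x0_hat, x_input, known):
    """Keep the input value wherever known[i] is True, the prediction elsewhere."""
    return [inp if k else pred for pred, inp, k in zip(x0_hat, x_input, known)]

def inpaint_sample(model, x_input, known, c, alphas_bar, rng):
    """Sample a motion whose known entries are pinned to x_input at every iteration."""
    T = len(alphas_bar) - 1
    x_t = [rng.gauss(0.0, 1.0) for _ in x_input]
    for t in range(T, 0, -1):
        x0_hat = model(x_t, t, c)                            # predict the clean sample
        x0_hat = overwrite_known(x0_hat, x_input, known)     # pin the given part
        a_prev = alphas_bar[t - 1]                           # alphas_bar[0] == 1
        x_t = [math.sqrt(a_prev) * h + math.sqrt(1.0 - a_prev) * rng.gauss(0.0, 1.0)
               for h in x0_hat]
    return x_t

def toy_model(x_t, t, c):
    """Stand-in denoiser that always predicts 0.5 for every entry."""
    return [0.5 for _ in x_t]

# In-betweening: the first two and last two frames are given, the middle is generated.
x_input = [0.0, 0.1, 0.0, 0.0, 0.8, 0.9]
known = [True, True, False, False, True, True]
rng = random.Random(0)
print(inpaint_sample(toy_model, x_input, known, None, [1.0, 0.8, 0.4, 0.1], rng))
# [0.0, 0.1, 0.5, 0.5, 0.8, 0.9]
```

Because the overwrite happens on the predicted clean sample at every step, the generated part is denoised in the context of the pinned part, and no retraining is involved. Passing `None` as the condition corresponds to the unconditioned case.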

In the figure above, blue frames represent the motion input and bronze frames the generated motion. Motion in-betweening (left+center) can be performed by the same model either conditioned on text or without a condition. In addition, the lower-body joints can be fixed while the upper body is changed to match the input text prompt. In the figure below, when the text prompt "Throw a ball" is given, the lower body stays fixed and only the upper body moves. 👏🏻👏🏻👏🏻

Experiments

Text-to-motion
Action-to-motion

full version : https://eehoeskrap.tistory.com/690
