ChrisWu1997 / 2D-Motion-Retargeting

PyTorch implementation for our paper Learning Character-Agnostic Motion for Motion Retargeting in 2D, SIGGRAPH 2019
https://motionretargeting2d.github.io
MIT License

Questions about the network architecture #26

Closed catherineytw closed 4 years ago

catherineytw commented 4 years ago

Hi, I've been reading your paper and code very carefully recently, and I have to say: nice work! But I have a couple of questions about the network architecture.

  1. In the code, I noticed that whenever you encode the body and the view angle, you drop the last two rows (axis=1) of the input tensor, and I was confused about why. At first glance, I thought you had put the pelvis joint at the end of the joint tensor, but I couldn't find any evidence of that. Could you kindly explain it? Below is the code that baffled me:

        m1 = self.mot_encoder(x1)
        b2 = self.body_encoder(x2[:, :-2, :]).repeat(1, 1, m1.shape[-1])
        v3 = self.view_encoder(x3[:, :-2, :]).repeat(1, 1, m1.shape[-1])

  2. In common.py, you initialize the network input and output channel sizes as follows:

        self.mot_en_channels = [self.len_joints + 2, 64, 96, 128]
        self.body_en_channels = [self.len_joints, 32, 48, 64, 16]
        self.view_en_channels = [self.len_joints, 32, 48, 64, 8]
        self.de_channels = [self.mot_en_channels[-1] + self.body_en_channels[-1] + self.view_en_channels[-1],
                            128, 64, self.len_joints + 2]
    
        self.meanpose_path = './mixamo_data/meanpose_with_view.npy'
        self.stdpose_path = './mixamo_data/stdpose_with_view.npy'

I was wondering why the motion encoder needs two more input channels. Shouldn't it be the same as the body and view encoders, since the number of joints is the same?

Many thanks!

ChrisWu1997 commented 4 years ago

Hi @catherineytw,

Sorry for the late reply; I just saw this issue. Thanks for your interest in our paper. Regarding your questions:

  1. The last two dimensions of the input are the global velocity in the XY plane (check this line), which lets us represent the joint positions in local coordinates (the remaining dimensions of the input). Since we treat the skeleton and the view angle as static, time-independent properties, we simply drop the global velocity (the time-dependent part) when encoding the body and view angle. (A concrete sketch of the slicing follows after this list.)

  2. Basically the same reason as for the first question: the input to the motion encoder has two additional channels for the global velocity in the XY plane, while the body and view encoders take only the joint channels. Sorry we didn't make this clear in the paper. (The channel arithmetic is also sketched below.)
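
To make the first point concrete, here is a minimal sketch of the slicing. The joint count, sequence length, and batch size are made up for illustration; only the channel layout (local coordinates first, the two global-velocity channels last) reflects the actual input:

    import torch

    # Hypothetical sizes, for illustration only: J joints in 2D give 2*J
    # channels of local joint coordinates, plus 2 channels of global
    # velocity in the XY plane.
    J, T, batch = 15, 64, 4
    x = torch.randn(batch, 2 * J + 2, T)

    motion_input = x             # motion encoder sees the full, time-dependent input
    static_input = x[:, :-2, :]  # body/view encoders: drop the 2 global-velocity
                                 # channels, keeping only the local coordinates

    print(motion_input.shape)    # torch.Size([4, 32, 64])
    print(static_input.shape)    # torch.Size([4, 30, 64])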
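
And here is the channel arithmetic from common.py with a hypothetical joint count (not the repo's actual config), just to show where the +2 comes in and where it ends up:

    # Hypothetical: 15 joints in 2D -> len_joints = 30 (illustration only).
    len_joints = 30

    mot_en_channels  = [len_joints + 2, 64, 96, 128]  # +2 for global XY velocity
    body_en_channels = [len_joints, 32, 48, 64, 16]   # velocity channels dropped
    view_en_channels = [len_joints, 32, 48, 64, 8]    # velocity channels dropped
    de_channels = [mot_en_channels[-1] + body_en_channels[-1] + view_en_channels[-1],
                   128, 64, len_joints + 2]

    assert de_channels[0] == 128 + 16 + 8         # decoder input: concatenated codes
    assert de_channels[-1] == mot_en_channels[0]  # decoder restores the full input,
                                                  # global velocity included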

catherineytw commented 4 years ago

Thank you!! That definitely helps!!