Shimingyi / MotioNet

A deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video [ToG 2020]
https://rubbly.cn/publications/motioNet/
BSD 2-Clause "Simplified" License

Why the linear discriminator D works #16

Open · wbhu opened this issue 3 years ago

wbhu commented 3 years ago

Dear authors,

Thanks a lot for the amazing work and for sharing the code. According to Appendix A in the paper, "discriminator D is a linear component (similarly to Kanazawa et al. [2018]), with an output value between 0 and 1, containing two convolution layers and one fully connected layer". However, according to the last response in an issue on the code repository for Kanazawa et al. [2018], their discriminator does have activation functions.

I'm wondering why a linear discriminator can classify whether a rotation speed is natural or not, since from my point of view this classification is not trivial.

Best, Wenbo

Shimingyi commented 3 years ago

Dear Wenbo,

Thanks for your feedback. For the discriminator in our network, we just adopt the idea from HMR and replace the input with the rotation speed.

I checked their code again. HMR uses slim.conv2d, which has a default activation_fn parameter, rather than a pure conv2d, and I had overlooked that. But even with this linear discriminator we can still see an improvement in the rotation part, which means there must be some other reason. I will investigate further and commit a new version once it's ready, and also post updated experiments in this thread.
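
For reference, here is a rough PyTorch sketch of the two-conv-plus-FC discriminator structure described in Appendix A, with the activation added back in; the layer sizes and names are only illustrative, not the exact ones in the repository:

```python
import torch
import torch.nn as nn

class RotationSpeedDiscriminator(nn.Module):
    """Illustrative only: two 1D convolutions and one fully connected layer.

    Without the LeakyReLU calls, the stack collapses into a single affine
    map, which is the 'linear discriminator' issue raised in this thread.
    """
    def __init__(self, in_channels=4, hidden=32, window=16):
        super().__init__()
        self.conv1 = nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.act = nn.LeakyReLU(0.2)
        self.fc = nn.Linear(hidden * window, 1)

    def forward(self, x):            # x: (batch, 4, window) rotation differences
        h = self.act(self.conv1(x))  # the non-linearity is what was missing
        h = self.act(self.conv2(h))
        return torch.sigmoid(self.fc(h.flatten(1)))  # output between 0 and 1
```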

Best, Mingyi

wbhu commented 3 years ago

Thanks for your quick response.

longbowzhang commented 3 years ago

Hi @Shimingyi, nice work! But I have two additional questions about the joint rotation discriminator that I would appreciate your further explanation on, if possible.

  1. The quaternions output by the network are not normalized, and thus are not valid rotations. I am wondering why the forward finite difference is applied without normalization?
  2. I think you would like to use the discriminator to regularize the angular velocity of the joints. However, according to the math explained on Wikipedia, the forward finite difference is not equivalent to the angular velocity ω. So I am wondering about the reasoning behind the design of the discriminator?

Thanks a lot in advance.

longbowzhang commented 3 years ago

Hi @Shimingyi, just a follow-up comment. As you have mentioned in the paper:

our discriminator judges the realism of temporal sequences of angular velocities.

However, I think the finite difference of quaternions, which are on a manifold, cannot be used to approximate the angular velocity. This is in contrast to velocity approximation in Euclidean space where finite difference works.
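
To make this concrete, here is a small numpy sketch (not code from this repository; quaternions are assumed to be unit-norm, in [w, x, y, z] order) of how an angular velocity can be recovered from consecutive quaternions via the relative rotation, in contrast to the element-wise difference q_{t+1} - q_t:

```python
import numpy as np

def quat_conj(q):
    """Conjugate of a quaternion [w, x, y, z]."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def quat_mul(a, b):
    """Hamilton product of two quaternions [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def angular_velocity(q_t, q_next, dt):
    """omega ~= (axis * angle) / dt of the relative rotation q_next * conj(q_t)."""
    dq = quat_mul(q_next, quat_conj(q_t))   # relative rotation between frames
    dq /= np.linalg.norm(dq)
    angle = 2.0 * np.arctan2(np.linalg.norm(dq[1:]), dq[0])
    axis = dq[1:] / (np.linalg.norm(dq[1:]) + 1e-8)
    return axis * angle / dt                # a vector in R^3, unlike q_next - q_t
```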

Shimingyi commented 3 years ago

Hi @longbowzhang

I also found this problem in my latest experiments. The rotations from the CMU data are normalized but the predicted rotations are not, so modeling the distributions of these two datasets will confuse the network. We ran another experiment that fed Euler angles to the discriminator, where there is a normalization step, and we reached the same conclusion.
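
As an illustration of the fix (a minimal sketch, not the exact code in this repository), the predicted quaternions could be normalized before taking the temporal difference that is fed to the discriminator:

```python
import torch

def normalize_quaternions(q, eps=1e-8):
    """Project raw network outputs onto unit quaternions.

    q: (batch, frames, 4) unnormalized predictions for one joint. Normalizing
    before the temporal difference puts the predictions on the same unit-norm
    footing as the CMU ground-truth rotations.
    """
    return q / (q.norm(dim=-1, keepdim=True) + eps)

def rotation_diffs(q):
    """Forward finite difference of normalized quaternions along time."""
    q = normalize_quaternions(q)
    return q[:, 1:] - q[:, :-1]
```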

But because of the 'linear' problem, none of these conclusions are solid, so I will clarify all of them in a new commit and update this thread.

I agree with you: calling it 'velocity' doesn't make sense here, since there is no higher-level meaning to [q1 - q2]. I will try a new formulation to represent it.

Thanks very much for the feedback!

Best, Mingyi

longbowzhang commented 3 years ago

Hi @Shimingyi,

Thanks a lot for your fast reply. I am also curious about the motivation behind the Adversarial Rotation Loss section.

Due to the fact that the T-poses for different samples in the dataset are not aligned, namely, two similar poses might be represented by different rotations, thus, a direct loss on the rotations can not be applied, unless, the entire set is retargeted to share the T-pose, which is a time consuming operation. Due to the potential difference between the absolute values of rotations that represent the same pose, our network is trained to output rotations with natural velocities distribution using adversarial training. The idea is to focus on the temporal differences of joint rotations rather than their absolute values.

  1. The problem you would like to solve is that two similar poses might be represented by different rotations. But why does this happen? Is it due to different animation design pipelines?
  2. Besides, I am wondering whether the temporal differences of joint rotations can really solve this problem of differing rotation representations. Actually, I guess using the actual angular velocity may help. Please correct me if I am wrong.

Looking forward to more discussion with you.

Shimingyi commented 3 years ago

Hi @longbowzhang

In a motion capture system, the motion is represented by an initial pose and relative rotations. Because there is no standard describing the initial pose, similar poses will be represented by different rotations. Take this example with BVH files from the CMU dataset and from Truebones: I set all the rotations to 0, and you can see the resulting poses are different. If we want to bring the left one to a 'T' pose, we need to apply extra rotations.

[Image: zero-rotation poses of the CMU and Truebones skeletons, showing the different initial poses]

Regarding the angular velocity, we have already had some internal discussion. I agree with you: a difference on the manifold cannot represent 'velocity' the way a finite difference does in Euclidean space. We will find another approach here; angular velocity is an option. Thanks for your useful suggestion!

JinchengWang commented 3 years ago

Hi @Shimingyi, I have a few questions about the discriminator as well:

  1. How did the experiments with adding activations go? Would love to know if there have been any updates :D
  2. It seems that you are only using 3D joint positions from Human3.6M and rotations from the CMU dataset. Isn't it possible to train a discriminator on the absolute values of rotations if you let the network output in the CMU format instead?

Shimingyi commented 3 years ago

Hi, @JinchengWang .

I have added the activation layer in the code, and the current pre-trained model should be fine at the network level. But I haven't updated the experiments on different representations of the 'rotation differences', because I am busy with another project; I plan to do it next month. For the discriminator, there are effectively two kinds of T-pose in our training data: one comes from the network prediction, which is based on our T-pose, and the other is based on the T-pose in the CMU dataset. From my earlier comment in this thread, you can see that these two T-poses are different. Even inside the CMU dataset, the T-pose is influenced by different bone length settings, so some rotation has to be applied in the first frame of the motion part of the BVH file to obtain an initial pose. So I would suggest running a retargeting method on the rotation dataset, so that the rotations can be used as absolute values directly.
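
As a rough illustration of that suggestion (a conceptual sketch only, not a method from this repository; the composition order depends on the rig's joint-frame conventions), retargeting could amount to composing a pre-computed per-joint offset rotation, taking the source rest pose to a shared T-pose, with each frame's local rotation:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions in [w, x, y, z] order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def retarget_frame(local_rotations, rest_offsets):
    """Compose a fixed per-joint rest-pose offset with each joint's local
    rotation for one frame. Whether the offset is pre- or post-multiplied
    (or conjugated) depends on the skeleton's conventions, so treat this
    as a schematic rather than a drop-in retargeter."""
    return [quat_mul(off, q) for off, q in zip(rest_offsets, local_rotations)]
```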

Please let me know if there are any more questions :)

Best, Mingyi