EricGuo5513 / HumanML3D

HumanML3D: A large and diverse 3d human motion-language dataset.
MIT License

Questions on data preprocessing #74

Open LinghaoChan opened 1 year ago

LinghaoChan commented 1 year ago

Hi @EricGuo5513 , thanks for your efforts in providing such a significant dataset to the community. I would like to know some details about your data preprocessing stage, and I ran into some problems, listed below. I hope to get your official answer.

(image: screenshot listing the preprocessing questions)
EricGuo5513 commented 1 year ago

Hi, thanks for your interest in our dataset. The following explains these operations one by one:

  1. In inverse_kinematics_np, there is no need to face Z+. We only need to extract the forward direction as the root rotation. Initializing root_quat[0] as (1, 0, 0, 0) is something of a mistake: in my own post-processed data, all motions had already been adjusted to initially face Z+ at this point, so the initialization was only meant as a double check. It is, however, a bug if you follow the provided script to obtain the data. I tried to fix it, but the fix changes the resulting data while the current version is already widely used, so I reverted the change. In any case, since our global rotation representation is velocity based and this only changes the first frame, I don't expect it to make a big difference in the final results.
  2. The second time, this makes all motions face Z+ at the beginning. It is a data processing step that gives all data a uniform initial direction by rotating the whole motion by the facing angle of the first pose (see the sketch after this list). Again, since our global rotation representation is velocity based, I think this step could be skipped, but it keeps things safe.
  3. The third and fourth times are not data processing. Here we want to disentangle the global rotation from the local rotations/positions/velocities, so that the local rotations/positions/velocities carry only global-rotation-invariant local information. This disentanglement is easier for the network to learn. That is why you can see we cancel the global rotation for the positions and velocities of all poses.
  4. In get_cont6d_params, it first gets the rotation-invariant velocity and then gets the root rotation velocity from the rotations. Again, we want to disentangle root rotation and root velocity.
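
For concreteness, here is a minimal numpy sketch of step 2 (rotating a whole motion so the first pose faces Z+), assuming joints of shape (T, J, 3) and the qbetween_np / qrot_np helpers quoted elsewhere in this thread; the hip/shoulder joint indices and function name are illustrative, not the exact processing script.

import numpy as np

def init_face_z_plus(joints, r_hip=2, l_hip=1, sdr_r=17, sdr_l=16):
    # forward direction of the first pose, built from the hip and shoulder axes (illustrative indices)
    across = (joints[0, r_hip] - joints[0, l_hip]) + (joints[0, sdr_r] - joints[0, sdr_l])
    across = across / np.linalg.norm(across)
    forward = np.cross(np.array([0.0, 1.0, 0.0]), across)   # horizontal forward direction
    forward = forward / np.linalg.norm(forward)

    # quaternion that rotates the first-frame forward direction onto Z+
    target = np.array([[0.0, 0.0, 1.0]])
    root_quat_init = qbetween_np(forward[None], target)     # (1, 4)

    # apply this single quaternion to every joint of every frame -> uniform initial facing
    quat = np.ones(joints.shape[:-1] + (4,)) * root_quat_init
    return qrot_np(quat, joints)

Steps 3 and 4 then apply the same qrot_np pattern per frame, rotating positions/velocities by the per-frame root rotation so that only rotation-invariant local information remains.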

Global position for local velocity: I got the idea from PFNN. I guess it should be okay to obtain the local velocity from local positions; they may actually be identical. I haven't had time to validate this myself, but I don't think it will make a big difference. Difference: I didn't expect this. I guess it is because the two calculations have minor discrepancies: for example, the root velocity uses qrot_np(r_rot[1:], xxx), while the local velocity uses qrot_np(np.repeat(r_rot[:-1, None], xxx). In practice we only need to keep the root velocity, and during recovery you should always use the root velocity (dim=1,2).
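
As an illustration of the discrepancy described above, a small numpy sketch of the two velocity computations, assuming positions of shape (T, J, 3) and per-frame root rotations r_rot of shape (T, 4); only the two qrot_np calls mirror the quoted code, the rest (names, function) is illustrative.

import numpy as np

def root_and_local_velocity(positions, r_rot):
    # root linear velocity, rotated by the root rotation of the *next* frame (r_rot[1:])
    root_vel = positions[1:, 0] - positions[:-1, 0]                    # (T-1, 3)
    root_vel = qrot_np(r_rot[1:], root_vel)

    # per-joint velocity, rotated by the root rotation of the *previous* frame (r_rot[:-1]),
    # hence the minor discrepancy with root_vel mentioned above
    local_vel = positions[1:] - positions[:-1]                         # (T-1, J, 3)
    quat = np.repeat(r_rot[:-1, None], positions.shape[1], axis=1)     # (T-1, J, 4)
    return root_vel, qrot_np(quat, local_vel)

During recovery, the root trajectory is re-integrated from root_vel (dims 1 and 2 of the feature vector), which is why only the root velocity needs to be kept consistent in practice.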

Hope this clarifies your concerns.

LinghaoChan commented 1 year ago

@EricGuo5513 I have a similar problem regarding the first point. Could you please explain the effect of the following operation? Why do we need to compute root_quat?

'''Get Root Rotation'''
target = np.array([[0,0,1]]).repeat(len(forward), axis=0)
root_quat = qbetween_np(forward, target)
for chain in self._kinematic_tree:
    R = root_quat
    for j in range(len(chain) - 1):
        # (batch, 3)
        u = self._raw_offset_np[chain[j+1]][np.newaxis,...].repeat(len(joints), axis=0)
        # print(u.shape)
        # (batch, 3)
        v = joints[:, chain[j+1]] - joints[:, chain[j]]
        v = v / np.sqrt((v**2).sum(axis=-1))[:, np.newaxis]
        # print(u.shape, v.shape)
        rot_u_v = qbetween_np(u, v)

        R_loc = qmul_np(qinv_np(R), rot_u_v)

        quat_params[:,chain[j + 1], :] = R_loc
        R = qmul_np(R, R_loc)

@EricGuo5513 I am also confused by this. What is the purpose?

rd20karim commented 10 months ago

@LinghaoChan @EricGuo5513 Could this be the source of the mismatch between body parts and the text reference that I mentioned in issue #85? It seems that not all motions initially face Z+. For HumanML3D skeleton samples whose poses don't face Z+, this results in an incorrect text reference for the pose: a motion executed with the right hand is referenced in the text description as the left hand, and the same may happen with clockwise/counterclockwise, forward/backward, etc.

LinghaoChan commented 10 months ago

> @LinghaoChan @EricGuo5513 Could this be the source of the mismatch between body parts and the text reference that I mentioned in issue #85? It seems that not all motions initially face Z+. For HumanML3D skeleton samples whose poses don't face Z+, this results in an incorrect text reference for the pose: a motion executed with the right hand is referenced in the text description as the left hand, and the same may happen with clockwise/counterclockwise, forward/backward, etc.

Yep. I am still confused.

rd20karim commented 10 months ago

I think there is a relation between this issue and issues #55, #20, #45, and #85. The Z+ initialization and swapping do not seem to work as intended, because there are still samples that don't face the camera view. When I run an animation for the text reference "a person waving with the left hand", the person is actually waving with the right hand. I don't know whether this somehow doesn't appear in the SMPL representation, or whether these samples simply haven't been visualized.

LinghaoChan commented 10 months ago

@rd20karim Can you provide a file with the error, e.g. its filename?

rd20karim commented 10 months ago

@LinghaoChan
All files where the person doesn't face the camera view seem to have this problem. For example, from the test set:

The skeleton faces the opposite view of the camera.

Raises his left arm instead of his right arm. Sample id 158 / references: "a person raises his right arm and then waves at someone", "a person waiving looking straight and then turning attention to the left", "a person raises their hand turns to their right while waving and then stops and lowers their hand".

The right leg executes the motion instead of the left. Sample id 55 / references: "a person kicked with left leg", "kicking foot with arms towards chest", "a person holds both hands up in front of his face and then kicks with his left leg".

LinghaoChan commented 10 months ago

@rd20karim Your indices don't seem to match mine; your id 158 corresponds to 002651 for me.

I visualized the unmirrored and mirrored motions 002651 and M002651. The results look good.

rd20karim commented 10 months ago

@LinghaoChan The problem seems not to appear in the SMPL visualization, as I thought, but in the skeleton-based visualization that uses the 3D joint coordinates from the .npy files: the skeleton doesn't face the camera view, and the left/right parts are inverted relative to the description.

LinghaoChan commented 10 months ago

@rd20karim Could you please share the code for visualization?

rd20karim commented 10 months ago

@LinghaoChan Yes, here is the code; the path may need to be modified.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation

sample_path = "./HumanML3D/new_joints/002651.npy"
joint_poses = np.load(sample_path)  # shape (T, 22, 3)
x = joint_poses[:, :, 0]
y = joint_poses[:, :, 1]
z = joint_poses[:, :, 2]
min_x, min_z, min_y, max_x, max_z, max_y = x.min(), z.min(), y.min(), x.max(), z.max(), y.max()

def plot_frame_3d(x, y, z, fig=None, ax=None):
    if fig is None:
        fig = plt.figure()
    if ax is None:
        ax = plt.axes(projection='3d')
    ax.scatter(x, y, z, c='red', marker='.')
    ax.set_xlim3d([min_x, max_x])
    ax.set_ylim3d([min_z, max_z])
    ax.set_zlim3d([min_y, max_y])
    ax.set_xlabel('X Label')
    ax.set_ylabel('Y Label')
    ax.set_zlabel('Z Label')
    return ax, fig

def animate_3d(x, y, z, fps=20):
    fig = plt.figure()
    ax = plt.axes(projection='3d')
    frames = x.shape[0]

    def animate(i):
        plt.cla()
        ax_f, fig_f = plot_frame_3d(x[i], y[i], z[i], fig, ax)
        return ax_f

    return animation.FuncAnimation(fig, animate,
                                   frames=frames, interval=1000. / float(fps), blit=False)

# note: y and z are swapped here; see the follow-up below
anims = animate_3d(x, z, y)
anims.save("_test.mp4")

rd20karim commented 10 months ago

@LinghaoChan The problem is solved. After discussing with the author, I found that the y and z axes should not be swapped for HumanML3D (unlike KIT-ML); instead, only the camera view should be changed, via elevation and azimuth. This simple detail makes a big difference in the visualization: swapping y and z produces another mirrored version, which doesn't necessarily face the camera view.
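
For reference, a minimal sketch of this fix, assuming the same .npy joint file used in the snippet above; the elevation/azimuth values here are illustrative, not the author's exact settings.

import numpy as np
import matplotlib.pyplot as plt

joint_poses = np.load("./HumanML3D/new_joints/002651.npy")   # (T, 22, 3)
x, y, z = joint_poses[..., 0], joint_poses[..., 1], joint_poses[..., 2]

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# keep the stored (x, y, z) order; do NOT swap y and z
ax.scatter(x[0], y[0], z[0], marker='.')
# change only the camera, via elevation and azimuth
ax.view_init(elev=110, azim=-90)
plt.savefig("frame0_test.png")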

LinghaoChan commented 10 months ago

@rd20karim I am sorry for not replying to you in time. Thanks for your clarification.

sohananisetty commented 5 months ago

> Hi, thanks for your interest in our dataset. The following explains these operations one by one:
>
> 1. In inverse_kinematics_np, there is no need to face Z+. We only need to extract the forward direction as the root rotation. Initializing root_quat[0] as (1, 0, 0, 0) is something of a mistake: in my own post-processed data, all motions had already been adjusted to initially face Z+ at this point, so the initialization was only meant as a double check. It is, however, a bug if you follow the provided script to obtain the data. I tried to fix it, but the fix changes the resulting data while the current version is already widely used, so I reverted the change. In any case, since our global rotation representation is velocity based and this only changes the first frame, I don't expect it to make a big difference in the final results.
>
> 2. The second time, this makes all **motions** face Z+ at the beginning. It is a data processing step that gives all data a uniform initial direction by rotating the whole motion by the facing angle of the first pose. Again, since our global rotation representation is velocity based, I think this step could be skipped, but it keeps things safe.
>
> 3. The third and fourth times are not data processing. Here we want to disentangle the global rotation from the local rotations/positions/velocities, so that the local rotations/positions/velocities carry only global-rotation-invariant **local** information. This disentanglement is easier for the network to learn. That is why you can see we cancel the global rotation for the positions and velocities of all poses.
>
> 4. In get_cont6d_params, it first gets the rotation-invariant velocity and then gets the root rotation velocity from the rotations. Again, we want to disentangle root rotation and root velocity.
>
> Global position for local velocity: I got the idea from PFNN. I guess it should be okay to obtain the local velocity from local positions; they may actually be identical. I haven't had time to validate this myself, but I don't think it will make a big difference. Difference: I didn't expect this. I guess it is because the two calculations have minor discrepancies: for example, the root velocity uses qrot_np(r_rot[1:], xxx), while the local velocity uses qrot_np(np.repeat(r_rot[:-1, None], xxx). In practice we only need to keep the root velocity, and during recovery you should always use the root velocity (dim=1,2).
>
> Hope this clarifies your concerns.

For disentangling the root orientation, don't you have to use the inverse of the orientation? Also, r_rot = (1, 0, 0, 0), so why would rotating by this have any effect?