Shimingyi / MotioNet

A deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video [ToG 2020]
https://rubbly.cn/publications/motioNet/
BSD 2-Clause "Simplified" License

some questions about the code!!! #20

Closed ccsvd closed 3 years ago

ccsvd commented 3 years ago

Hi @Shimingyi, thanks a lot for the amazing work and for sharing the code. I read the paper and the code, and I have some questions:

  1. The paper says a random integer is drawn as the clip length at every iteration, but in the code the function set_sequences is only called once at the beginning, so the clip length never changes across epochs.
  2. The pretrained model wild_gt_tcc is 73M, but the model trained with python train.py -n wild -d 1 --kernel_size 5,3,1 --stride 3,1,1 --dilation 1,1,1 --channel 1024 --confidence 1 --translation 1 --contact 1 --loss_term 1101 is 145M.
  3. Training with the same command crashes in self.branch_S because the stride is 3. According to the model configuration at the end of the paper, Es should use stride 1, not 3, and Eq should use stride 3, not 1, right?
  4. The paper says 'Since global information is discarded from the normalized local representation, we append to it the global 2D velocity (per-frame).' What is the global 2D velocity? The code feeds the normalized 2D pose (no global information) into self.branch_Q to predict the root z. Why not use output_Q directly?
  5. In the camera augmentation, the code uses 'augment_depth', which means only the global translation is adjusted, not the orientation, right?
  6. About the reference T-pose: what does the loss loss_D = torch.mean(torch.norm((D_real - 1) ** 2)) + torch.mean(torch.sum((D_fake) ** 2, dim=-1)) mean? And why use the CMU dataset?

Looking forward to your reply, thanks!

Shimingyi commented 3 years ago

Hi @chenshudong ,

Thanks for your feedback. Some of these are minor mistakes of mine in the code; let me explain more.

  1. I keep updating the clip length in my own code, but in the released version I forgot to call this function after every epoch. It will be fixed in the next commit.
  2. I would not recommend evaluating the model with the contact loss and confidence values, because they are designed for in-the-wild inference and are not optimized for a specific test dataset. But a 145mm error is a strange number. I checked and cannot reproduce it; can you send me more details?
  3. The network configuration can be changed based on the task. I expect a wider receptive field when we evaluate on wild video, because it makes the results smoother, so we choose stride==3 in the first convolution block. I found the reason for the crash: this kind of configuration [k=5,3,1, s=3,1,1, stage=2] has to be applied to a longer sequence, requiring at least 101 frames. But I randomly select the clip length between 40 and 200, which causes the crash. I will update the configurations to fix this bug.
  4. The global information is predicted by branch_Q; you can check this code. The global 2D velocity is the per-frame position update of the root joint, in pixels. When you see a person moving in an image, the translation in xy is related to the global information in xyz space. So we use it and only predict a depth factor to describe the global information.
  5. Yes, we didn’t apply augmentation to the global orientation.
  6. It’s a discriminator loss, commonly used in adversarial training. I suggest reading this tutorial first and then the Adversarial Rotation Loss part of our paper. Because of the ambiguities between the T-pose configurations of different datasets, we can only apply the adversarial loss to the angular velocity.
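
For item 6, the quoted loss_D has the shape of a least-squares discriminator objective: scores for real samples are pushed toward 1 and scores for generated samples toward 0. Below is a minimal sketch in plain Python, assuming one scalar score per sample. The function name lsgan_discriminator_loss is hypothetical, and the repository line applies torch.norm to the real term on tensors, so its numbers will not match this simplified form exactly.

```python
from statistics import fmean


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss: mean((D(real) - 1)^2) + mean(D(fake)^2).

    d_real, d_fake: lists of scalar discriminator scores (an assumption made
    for this sketch; the real code works on tensors).
    """
    real_term = fmean([(score - 1.0) ** 2 for score in d_real])
    fake_term = fmean([score ** 2 for score in d_fake])
    return real_term + fake_term


if __name__ == "__main__":
    # A discriminator that scores real as 1 and fake as 0 has zero loss.
    print(lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0]))
```

The loss grows as real scores drift away from 1 or fake scores drift away from 0, which is what trains the discriminator to separate the two.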

You can reply in this thread if you have any other questions.

Best, Mingyi

ccsvd commented 3 years ago

Thanks for the reply, @Shimingyi. I think I understand, and I will read the tutorial. About the second point, I meant that the model size is 145M, not 145mm. Sorry that I did not describe it clearly. I would like to know the training configuration that gives the 73M model.

Shimingyi commented 3 years ago

The model size is also related to the model configuration, like --stage_number or --channel_size. A smaller size means I used another configuration for this task. By the way, I committed a new version so you won't hit the crash from Q3. Please check it :)
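
The crash in Q3 came from drawing clip lengths between 40 and 200 while the chosen configuration needs at least 101 frames. One way to avoid that, sketched here under assumptions, is to redraw the length every epoch and clamp the lower bound to the minimum the network accepts. The helper names sample_clip_length and clip_lengths_per_epoch are hypothetical, and this is not necessarily how the new commit fixes it; the numbers 40, 200 and 101 are the ones mentioned in this thread.

```python
import random


def sample_clip_length(rng, low=40, high=200, min_frames=101):
    """Draw a clip length in [low, high] that is never below min_frames."""
    lower = max(low, min_frames)
    if lower > high:
        raise ValueError("min_frames exceeds the sampling range")
    return rng.randint(lower, high)  # randint includes both endpoints


def clip_lengths_per_epoch(num_epochs, seed=0):
    """Redraw the clip length once per epoch instead of once per run."""
    rng = random.Random(seed)
    return [sample_clip_length(rng) for _ in range(num_epochs)]


if __name__ == "__main__":
    print(clip_lengths_per_epoch(5))
```

The minimum depends on kernel sizes, strides and the number of stages, so min_frames would have to change together with the configuration.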

ccsvd commented 3 years ago

Thanks, I will check it. By the way, could you please share the causal version of the code and model for real-time use? :)

Shimingyi commented 3 years ago

I don't plan to release the causal version in this repo because wild video requires some tricks for real-time performance. In the coming months I will have a new project that does better in the real-time setting; you can follow that if you are interested :)

ccsvd commented 3 years ago

Hi, I followed the code of add_noise() and applied it to point_2d_gt, and I checked that the values of my poses_2d_noised are the same as yours. But when I decode the 2D pose to image space and display it, the result looks very bad, like this:

[image: 1606382849(1)]

[image: 1606382737(1)]

It jitters and is severely wrong. I think it deviates from the distribution of the 2D detector's output at inference time, so I am confused about this.

Shimingyi commented 3 years ago

I think the results look fine.

We use two strategies to simulate the noise in wild video. First we add random noise to the joint locations, and then we delete some joint values to mimic missed detections (they are set to zero, the same as OpenPose does). Refer: code

In the first figure, what you see is the result when the head joint has been set to zero, so it sits at the root position. In this case, another value called confidence is fed into the network, which tells the network 'ignore this joint because it's not accurate'. This is the main idea behind how we adapt to wild video with noisy OpenPose output.
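
The two strategies above can be sketched for a single frame as follows. This is an illustration under assumptions, not the repository's add_noise(): the flat layout [x0, y0, x1, y1, ...], the separate per-joint confidence list, the parameters sigma and drop_prob, and the choice to zero the confidence of a dropped joint are all assumed here.

```python
import random


def simulate_detector_noise(pose_xy, confidence, rng, sigma=2.0, drop_prob=0.1):
    """Jitter 2D joints and randomly drop some, as a detector might.

    pose_xy: flat list [x0, y0, x1, y1, ...] for one frame (assumed layout).
    confidence: one value per joint.
    Returns new lists; the inputs are left unchanged.
    """
    # Strategy 1: add Gaussian noise to every coordinate.
    noisy = [value + rng.gauss(0.0, sigma) for value in pose_xy]
    conf = list(confidence)
    # Strategy 2: mark some joints as missed detections.
    for joint in range(len(conf)):
        if rng.random() < drop_prob:
            noisy[joint * 2] = 0.0      # x set to zero
            noisy[joint * 2 + 1] = 0.0  # y set to zero
            conf[joint] = 0.0           # assumed: flag the joint as unreliable
    return noisy, conf


if __name__ == "__main__":
    rng = random.Random(0)
    pose = [10.0, 20.0, 30.0, 40.0]
    print(simulate_detector_noise(pose, [0.9, 0.9], rng))
```

A dropped joint then shows up at the origin of whatever coordinate frame the pose is stored in, which is why it appears at the root position once the pose is root-relative.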

ccsvd commented 3 years ago

Yes, I understand. For a missing point, both the coordinates and the confidence are set to 0, right? I read add_noise() carefully and tested it, and I found that only the coordinates of the missing point are set to 0, while the confidence is still a high value (e.g. 0.86). So is it a bug, or do I misunderstand?

ccsvd commented 3 years ago

Sorry again. I think the code

pose_array[deleted_index, joint_index * 2] = 0
pose_array[deleted_index, joint_index * 2 + 1] = 0

should use item_index (the h36m index), not joint_index. If I am mistaken, please correct me.

Shimingyi commented 3 years ago

Yes, they are bugs. I will push a new commit after checking the performance.