EricGuo5513 / text-to-motion

Official implementation for "Generating Diverse and Natural 3D Human Motions from Texts (CVPR2022)."
MIT License

Number of motions seems less than the Real one #29

Open rd20karim opened 1 year ago

rd20karim commented 1 year ago

Hello, I have a question regarding the method used to compute the number of motions. While investigating your MotionDataset and dataloader, it appears that the number of motions reported during training is counted as the total number of motion snippets across the training subset, divided by the batch size. I can understand the reasoning behind this approach, but it leads to a significant reduction of the real data size. Specifically, the number of motions in the training set was initially 23,384. After removing motions shorter than window_size = 64, it decreased to 20,942, and it was further reduced to 14,435 training motions by the aforementioned method.
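
For reference, here is a minimal sketch of the counting behavior I am describing; the function name and the way snippets are derived from each motion are my own assumptions for illustration, not the repository's exact code:

```python
import numpy as np

def snippet_count_vs_motion_count(motion_lengths, window_size=64, batch_size=128):
    """Illustrates how the reported count can shrink relative to the raw motion count."""
    # Step 1: drop motions shorter than the training window.
    kept = [length for length in motion_lengths if length >= window_size]
    # Step 2 (assumption): each kept motion contributes length // window_size snippets.
    total_snippets = int(np.sum([length // window_size for length in kept]))
    # Step 3: the number reported by the dataloader is snippets // batch_size,
    # i.e. batches per epoch, which is much smaller than len(kept).
    reported = total_snippets // batch_size
    return len(kept), total_snippets, reported
```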

Your clarification on this behavior would be greatly appreciated. Thank you.

EricGuo5513 commented 1 year ago

Hi, I am not getting your problem very clearly. Could you also attach the part codes here?


rd20karim commented 1 year ago

The final value of cumsum is the total number of motion snippets across a given split.

image

For example, when the dataloader is constructed for validation, we have the real number of motions, 1,300, but when we display len(val_loader) it shows 911.

image

The __len__ method of the batch sampler is called (line 240), returning roughly 116698 / 128 = 911, which is then taken as the number of motions when training the VQ-VAE on HumanML3D.

image

Also, the total number of motions here is counted before the augmentation process, so the mirrored motions are not included.
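
To make the arithmetic above concrete, here is a minimal sketch of how the batch sampler length relates to the snippet count; the numbers are the ones from my run, and the floor division is my assumption about the __len__ implementation:

```python
total_snippets = 116698  # final value of cumsum for the split
batch_size = 128

# len(dataloader) appears to count batches, not motions:
batches_per_epoch = total_snippets // batch_size
print(batches_per_epoch)  # 911
```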

EricGuo5513 commented 1 year ago

Hi, at the dataloader stage, the dataset should already contain the original motions plus the mirrored motions. Please also note that this MotionDataset is only used for training the autoencoders. Also, 911 is not the number of motions, but the number of iterations in each epoch. Hope this clarifies things.
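
A minimal sketch of the augmentation point above, assuming the mirrored copy of each motion is added to the sample list before the dataloader is built (a simplification; in HumanML3D the mirrored motions are prepared as separate samples ahead of time):

```python
def build_sample_pool(motions, mirror_fn):
    """Each original motion and its mirrored copy count as separate training samples."""
    pool = []
    for motion in motions:
        pool.append(motion)             # original motion
        pool.append(mirror_fn(motion))  # left/right mirrored motion
    return pool
```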


rd20karim commented 1 year ago

Okay, thanks for clarifying. So it seems you trained the autoencoder on sampled snippets of 64 frames, treating them individually across all the training data.

Regarding your work on text generation using TM2T with HumanML3D, the preprocessing was limited to filtering out motions that have fewer than 3 text descriptions. Additionally, the motion length was constrained to between 40 and 200 frames. I wanted to know if I am missing any other details. Is there any constraint on the maximum text length generated during inference or training?
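
For completeness, a minimal sketch of the two filters described above; the function and argument names are hypothetical, and only the thresholds come from the description:

```python
def keep_sample(num_text_descriptions, num_frames,
                min_texts=3, min_frames=40, max_frames=200):
    """Keep a motion only if it has at least 3 texts and between 40 and 200 frames."""
    return (num_text_descriptions >= min_texts
            and min_frames <= num_frames <= max_frames)
```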

I was also curious about how the given split was generated, because it seems to significantly impact the overall performance when the split is changed.