JeremyCJM / DiffSHEG

[CVPR'24] DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
https://jeremycjm.github.io/proj/DiffSHEG/
BSD 3-Clause "New" or "Revised" License

How can motion be aligned with speech in the temporal dimension? #3

Closed · ggg66 closed this issue 4 months ago

ggg66 commented 4 months ago

I saw this method in your paper. Could you please tell me where it is implemented in the code? Thanks in advance, and great work!

JeremyCJM commented 4 months ago

Sure. As mentioned in the paper, the audio embedding and the motion are aligned by directly concatenating them along the temporal dimension. The corresponding code is here: https://github.com/JeremyCJM/DiffSHEG/blob/3ebf3058f48cba3da9146afb7623e9ec1ab9e9a5/models/transformer.py#L307
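
For readers landing here, a minimal sketch of this kind of temporal concatenation (illustrative only; the module and variable names are assumptions, not the repo's actual classes). Per-frame audio features and motion tokens at the same frame rate are stacked along the sequence axis so a transformer can attend across both modalities:

```python
import torch
import torch.nn as nn

class AudioMotionFusion(nn.Module):
    """Sketch: fuse per-frame audio embeddings with motion tokens by
    concatenating along the temporal (sequence) dimension, then let a
    transformer encoder attend across both modalities."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, audio_emb, motion):
        # audio_emb: (B, T, D) per-frame audio features
        # motion:    (B, T, D) motion tokens at the same frame rate
        x = torch.cat((audio_emb, motion), dim=1)  # (B, 2T, D): temporal concat
        x = self.encoder(x)
        # keep only the motion half of the sequence as the output
        return x[:, audio_emb.shape[1]:, :]

fusion = AudioMotionFusion()
audio = torch.randn(2, 88, 256)   # (B, T, D)
motion = torch.randn(2, 88, 256)
out = fusion(audio, motion)       # (B, T, D), frame-aligned with the audio
```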

ggg66 commented 4 months ago

In the loss computation at https://github.com/JeremyCJM/DiffSHEG/blob/3ebf3058f48cba3da9146afb7623e9ec1ab9e9a5/models/gaussian_diffusion.py#L1369, I noticed that the channel dimension C becomes twice its original size in this line, but I couldn't find where this change of shape happens. Thanks for your reply!
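
For context, one common reason for a doubled channel dimension in `gaussian_diffusion.py` files adapted from OpenAI's improved/guided-diffusion is a learned variance: the network outputs 2*C channels, and the loss code splits them into a noise prediction and per-channel variance weights. Whether this is the cause here is an assumption, not a confirmed answer for DiffSHEG; the sketch below (all sizes illustrative) only shows the general pattern:

```python
import torch

# Illustrative shapes only: B = batch, C = feature channels, T = frames.
B, C, T = 4, 141, 88
# With a learned variance, the model emits 2*C channels per frame.
model_output = torch.randn(B, 2 * C, T)
assert model_output.shape == (B, C * 2, T)
# First C channels: noise (epsilon) prediction; last C: variance weights.
eps_pred, var_values = torch.split(model_output, C, dim=1)  # each (B, C, T)
```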