czh-98 / STAR

Official code for "STAR: Skeleton-aware Text-based 4D Avatar Generation with In-Network Motion Retargeting"
https://star-avatar.github.io/
Apache License 2.0

Several Issues Encountered During Model Training #1

Closed. AlvinYH closed this issue 3 weeks ago

AlvinYH commented 2 months ago

Thank you for publicly releasing your code! However, I encountered several problems while training the model:

  1. At https://github.com/czh-98/STAR/blob/master/lib/dlmesh.py#L909, it appears that the mask and the dense faces are not on the same device. I resolved this by moving the mask to the GPU (a sketch follows the list).
  2. At https://github.com/czh-98/STAR/blob/master/lib/trainer.py#L691, modifying the retarget_pose attribute in the trainer class does not seem to change its value, which then causes a bug at https://github.com/czh-98/STAR/blob/master/lib/dlmesh.py#L874 because retarget_pose remains None. I'm unsure of the underlying reason, but I worked around it by encapsulating the function that sets retarget_pose within the DLMesh class.
  3. I couldn't locate data/FLAME_masks/FLAME.obj after downloading the FLAME Vertex Masks and FLAME Mediapipe Landmark files as described in the README. Could you provide specific instructions on how to obtain this file?
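
For reference, a minimal sketch of the device fix for point 1 (the tensor names below are placeholders rather than the exact ones in dlmesh.py):

```python
import torch

# Placeholder tensors standing in for the ones around dlmesh.py#L909: the faces live on
# the GPU while the mask is created on the CPU, so indexing fails with a device mismatch.
device = "cuda" if torch.cuda.is_available() else "cpu"
dense_faces = torch.zeros(10, 3, dtype=torch.long, device=device)
mask = torch.ones(10, dtype=torch.bool)  # created on the CPU

mask = mask.to(dense_faces.device)       # move the mask to the faces' device before indexing
selected_faces = dense_faces[mask]
```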

Thank you!

czh-98 commented 2 months ago

Hi, thanks for your attention.

Let me know if you have any other questions :)

Jackiemin233 commented 2 months ago

@AlvinYH For question 2, I think you may be using torch 2.0+. I debugged the code and found the cause: on lines 111-113 of train.py, self.model is compiled by torch, which turns the DLMesh into an OptimizedModule that cannot read the initialized retarget_pose sequence. I deleted these three lines and it works for me. :)
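
A minimal sketch of what I mean (the toy module below only stands in for DLMesh, and the exact behavior depends on the torch version):

```python
import torch
import torch.nn as nn

# Toy module mimicking DLMesh: retarget_pose starts as None and is supposed to be
# assigned later by the trainer.
class ToyDLMesh(nn.Module):
    def __init__(self):
        super().__init__()
        self.retarget_pose = None

    def forward(self, x):
        assert self.retarget_pose is not None, "retarget_pose was never set"
        return x + self.retarget_pose

model = ToyDLMesh()
compiled = torch.compile(model)           # roughly what lines 111-113 in train.py do
compiled.retarget_pose = torch.zeros(3)   # assignment goes through the OptimizedModule wrapper

print(type(compiled).__name__)            # OptimizedModule
print(model.retarget_pose)                # depending on the torch version this can still be
                                          # None, so dlmesh.py later sees retarget_pose as None
```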

AlvinYH commented 2 months ago

@czh-98 Thanks for your reply! As @Jackiemin233 mentioned, I did use torch 2.0+, and deleting lines 111-113 fixed the bug. Thank you both! However, with torch 2.0+ I then hit an in-place operation error: RuntimeError: one of the variables needed for gradient computation has been modified by an in-place operation: [torch.cuda.FloatTensor [25193]], which is output 0 of LinalgVectorNormBackward0, is at version 1; expected version 0 instead. The error traces back to line 161 in lib/guidance/shape_reg.py, during the computation of the Laplacian smoothness loss. I have since downgraded to torch 1.12 and resumed training, but I'm curious about the cause of this bug and whether there is a solution other than downgrading the torch version.

czh-98 commented 2 months ago

I tried torch 2.0+ and noticed that this issue is caused by the in-place operation `loss[get_flame_vertex_idx()] *= 5`. I modified it to avoid such operations, i.e., rewriting `a += b` as `a = a + b`. Then it should also work with torch 2.0+.
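
For reference, a self-contained sketch of the problem and the out-of-place rewrite (the toy per-vertex loss and index below only stand in for the real ones in shape_reg.py):

```python
import torch

# Toy per-vertex loss reproducing the reported error: the output of a vector norm is saved
# for backward (LinalgVectorNormBackward0), so scaling part of it in place bumps its version
# counter and backward then fails. `idx` stands in for get_flame_vertex_idx().
verts = torch.randn(100, 3, requires_grad=True)
idx = torch.arange(10)

per_vertex = verts.norm(dim=-1)
# per_vertex[idx] *= 5                # in-place: triggers the RuntimeError reported above

weight = torch.ones_like(per_vertex)  # out-of-place reweighting instead
weight[idx] = 5.0
loss = (per_vertex * weight).mean()
loss.backward()                       # works
```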