The epoch does not increase when training

geopavlakos / hamer

HaMeR: Reconstructing Hands in 3D with Transformers

https://geopavlakos.github.io/hamer/

MIT License

The epoch does not increase when training #17

Closed by xiaoyudanaa 6 months ago

xiaoyudanaa commented 7 months ago

Dear author, I'm a student of Yebin Liu at Tsinghua University, and I have a problem. I trained HaMeR on 8×A800 GPUs with the original configuration, but the epoch counter stays at 0 all the way up to 1,000,000 steps. When I then run the demo with the resulting checkpoint, the visualisation results do not match what is expected. What can I do to fix this? Also, could you share the pre-processing steps for the tar files? Looking forward to your reply.

xiaoyudanaa commented 7 months ago

When training ends automatically, I get the following weights file: epoch=0-step=1000000.ckpt
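One possible reading of the epoch=0 in this file name, offered as an assumption and not a confirmed diagnosis: if the training data is streamed from an iterable with no defined end, a step-limited training loop never completes a full pass, so the epoch counter never advances. The toy loop below is plain Python, not the HaMeR or PyTorch Lightning code. The names `run_training` and `checkpoint_name` are hypothetical, and the file-name pattern simply mimics the one quoted above.

```python
from itertools import count


def run_training(batches, max_steps):
    """Toy loop: the epoch counter advances only when `batches()` is exhausted."""
    epoch, step = 0, 0
    while step < max_steps:
        for _ in batches():
            step += 1
            if step >= max_steps:
                # Step budget reached in the middle of a pass.
                return epoch, step
        # Only reached when the data iterator has a finite end.
        epoch += 1
    return epoch, step


def checkpoint_name(epoch, step):
    # Hypothetical helper mimicking the file-name pattern seen in this thread.
    return f"epoch={epoch}-step={step}.ckpt"


# Assumption: an endless stream of batches (no defined dataset length).
endless = run_training(lambda: count(), max_steps=1000)
print(checkpoint_name(*endless))   # epoch=0-step=1000.ckpt

# A finite dataset of 250 batches per pass, for comparison.
finite = run_training(lambda: range(250), max_steps=1000)
print(checkpoint_name(*finite))    # epoch=3-step=1000.ckpt
```

Under this assumption, an epoch of 0 in the file name would not by itself indicate that training failed. It does not explain the mismatched visualisation results, which need a separate check.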

linjiangya commented 7 months ago

Same problem here. I trained on a single A100 and got an epoch=0-step=1000000.ckpt and a hamer_last.ckpt. I then loaded epoch=0-step=1000000.ckpt, but it does not work at all (the output is always fluctuating). Moreover, the loss does not seem to converge in the log file, and the epoch number is also 0.

Here are examples using the official checkpoint and my own checkpoint:

(using epoch=0-step=1000000.ckpt) (using the officially released model hamer.ckpt)

Here's the tensorboard log:

xiaoyudanaa commented 7 months ago

Hi, can I have your contact details? Let's discuss this together, my WeChat is: Weafree

linjiangya commented 7 months ago

Sure, I have added you.

geopavlakos commented 6 months ago

Replied in issue #24. Could you pull the latest changes and try again?