Open ZekaiGalaxy opened 3 months ago
Hi, I'm trying to reproduce the results of Latte-XL/2 on the FFS dataset, but from what I observe, training on 8 A100s is quite slow compared with max_step = 1M (1,000,000):
I used exactly the config in the repo, and it takes 1.5 days to complete 40k of the 1M steps.
So I wonder:
(1) Is the provided checkpoint exactly the 1M-step checkpoint, or is 1M just a high enough threshold?
(2) What was your training speed? How long did it take you to get good results on FFS?
Thank you!
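For reference, extrapolating the observed throughput (40k steps in 1.5 days) to the full 1M-step schedule can be sketched as follows. This is a back-of-the-envelope estimate assuming constant throughput; real speed varies with batch size, data loading, and hardware.

```python
# Rough ETA estimate from observed training throughput.
# Numbers are the ones reported above (8x A100, repo config);
# this assumes throughput stays constant for the whole run.

def eta_days(steps_done: float, days_elapsed: float, max_steps: float) -> float:
    """Extrapolate remaining wall-clock days at the observed step rate."""
    steps_per_day = steps_done / days_elapsed
    return (max_steps - steps_done) / steps_per_day

# 40k steps in 1.5 days, targeting 1M steps total
print(f"~{eta_days(40_000, 1.5, 1_000_000):.0f} days remaining")
```

At this rate the full 1M-step schedule would take over a month, which is why the question about how many steps are actually needed matters.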
Hi, thanks for your great work! May I ask how many iterations it takes to reproduce the results on UCF101?
Hi, thanks for your interest. After several training resumes, I'm not sure exactly how many steps are needed to achieve good results. Training to around 150k steps should give acceptable results.
Thanks for your reply. Were the resumes due to the loss exploding or vanishing?
No, I didn't experience any issues with loss exploding or vanishing. Training is stable.