XiangLi1999 / Diffusion-LM


Training on A100 #53

Open mathematiguy opened 1 year ago

mathematiguy commented 1 year ago

Hi there,

I'm training the model on an 80GB A100 GPU and I'm having trouble replicating the claim that the model trains for 200K steps in under 5 hours. So far I'm using the flags given in the README, but I'm wondering whether you used any others to make training that fast on one GPU, such as a larger batch size. My GPU utilisation looks high, so I'd expect it to be converging faster.

In my case it seems to be taking twice as long: it looks like it will converge in around 10 hours for E2E. Since ROCStory takes four times as many steps (800K), I'm guessing that will take about two days, which seems like a lot of extra time.
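For what it's worth, the two-day figure is just a linear extrapolation from the E2E run. A quick back-of-envelope check (assuming throughput stays constant across the two runs, which may not hold if ROCStory uses different sequence lengths or batch sizes):

```python
# Back-of-envelope estimate: extrapolate ROCStory training time from the
# observed E2E throughput. Numbers are from my run, not official benchmarks.
e2e_steps = 200_000        # E2E training steps per the README
e2e_hours = 10             # wall-clock time I'm observing on the A100
rocstory_steps = 800_000   # 4x as many steps

steps_per_hour = e2e_steps / e2e_hours
rocstory_hours = rocstory_steps / steps_per_hour
print(f"~{rocstory_hours:.0f} hours (~{rocstory_hours / 24:.1f} days)")
# → ~40 hours (~1.7 days)
```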

I understand you are very busy, so if you have time to respond that would be great. Otherwise I will just put up with the extra time for now.