How long will the training take?

yifanlu0227 commented 3 months ago

Thanks for your impressive work!

I trained the base-resolution model on 8 A100 GPUs, but the progress bar estimates a total training time of over a week. Is this normal?

Steps:  26%|██▌       | 106314/410550 [58:15:26<149:16:23,  1.77s/it, loss=0.0122, lr0=8e-5]
Steps:  26%|██▌       | 106372/410550 [58:16:59<145:08:15,  1.72s/it, loss=0.16, lr0=8e-5]  
Steps:  26%|██▌       | 106430/410550 [58:19:27<166:34:50,  1.97s/it, loss=0.151, lr0=8e-5]
Steps:  26%|██▌       | 106488/410550 [58:20:48<151:37:45,  1.80s/it, loss=0.0792, lr0=8e-5]
Steps:  26%|██▌       | 106546/410550 [58:22:49<158:58:54,  1.88s/it, loss=0.121, lr0=8e-5] 
Steps:  26%|██▌       | 106604/410550 [58:24:31<156:07:56,  1.85s/it, loss=0.0271, lr0=8e-5]
Steps:  26%|██▌       | 106662/410550 [58:26:46<167:58:01,  1.99s/it, loss=0.0966, lr0=8e-5]
Steps:  26%|██▌       | 106720/410550 [58:28:07<152:58:54,  1.81s/it, loss=0.0423, lr0=8e-5]
Steps:  26%|██▌       | 106778/410550 [58:30:36<171:55:15,  2.04s/it, loss=0.0431, lr0=8e-5]
Steps:  26%|██▌       | 106836/410550 [58:32:36<172:49:27,  2.05s/it, loss=0.0184, lr0=8e-5]
Steps:  26%|██▌       | 106894/410550 [58:34:31<171:09:10,  2.03s/it, loss=0.189, lr0=8e-5] 
Steps:  26%|██▌       | 106952/410550 [58:36:03<159:48:38,  1.90s/it, loss=0.148, lr0=8e-5]
Steps:  26%|██▌       | 107010/410550 [58:38:01<163:17:47,  1.94s/it, loss=0.126, lr0=8e-5]
Steps:  26%|██▌       | 107068/410550 [58:39:42<158:40:36,  1.88s/it, loss=0.423, lr0=8e-5]
Steps:  26%|██▌       | 107126/410550 [58:41:24<155:23:04,  1.84s/it, loss=0.0905, lr0=8e-5]
Steps:  26%|██▌       | 107184/410550 [58:43:09<154:28:57,  1.83s/it, loss=0.144, lr0=8e-5] 
Steps:  26%|██▌       | 107242/410550 [58:45:27<168:08:17,  2.00s/it, loss=0.0132, lr0=8e-5]
Steps:  26%|██▌       | 107300/410550 [58:47:20<166:58:52,  1.98s/it, loss=0.082, lr0=8e-5] 
Steps:  26%|██▌       | 107358/410550 [58:49:02<161:19:43,  1.92s/it, loss=0.0878, lr0=8e-5]
Steps:  26%|██▌       | 107416/410550 [58:50:58<163:25:30,  1.94s/it, loss=0.0937, lr0=8e-5]
Steps:  26%|██▌       | 107474/410550 [58:53:03<168:44:02,  2.00s/it, loss=0.213, lr0=8e-5]

flymin commented 3 months ago

The speed looks to be around 2 s/it. If you are using our default setting (bs=3 × 8 GPUs), that speed is fine. As for training epochs, results generated after 100-150 epochs should already look good, and that much training should finish in about 2 days. Longer training gives better quantitative results, which is why we set the number of training epochs to 450.
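For anyone who wants to redo this estimate for their own run, here is a minimal sketch of the arithmetic. It is not part of the MagicDrive code. The step count comes from the progress bar above, the epoch count is the 450 quoted in this reply (assumed to match the run in the log), and the 1.9 s/it figure is an eyeballed average of the logged speeds. The function names are hypothetical.

```python
# Back-of-the-envelope training-time estimate from tqdm progress-bar numbers.
# TOTAL_EPOCHS is taken from the reply and SEC_PER_IT is an eyeballed average,
# so treat both as assumptions rather than measured values.

TOTAL_STEPS = 410_550   # total steps shown in the progress bar
TOTAL_EPOCHS = 450      # epoch count quoted in the reply (assumed to match this run)
SEC_PER_IT = 1.9        # rough average of the logged speeds (about 1.7 to 2.1 s/it)


def hours_for_steps(steps: float, sec_per_it: float = SEC_PER_IT) -> float:
    """Wall-clock hours needed to run `steps` iterations."""
    return steps * sec_per_it / 3600.0


def hours_for_epochs(epochs: int,
                     total_steps: int = TOTAL_STEPS,
                     total_epochs: int = TOTAL_EPOCHS,
                     sec_per_it: float = SEC_PER_IT) -> float:
    """Wall-clock hours for the first `epochs` epochs, assuming equal-length epochs."""
    steps_per_epoch = total_steps / total_epochs
    return hours_for_steps(epochs * steps_per_epoch, sec_per_it)


if __name__ == "__main__":
    print(f"full run: {hours_for_steps(TOTAL_STEPS) / 24:.1f} days")
    for n in (100, 150):
        print(f"{n} epochs: {hours_for_epochs(n) / 24:.1f} days")
```

Under these assumptions the full schedule comes to roughly 217 hours (about 9 days), and 100 to 150 epochs to roughly 48 to 72 hours. Plug in your own s/it and step count if your batch size or hardware differs.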

yifanlu0227 commented 3 months ago

Thanks for your reply!

cure-lab / MagicDrive