Hi @yuyukuy534, what do you mean by "round of training"? On our machines with an NVIDIA 2080 Ti, one epoch takes roughly 30 seconds. What's your hardware?
Hello @vittoriopippi, I trained on an A100 GPU server, but each epoch takes more than 700 seconds. What could be the reason for this? I hope you can offer some guidance.
I've just cloned the repo and set everything up, and these are the results I get:
Loading... files/resnet_18_pretrained.pth
Starting training
Epoch 0 14.29% running, current time: 10.55 s
Epoch 0 33.33% running, current time: 20.59 s
Epoch 0 52.38% running, current time: 31.75 s
Epoch 0 71.43% running, current time: 42.14 s
Epoch 0 90.48% running, current time: 52.21 s
{'EPOCH': 0, 'TIME': 55.58703565597534, 'LOSSES': {'G': tensor(6.4954, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(0.9818, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0.9818, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(-0.0286, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(-5.8736, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(21.6562, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.6772, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 1 16.67% running, current time: 10.00 s
Epoch 1 35.71% running, current time: 20.29 s
Epoch 1 54.76% running, current time: 30.49 s
Epoch 1 73.81% running, current time: 40.80 s
Epoch 1 92.86% running, current time: 50.82 s
{'EPOCH': 1, 'TIME': 53.38597583770752, 'LOSSES': {'G': tensor(3.3875, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(2.0659, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(2.0659, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0086, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(2.8220, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(20.6295, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.2199, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
I'm sorry, but I can't help you. There are probably some issues with your machine.
I've changed the np.random.choice call inside the model.py file to random.choices from the random module. With the same machine as before, these are the improvements:
Loading... files/resnet_18_pretrained.pth
Starting training
Epoch 0 35.71% running, current time: 10.08 s
Epoch 0 76.19% running, current time: 20.57 s
{'EPOCH': 0, 'TIME': 25.470508575439453, 'LOSSES': {'G': tensor(4.6899, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.4034, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(1.1612, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0.2422, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(-0.0126, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(-5.4993, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(20.0668, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.2669, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 1 38.10% running, current time: 10.72 s
Epoch 1 80.95% running, current time: 21.37 s
{'EPOCH': 1, 'TIME': 25.23715090751648, 'LOSSES': {'G': tensor(0.2600, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.0286, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0.0725, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0.9562, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0041, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(3.8107, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(49.2529, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.1246, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 2 38.10% running, current time: 10.43 s
Epoch 2 78.57% running, current time: 20.45 s
{'EPOCH': 2, 'TIME': 25.29304313659668, 'LOSSES': {'G': tensor(1.5281, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.2566, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(1.2566, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0222, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(8.0757, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(127.7535, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.3174, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
I suggest you do the same.
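If it helps, the substitution is roughly the following (a minimal sketch; samples and k are placeholder names, and the actual call site in model.py may look different):

import random
import numpy as np

samples = list(range(1000))  # hypothetical stand-in for whatever model.py samples from
k = 16                       # hypothetical number of items drawn per step

# Before: NumPy's generic choice routine carries noticeable per-call overhead
# when it sits on the training hot path.
idx = np.random.choice(len(samples), size=k, replace=True)
picked_np = [samples[i] for i in idx]

# After: the stdlib random.choices draws k items (with replacement) in one
# lightweight call, which is what removed the slowdown here.
picked_py = random.choices(samples, k=k)

print(picked_np[:3], picked_py[:3])

Note that both draw with replacement; if the original code relied on replace=False or on a probability vector p, the stdlib equivalent would need to be adjusted accordingly.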
OK, I'll try it. Thank you very much for your reply and guidance.
Hello, I'm sorry to bother you. Why does each epoch take more than 700 seconds after running python train.py? Is this normal?