Hi @yuyukuy534, what do you mean by "round of training"? On our machines with an NVIDIA 2080 Ti, one epoch takes roughly 30 seconds. What's your hardware?
Hello @vittoriopippi, I trained on an A100 GPU server, but each epoch takes more than 700 seconds. What could be the reason for this? I hope you can offer some guidance.
I've just cloned the repo and set everything up, and these are the results I get:
Loading... files/resnet_18_pretrained.pth
Starting training
Epoch 0 14.29% running, current time: 10.55 s
Epoch 0 33.33% running, current time: 20.59 s
Epoch 0 52.38% running, current time: 31.75 s
Epoch 0 71.43% running, current time: 42.14 s
Epoch 0 90.48% running, current time: 52.21 s
{'EPOCH': 0, 'TIME': 55.58703565597534, 'LOSSES': {'G': tensor(6.4954, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(0.9818, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0.9818, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(-0.0286, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(-5.8736, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(21.6562, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.6772, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 1 16.67% running, current time: 10.00 s
Epoch 1 35.71% running, current time: 20.29 s
Epoch 1 54.76% running, current time: 30.49 s
Epoch 1 73.81% running, current time: 40.80 s
Epoch 1 92.86% running, current time: 50.82 s
{'EPOCH': 1, 'TIME': 53.38597583770752, 'LOSSES': {'G': tensor(3.3875, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(2.0659, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(2.0659, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0086, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(2.8220, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(20.6295, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.2199, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
I'm sorry, but I can't help you. There are probably some issues with your machine.
I've changed the np.random.choice call inside the model.py file to random.choices from the random module. With the same machine as before, these are the improvements:
Loading... files/resnet_18_pretrained.pth
Starting training
Epoch 0 35.71% running, current time: 10.08 s
Epoch 0 76.19% running, current time: 20.57 s
{'EPOCH': 0, 'TIME': 25.470508575439453, 'LOSSES': {'G': tensor(4.6899, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.4034, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(1.1612, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0.2422, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(-0.0126, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(-5.4993, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(20.0668, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.2669, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 1 38.10% running, current time: 10.72 s
Epoch 1 80.95% running, current time: 21.37 s
{'EPOCH': 1, 'TIME': 25.23715090751648, 'LOSSES': {'G': tensor(0.2600, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.0286, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0.0725, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0.9562, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0041, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(3.8107, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(49.2529, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.1246, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 2 38.10% running, current time: 10.43 s
Epoch 2 78.57% running, current time: 20.45 s
{'EPOCH': 2, 'TIME': 25.29304313659668, 'LOSSES': {'G': tensor(1.5281, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.2566, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(1.2566, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0222, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(8.0757, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(127.7535, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.3174, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
I suggest you do the same.
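If it helps, the substitution is roughly the following (a minimal sketch; samples and k are placeholder names, and the actual call site in model.py may look different):

import random
import numpy as np

samples = list(range(1000))  # hypothetical stand-in for whatever model.py samples from
k = 16                       # hypothetical number of items drawn per step

# Before: NumPy's generic choice routine carries noticeable per-call overhead
# when it sits on the training hot path.
idx = np.random.choice(len(samples), size=k, replace=True)
picked_np = [samples[i] for i in idx]

# After: the stdlib random.choices draws k items (with replacement) in one
# lightweight call, which is what removed the slowdown here.
picked_py = random.choices(samples, k=k)

print(picked_np[:3], picked_py[:3])

Note that both draw with replacement; if the original code relied on replace=False or on a probability vector p, the stdlib equivalent would need to be adjusted accordingly.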
OK, I'll try it. Thank you very much for your reply and guidance.
Hello, I'm sorry to bother you. Why does each epoch take more than 700 seconds after running python train.py? Is this normal?