aimagelab / VATr


Each epoch takes more than 700 seconds when training with python train.py #17

Closed yuyukuy534 closed 11 months ago

yuyukuy534 commented 11 months ago

Hello, sorry to bother you. Why does each epoch take more than 700 seconds after running python train.py? Is this normal?

vittoriopippi commented 11 months ago

Hi @yuyukuy534, what do you mean by "round of training"? On our machines with an NVIDIA 2080 Ti, one epoch takes roughly 30 seconds. What's your hardware?

yuyukuy534 commented 11 months ago

Hello @vittoriopippi, I trained on an A100 GPU server, but each epoch takes more than 700 seconds. What could be the reason for this? I hope you can offer some guidance and help.

[screenshot of training log attached]
vittoriopippi commented 11 months ago

I've just cloned the repo and set everything up, and I received these results:

Loading... files/resnet_18_pretrained.pth
Starting training
Epoch 0 14.29% running, current time: 10.55 s
Epoch 0 33.33% running, current time: 20.59 s
Epoch 0 52.38% running, current time: 31.75 s
Epoch 0 71.43% running, current time: 42.14 s
Epoch 0 90.48% running, current time: 52.21 s
{'EPOCH': 0, 'TIME': 55.58703565597534, 'LOSSES': {'G': tensor(6.4954, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(0.9818, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0.9818, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(-0.0286, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(-5.8736, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(21.6562, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.6772, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 1 16.67% running, current time: 10.00 s
Epoch 1 35.71% running, current time: 20.29 s
Epoch 1 54.76% running, current time: 30.49 s
Epoch 1 73.81% running, current time: 40.80 s
Epoch 1 92.86% running, current time: 50.82 s
{'EPOCH': 1, 'TIME': 53.38597583770752, 'LOSSES': {'G': tensor(3.3875, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(2.0659, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(2.0659, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0086, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(2.8220, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(20.6295, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.2199, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}

I'm sorry, but I can't help you. There are probably some issues with your machine.

vittoriopippi commented 11 months ago

I've changed the np.random.choice call inside the model.py file to random.choices from Python's random module. On the same machine as before, these are the improved timings:

Loading... files/resnet_18_pretrained.pth
Starting training
Epoch 0 35.71% running, current time: 10.08 s
Epoch 0 76.19% running, current time: 20.57 s
{'EPOCH': 0, 'TIME': 25.470508575439453, 'LOSSES': {'G': tensor(4.6899, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.4034, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(1.1612, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0.2422, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(-0.0126, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(-5.4993, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(20.0668, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.2669, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 1 38.10% running, current time: 10.72 s
Epoch 1 80.95% running, current time: 21.37 s
{'EPOCH': 1, 'TIME': 25.23715090751648, 'LOSSES': {'G': tensor(0.2600, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.0286, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0.0725, device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(0.9562, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0041, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(3.8107, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(49.2529, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.1246, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}
Epoch 2 38.10% running, current time: 10.43 s
Epoch 2 78.57% running, current time: 20.45 s
{'EPOCH': 2, 'TIME': 25.29304313659668, 'LOSSES': {'G': tensor(1.5281, device='cuda:0', grad_fn=<SubBackward0>), 'D': tensor(1.2566, device='cuda:0', grad_fn=<AddBackward0>), 'Dfake': tensor(0., device='cuda:0', grad_fn=<DivBackward0>), 'Dreal': tensor(1.2566, device='cuda:0', grad_fn=<DivBackward0>), 'OCR_fake': tensor(0.0222, device='cuda:0', grad_fn=<MulBackward0>), 'OCR_real': tensor(8.0757, device='cuda:0', grad_fn=<MeanBackward0>), 'w_fake': tensor(127.7535, device='cuda:0', grad_fn=<MulBackward0>), 'w_real': tensor(6.3174, device='cuda:0', grad_fn=<MeanBackward0>), 'cycle': 0, 'lda1': 0, 'lda2': 0, 'KLD': 0}}

I suggest you do the same; a sketch of the change is below.
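For reference, here is a minimal sketch of the kind of swap described above. The variable names (style_images, num_examples) are placeholders rather than the actual identifiers in model.py, and the real call site may look different; the point is simply that the pure-Python sampler has much lower per-call overhead than the NumPy one when it is invoked many times per batch.

```python
import random
import numpy as np

# Placeholder data: stands in for whatever pool model.py samples from.
style_images = list(range(100))
num_examples = 15

# Original pattern (NumPy): samples indices with replacement.
idx = np.random.choice(len(style_images), num_examples)
batch_np = [style_images[i] for i in idx]

# Suggested replacement (pure Python): also samples with replacement,
# but avoids the per-call overhead of np.random.choice.
batch_py = random.choices(style_images, k=num_examples)

# Note: if the original call used replace=False (sampling without
# replacement), the pure-Python counterpart is random.sample instead.
batch_unique = random.sample(style_images, k=num_examples)
```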

yuyukuy534 commented 11 months ago

OK, I'll try it. Thank you very much for your reply and guidance.