I'm not sure what caused your problem. I trained HQ-384 with --batch_size=4
on four NVIDIA GTX 1080 Ti GPUs, which means each GPU holds exactly one image per iteration. My GPU usage is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 29%   57C    P2    82W / 250W |  11153MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 41%   74C    P2    79W / 250W |   7865MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 35%   60C    P2    92W / 250W |   7865MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 29%   64C    P2    79W / 250W |   7865MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5437      C   python3                                    11143MiB |
|    1      5437      C   python3                                     7855MiB |
|    2      5437      C   python3                                     7855MiB |
|    3      5437      C   python3                                     7855MiB |
+-----------------------------------------------------------------------------+
Did you set --batch_size to the same value as the number of GPUs you have?
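For reference, tower-style data parallelism splits the global batch evenly across the GPUs, roughly like the sketch below. This is only an illustration of the general pattern, not this repository's actual code; toy_model is a stand-in for the real generator/discriminator graph, and it assumes a machine with four GPUs.

# Minimal sketch of tower-style data parallelism (illustration only).
# With --batch_size=4 on 4 GPUs, each tower gets exactly one 384x384 image,
# which is why --batch_size should be a multiple of the GPU count.
import tensorflow as tf

N_GPUS = 4
GLOBAL_BATCH = 4                                  # value passed via --batch_size
images = tf.random.normal([GLOBAL_BATCH, 384, 384, 3])

def toy_model(shard):
    # stand-in for the per-tower forward pass
    return tf.reduce_mean(tf.square(shard))

tower_losses = []
for i, shard in enumerate(tf.split(images, N_GPUS, axis=0)):
    with tf.device('/gpu:%d' % i):                # shard shape: [1, 384, 384, 3]
        tower_losses.append(toy_model(shard))

loss = tf.reduce_mean(tower_losses)               # tower losses averaged before the update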
Thanks, I can run it, but the network still doesn't work. Have you modified the parameters?
What do you mean by "it doesn't work"? Is your network not converging? For example, are the G and D losses still extremely high after an epoch of training?
My dc_loss and gc_loss don't drop, and gf_loss is very large. The attributes of the generated image have not changed.
In my experiment, I trained it with a batch size of 8 on 4 GPUs, and the attributes did not become clear until around the 8th epoch. It looks like you've trained for 8,000 iterations, which is only around 1 epoch if you also trained with a batch size of 4. Training took me one and a half days on 4 GPUs, so please be patient.
epochs = iterations * batch_size / dataset_size
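As a quick sanity check of the relation above (the CelebA-HQ training-set size of roughly 30,000 images is an assumption here, not a number stated in this thread):

# Worked example: how many epochs 8,000 iterations at batch size 4 amount to.
iterations = 8000
batch_size = 4
dataset_size = 30000          # assumed CelebA-HQ training-set size

epochs = iterations * batch_size / dataset_size
print(round(epochs, 2))       # ~1.07, i.e. roughly one epoch, as estimated above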
To be honest, if you have only two GPUs, I wouldn't bother: multi-GPU training is much less efficient than single-GPU training, so I recommend using multiple GPUs only if you have at least four.
p.s. The following is my training history of 384_shortcut1_inject1_none_hq:
Thank you very much for your advice. I was too eager to see the results; I will train patiently.
Hello! I use multi_gpu to train on 384×384 images from CelebA-HQ, but I still get an error even when each GPU has only one image: RuntimeError: CUDA out of memory. I am using NVIDIA GP102 [TITAN Xp] cards.