I'm not sure what caused your problem. I trained HQ-384 with --batch_size=4
on four NVIDIA GTX 1080 Ti GPUs, which means each GPU holds exactly one image per iteration. My GPU usage is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 29%   57C    P2    82W / 250W |  11153MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:17:00.0 Off |                  N/A |
| 41%   74C    P2    79W / 250W |   7865MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:65:00.0 Off |                  N/A |
| 35%   60C    P2    92W / 250W |   7865MiB / 11176MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:B3:00.0 Off |                  N/A |
| 29%   64C    P2    79W / 250W |   7865MiB / 11178MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      5437      C   python3                                    11143MiB |
|    1      5437      C   python3                                     7855MiB |
|    2      5437      C   python3                                     7855MiB |
|    3      5437      C   python3                                     7855MiB |
+-----------------------------------------------------------------------------+
Did you set --batch_size to the same value as the number of GPUs you have?
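For reference, tower-style data parallelism splits the global batch evenly across the GPUs, roughly like the sketch below. This is only an illustration of the general pattern, not this repository's actual code; toy_model is a stand-in for the real generator/discriminator graph, and it assumes a machine with four GPUs.

# Minimal sketch of tower-style data parallelism (illustration only).
# With --batch_size=4 on 4 GPUs, each tower gets exactly one 384x384 image,
# which is why --batch_size should be a multiple of the GPU count.
import tensorflow as tf

N_GPUS = 4
GLOBAL_BATCH = 4                                  # value passed via --batch_size
images = tf.random.normal([GLOBAL_BATCH, 384, 384, 3])

def toy_model(shard):
    # stand-in for the per-tower forward pass
    return tf.reduce_mean(tf.square(shard))

tower_losses = []
for i, shard in enumerate(tf.split(images, N_GPUS, axis=0)):
    with tf.device('/gpu:%d' % i):                # shard shape: [1, 384, 384, 3]
        tower_losses.append(toy_model(shard))

loss = tf.reduce_mean(tower_losses)               # tower losses averaged before the update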
Thanks, I can run it, but the network still doesn't work. Have you modified the parameters?
What do you mean by "it doesn't work"? Is your network not converging? For example, are the G and D losses still extremely high after an epoch of training?
My dc_loss and gc_loss don't drop, and gf_loss is very large. The attributes of the generated image have not changed.
In my experiment, I trained it with a batch size of 8 on 4 GPUs, and the attributes did not become clear until around the 8th epoch. It looks like you've trained for 8,000 iterations, which is only around 1 epoch if you also trained with a batch size of 4. Training took me one and a half days on 4 GPUs, so please be patient.
epochs = iterations * batch_size / dataset_size
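As a quick sanity check of the relation above (the CelebA-HQ training-set size of roughly 30,000 images is an assumption here, not a number stated in this thread):

# Worked example: how many epochs 8,000 iterations at batch size 4 amount to.
iterations = 8000
batch_size = 4
dataset_size = 30000          # assumed CelebA-HQ training-set size

epochs = iterations * batch_size / dataset_size
print(round(epochs, 2))       # ~1.07, i.e. roughly one epoch, as estimated above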
To be honest, if you have only two GPUs, I wouldn't bother: multi-GPU training is much less efficient than single-GPU training, so I recommend using multiple GPUs only if you have at least four.
p.s. The following is my training history of 384_shortcut1_inject1_none_hq:
Thank you very much for your advice. I was too eager to see the results; I will train patiently.
Hello! I use multi_gpu to train on 384×384 images from CelebA-HQ, but I still get an error even when each GPU has only one image: RuntimeError: CUDA out of memory. I am using NVIDIA GP102 [TITAN Xp] cards.