NVIDIA / vid2vid

Pytorch implementation of our method for high-resolution (e.g. 2048x1024) photorealistic video-to-video translation.

Significant overhead on the first GPU when performing multi-GPU training #93

Open ligua opened 5 years ago

ligua commented 5 years ago

Hi, I am using the script provided for multi-GPU training. However, there seems to be significant overhead on the first GPU. May I ask whether this is normal?

The following is the GPU memory usage (I am training on 4x 12GB GeForce GTX TITAN X):

memory.used [MiB], memory.free [MiB]
11385 MiB, 822 MiB
2976 MiB, 9231 MiB
2976 MiB, 9231 MiB
2976 MiB, 9231 MiB

The following is the script I use:

CUDA_VISIBLE_DEVICES=$GPUS python train.py --name label2city_512 --label_nc 35 --loadSize 256 --use_instance --fg --gpu_ids 0,1,2,3 --n_gpus_gen 3 --n_frames_total 6 --max_frames_per_gpu 1 --debug

May I ask why there is such a significant overhead on the first GPU? Is it caused by DataParallel?
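To illustrate what I suspect, here is a minimal standalone sketch (not the vid2vid code, just a toy model) showing how nn.DataParallel keeps the master replica on gpu_ids[0] and gathers the outputs and gradients there, so that device ends up carrying extra memory:

```python
# Toy demonstration (assumes >= 2 visible GPUs): nn.DataParallel holds the
# master parameters on device_ids[0] and gathers outputs/gradients there,
# so the first GPU shows higher memory usage than the others.
import torch
import torch.nn as nn

device_ids = list(range(torch.cuda.device_count()))  # e.g. [0, 1, 2, 3]

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.Conv2d(64, 3, 3, padding=1),
).cuda(device_ids[0])
model = nn.DataParallel(model, device_ids=device_ids)

x = torch.randn(8, 3, 256, 256, device=f"cuda:{device_ids[0]}")
out = model(x)       # input is scattered to all GPUs, output gathered on device 0
loss = out.mean()
loss.backward()      # gradients are reduced back onto device 0 as well

for d in device_ids:
    print(f"cuda:{d}: {torch.cuda.memory_allocated(d) / 2**20:.1f} MiB allocated")
```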

Also, I always get a CUDA out-of-memory error when I set loadSize to 512. May I ask why? Was the provided script originally designed for 24GB GPUs?

Thank you so much. Best regards

ligua commented 5 years ago

I have also tried the single-GPU script, which failed to run on one GeForce GTX TITAN X. Is the script designed for a 24GB GPU?

Thanks and Best Regards

pranavraikote commented 4 years ago

I'm also facing this issue with 8x NVIDIA V100s in a cloud environment. The first GPU is more than 80% utilized, but the rest are highly under-utilized.

ligua commented 4 years ago

I am guessing there is a synchronization issue in the code.

pranavraikote commented 4 years ago

I am guessing there is a synchronization issue in the code.

Yeah, maybe.
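If the bottleneck really is nn.DataParallel gathering everything on device 0, a per-process DistributedDataParallel setup would keep memory balanced, since every process owns one GPU and only gradients are all-reduced. A rough standalone sketch (not from this repo, toy model and names are placeholders) of what that would look like:

```python
# Sketch of a DistributedDataParallel setup: one process per GPU, each with
# its own model replica, so no single device accumulates gathered outputs.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),
        nn.Conv2d(64, 3, 3, padding=1),
    ).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(2, 3, 256, 256, device=f"cuda:{local_rank}")

    out = model(x)       # forward/backward stay on this process's own GPU
    loss = out.mean()
    loss.backward()      # gradients are all-reduced across processes
    opt.step()

    print(f"rank {dist.get_rank()}: "
          f"{torch.cuda.memory_allocated(local_rank) / 2**20:.1f} MiB allocated")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Each process would be launched with something like `torchrun --nproc_per_node=8 ddp_sketch.py`, so every GPU holds its own replica and the memory usage stays roughly even across devices. Adapting the actual vid2vid training loop to this pattern would of course take more work than this sketch shows.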