NVlabs / neuralangelo

Official implementation of "Neuralangelo: High-Fidelity Neural Surface Reconstruction" (CVPR 2023)
https://research.nvidia.com/labs/dir/neuralangelo/

Multiple GPU training is slow. #57

Open Ziba-li opened 1 year ago

Ziba-li commented 1 year ago

Dear author, thanks for looking into this question. When I trained toy_example on a server with eight 4090 GPUs, I found that training was not much faster than with a single card. Single-card training takes more than 100 hours to complete 500,000 epochs, so I expected eight-card distributed training to finish in a little over ten hours, but the actual speed is not much better. What is the reason for this?

xiemeilong commented 1 year ago

I only need 24 hours to train 500,000 epochs with a single 4090 GPU. However, when I use 4x 4090 GPUs, the training speed is actually only a quarter of using 1 GPU. I have the same issue with unofficial implementations. I guess it's because the model size is so big that the communication overhead between GPUs for syncing gradients is too large.
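
A minimal way to test that hypothesis (a sketch only; the toy MLP and script name are stand-ins, not part of the neuralangelo trainer) is to time a DDP backward step with and without the gradient all-reduce, using DistributedDataParallel's `no_sync()` context:

```python
# Minimal A/B test of DDP gradient-sync overhead on a toy MLP (not the
# neuralangelo trainer). Launch with e.g.: torchrun --nproc_per_node=4 ddp_sync_test.py
import os
import time

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def timed_steps(model, data, target, steps, skip_sync):
    """Average seconds per backward step, optionally skipping the all-reduce."""
    loss_fn = torch.nn.MSELoss()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        if skip_sync:
            # no_sync() disables the gradient all-reduce; the gap versus the
            # synced run approximates the communication overhead per step.
            with model.no_sync():
                loss_fn(model(data), target).backward()
        else:
            loss_fn(model(data), target).backward()
        model.zero_grad(set_to_none=True)
    torch.cuda.synchronize()
    return (time.time() - start) / steps


if __name__ == "__main__":
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)
    device = f"cuda:{rank}"
    # Large MLP as a stand-in for a model with many parameters to synchronize.
    model = DDP(
        torch.nn.Sequential(
            torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
        ).to(device),
        device_ids=[rank],
    )
    data = torch.randn(1024, 4096, device=device)
    target = torch.randn(1024, 4096, device=device)
    timed_steps(model, data, target, steps=10, skip_sync=False)  # warm-up
    with_sync = timed_steps(model, data, target, steps=50, skip_sync=False)
    without_sync = timed_steps(model, data, target, steps=50, skip_sync=True)
    if rank == 0:
        print(f"with all-reduce: {with_sync * 1e3:.1f} ms/step, "
              f"without: {without_sync * 1e3:.1f} ms/step")
    dist.destroy_process_group()
```

If the per-step time barely changes with `no_sync()`, the slowdown is probably not the all-reduce itself and may instead come from data loading or CPU-side synchronization.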

Ziba-li commented 1 year ago

> I only need 24 hours to train 500,000 epochs with a single 4090 GPU. However, when I use 4x 4090 GPUs, the training speed is actually only a quarter of using 1 GPU. I have the same issue with unofficial implementations. I guess it's because the model size is so big that the communication overhead between GPUs for syncing gradients is too large.

Did you run the following script to get the data?

```bash
EXPERIMENT_NAME=toy_example
PATH_TO_VIDEO=toy_example.MOV
SKIP_FRAME_RATE=24
SCENE_TYPE=object #{outdoor,indoor,object}
bash projects/neuralangelo/scripts/preprocess.sh ${EXPERIMENT_NAME} ${PATH_TO_VIDEO} ${SKIP_FRAME_RATE} ${SCENE_TYPE}
```

I ran it according to that process, but I don't understand why one epoch takes 2.3 seconds to train on a 4090 GPU; 500,000 epochs will take 319 hours.
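
(As a sanity check on that estimate: 2.3 s/iteration × 500,000 iterations ≈ 1,150,000 s ≈ 319 hours, so the wall-clock projection follows directly from the per-iteration time; the open question is why a single iteration takes 2.3 s on a 4090.)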

xiemeilong commented 1 year ago

I used my own data, just over 500 images.

Ziba-li commented 1 year ago

> I used my own data, just over 500 images.

The toy data I used actually only has 29 valid pictures. I don't understand why it takes 319 hours to complete 500,000 epochs.

smandava98 commented 1 year ago

+1. How can I make training faster? It's impractical to run this on multiple videos when each one takes so long. I'm looking for faster training on each video.

chenhsuanlin commented 1 year ago

Hi @Ziba-li, the multi-GPU setup (i.e. distributed training) enables training with larger batch sizes. It doesn't increase the per-iteration training speed, but it will be much faster to train each epoch. If you want to look into training with fewer iterations, this may be helpful.
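
(To make the arithmetic concrete, as a back-of-the-envelope estimate rather than a convergence guarantee: with the same per-GPU batch size, 8 GPUs process 8× as many rays per iteration, so the default 500k-iteration schedule could in principle be shortened to roughly 500,000 / 8 ≈ 62,500 iterations for a similar total number of samples, assuming the iteration count, max_iter in the config if that is the relevant field, is reduced accordingly and the resulting quality is verified.)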

Ziba-li commented 1 year ago

> Hi @Ziba-li, the multi-GPU setup (i.e. distributed training) enables training with larger batch sizes. It doesn't increase the per-iteration training speed, but it will be much faster to train each epoch. If you want to look into training with fewer iterations, this may be helpful.

Thank you for your reply, but I still don't understand why a single GPU needs such a long time to run 500,000 epochs, rather than the roughly 16 hours on an A100 24G needed to get the results.

otakudj commented 1 year ago

I have the same problem. When I run the lego demo on a single 3090 GPU, it takes ~9 s per epoch. However, when I train the model on 4x 3090 GPUs, it takes ~23 s, which is very strange.

prettybot commented 1 year ago

I just tried with an A100-PCIE-40GB on the AutoDL platform. With a single GPU, the training time for lego.mp4 is as follows: [screenshot]

For the hyperparameters, I only changed dict_size from 22 to 21.
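
(If dict_size is the base-2 log of the hash table size, as in the paper's default of 2^22 entries, then lowering it from 22 to 21 halves the hash-grid capacity per level, 2^22 = 4,194,304 → 2^21 = 2,097,152 entries, which reduces memory use at some potential cost in surface detail.)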

chenhsuanlin commented 1 year ago

I'm not sure about the communication overhead of the 4090, but we didn't see such an issue with the A100. If you could help pinpoint where the additional overhead is coming from (and verify that it is indeed coming from the gradient synchronization), I can put up a note on that.

Also, a minor note: the codebase measures in iterations (500k), not epochs.
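
One way to pinpoint it (a hedged sketch; `train_step` and `batches` are placeholders for whatever training loop you are running, not part of the neuralangelo API) is to profile a few iterations with torch.profiler and check whether NCCL all-reduce dominates CUDA time:

```python
# Hedged sketch: profile a few training iterations with torch.profiler to see
# how much CUDA time goes to NCCL all-reduce (DDP gradient sync) versus compute.
# `train_step` and `batches` are placeholders, not part of the neuralangelo API.
import torch
from torch.profiler import ProfilerActivity, profile, schedule


def profile_iterations(train_step, batches, logdir="profiler_logs"):
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=2, active=5),
        on_trace_ready=torch.profiler.tensorboard_trace_handler(logdir),
    ) as prof:
        for batch in batches:
            train_step(batch)   # one full forward/backward/optimizer step
            prof.step()         # advance the profiler schedule
    # Entries containing "nccl" / "allreduce" dominating CUDA time would point
    # to gradient-synchronization overhead; compute kernels dominating would not.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```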

prettybot commented 1 year ago

@chenhsuanlin thanks a lot for your explanation about the 500k part.

uu5208 commented 1 year ago

> I just tried with an A100-PCIE-40GB on the AutoDL platform. With a single GPU, the training time for lego.mp4 is as follows: [screenshot]
>
> For the hyperparameters, I only changed dict_size from 22 to 21.

Hello, I trained the lego demo with 2x A100-PCIE-80GB, but I still get poor timing results, as below: each epoch takes about 7 s. [screenshot]