DeepGraphLearning / NBFNet

Official implementation of Neural Bellman-Ford Networks (NeurIPS 2021)
MIT License

Seems unable to utilize multiple GPUs #11

Open jerermyyoung opened 2 years ago

jerermyyoung commented 2 years ago

Hi there.

I tried running this code on one of my machines, which has four RTX 3090 GPUs (24GB of memory each):

python -m torch.distributed.launch --nproc_per_node=4 script/run.py -c config/inductive/wn18rr.yaml --gpus [0,1,2,3]

I did not change any other part of this repo. However, I encountered a CUDA error saying that more GPU memory is needed. Later I changed the command as follows:

python script/run.py -c config/inductive/wn18rr.yaml --gpus [0]

and ran it on a machine with a single A100 GPU with 40GB of memory. The code ran successfully, using roughly 32GB of GPU memory. I am really puzzled by this: why does the code not properly utilize the total 24GB × 4 = 96GB of GPU memory, and why does it still report a memory issue? Is there something wrong with my setup?

KiddoZhu commented 1 year ago

Hi! Sorry for the late reply.

In the multi-GPU setup, the effective batch size is proportional to the number of GPUs. That is, each GPU processes the same batch size (and thus uses the same GPU memory) as in the single-GPU case. Since our default hyperparameter configuration is tuned on 32GB V100 GPUs, it is possible that the configuration can't fit into 24GB of GPU memory. You may reduce the batch size to make it fit.
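To illustrate the point above, here is a minimal sketch of how data-parallel training scales the batch: the configured batch size is applied per process (per GPU), so per-device memory stays constant while the effective batch grows. The function name and the batch size of 64 are illustrative, not taken from the repo's config.

```python
def effective_batch_size(per_gpu_batch_size: int, num_gpus: int) -> int:
    """Hypothetical helper: in data-parallel launches (e.g. via
    torch.distributed.launch with --nproc_per_node), each process keeps
    the full configured batch, so the effective batch scales linearly
    with the number of GPUs while per-GPU memory use is unchanged."""
    return per_gpu_batch_size * num_gpus

# Single-GPU run: one process, configured batch size.
print(effective_batch_size(64, 1))  # 64

# Four-GPU run: each of the 4 GPUs still holds a full 64-sample batch
# in memory, so a config tuned for a 32GB card can OOM on 24GB cards.
print(effective_batch_size(64, 4))  # 256
```

This is why adding GPUs does not reduce per-GPU memory pressure: the memory needed on each card is set by the per-GPU batch size, which is what you would lower in the YAML config to fit 24GB cards.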