DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License

multi-gpu training fails #2

Closed: ShoufaChen closed this issue 1 year ago

ShoufaChen commented 2 years ago

Hello,

Running

python -m torch.distributed.launch --nproc_per_node=4 script/downstream.py -c config/downstream/EC/gearnet.yaml --gpus [0,1,2,3]

fails with the following log:

20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
20:21:09   Extracting /home/chenshoufa/scratch/protein-datasets/EnzymeCommission.zip to /home/chenshoufa/scratch/protein-datasets
Loading /home/chenshoufa/scratch/protein-datasets/EnzymeCommission/enzyme_commission.pkl.gz:  64%|██████████████████████████████████████████▉                        | 11854/18515 [08:49<20:55,  5.30it/s]Killing subprocess 1350247
Killing subprocess 1350248
Killing subprocess 1350249
Killing subprocess 1350250
Traceback (most recent call last):
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/chenshoufa/anaconda3/envs/gear/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/chenshoufa/anaconda3/envs/gear/bin/python', '-u', 'script/downstream.py', '--local_rank=3', '-c', 'config/downstream/EC/gearnet.yaml', '--gpus', '[0,1,2,3]']' died with <Signals.SIGKILL: 9>.

Could you help me with this issue?

ShoufaChen commented 2 years ago

Hello, @Oxer11

When using 4 GPUs, it looks like memory runs out during the data-loading stage.


Oxer11 commented 2 years ago

Hi Shoufa!

Thanks for raising this issue! I think this is because each process loads the whole dataset, so loading it four times takes a very large amount of memory. Here are a few suggestions:

  1. use a machine with more CPU memory (240G should be enough)
  2. use 2 GPUs instead of 4
  3. turn on the lazy option when loading the dataset, which avoids loading the whole dataset into CPU memory up front (see the sketch below)
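
For reference, here is a minimal sketch of option 3, assuming the TorchDrug EnzymeCommission dataset accepts a lazy keyword (the exact argument and config key may differ in your installed version, so please verify before relying on it):

from torchdrug import datasets, transforms

# Hedged sketch: build the EC dataset lazily so protein graphs are constructed
# on demand instead of all being preloaded into CPU memory.
# Assumption: this torchdrug version forwards the lazy flag to the dataset loader.
transform = transforms.ProteinView(view="residue")
dataset = datasets.EnzymeCommission(
    "~/scratch/protein-datasets/",
    test_cutoff=0.95,
    lazy=True,          # defer graph construction until items are accessed
    transform=transform,
)

If you prefer the config-driven path, the equivalent (if supported by your version) would be adding a lazy: True entry under the dataset section of gearnet.yaml.
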
ShoufaChen commented 2 years ago

Hi @Oxer11,

Thanks for your reply. I was wondering whether it is necessary to load an independent copy of the data for each process, i.e., is it possible for all processes to share the loaded data?

ShoufaChen commented 2 years ago

Hi, @Oxer11

How much memory does GearNet need for the AlphaFold dataset at the pretraining stage?

Oxer11 commented 2 years ago

Hi!

Our cluster has 500G of memory, which is enough to load the EC dataset and the AlphaFold DB splits four times. This follows the module-level data parallelism protocol in PyTorch, where each process holds its own copy of the data. To save memory, you can shrink the size of each split in the AlphaFold DB.
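
To illustrate why memory scales with the number of GPUs: the launcher spawns one process per GPU, and each process runs the full training script, so each one builds its own copy of the dataset. Below is a minimal, generic sketch of that pattern (not the repo's actual downstream.py; the toy tensor dataset and linear model stand in for the protein dataset and GearNet model):

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Every process launched by torch.distributed.launch (or torchrun) executes this
# whole script, so the dataset constructed here exists once per GPU process;
# CPU memory use therefore grows with the number of GPUs.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Stand-in for the real protein dataset: each rank holds its own full copy.
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)   # each rank only iterates over its own shard
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

model = nn.Linear(16, 1).cuda(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])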