A proposed solution would be to make the default -1 and, once the distributed process group is initialized, infer at runtime which GPUs are visible and therefore should be used.
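A minimal sketch of that idea (the helper and flag wiring below are illustrative, not the actual patch):

```python
import argparse
import os

import torch


def infer_num_training_procs(requested: int) -> int:
    """Resolve a requested value of -1 to the number of visible GPUs."""
    if requested != -1:
        return requested
    if torch.cuda.is_available():
        # device_count() already respects CUDA_VISIBLE_DEVICES, so this
        # returns only the GPUs this process is actually allowed to use.
        return torch.cuda.device_count()
    # CPU fallback; any sensible default policy could go here.
    return os.cpu_count() or 1


parser = argparse.ArgumentParser()
parser.add_argument('--num_training_procs', type=int, default=-1,
                    help='-1: infer from the GPUs visible at runtime')
args = parser.parse_args()
num_training_procs = infer_num_training_procs(args.num_training_procs)
```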
The number of training processes on each node is set by `num_training_procs`, which applies to both CPU and GPU training. For GPU training, `CUDA_VISIBLE_DEVICES` controls which devices your application uses, and in this case the number of devices listed in `CUDA_VISIBLE_DEVICES` should be the same as `num_training_procs`.
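Concretely, with `CUDA_VISIBLE_DEVICES=0,1` two devices are visible, so `num_training_procs` should also be 2. A quick way to see what PyTorch considers visible (assuming a CUDA build):

```python
import os
import torch

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'  # must be set before CUDA is initialized
print(torch.cuda.device_count())            # -> 2 on a machine with >= 2 GPUs
```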
Yes; however, there is no warning or reset if the number of devices in `CUDA_VISIBLE_DEVICES` is smaller than `num_training_procs`, which results in an error.
The check has been added in #24.
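For reference, a guard of the kind requested above could look roughly like this (a sketch of the idea, not necessarily what the linked PR does):

```python
import os
import torch


def check_num_training_procs(num_training_procs: int) -> None:
    """Fail fast if more training processes are requested than visible GPUs."""
    num_visible = torch.cuda.device_count()  # honors CUDA_VISIBLE_DEVICES
    if num_visible < num_training_procs:
        raise ValueError(
            f"num_training_procs={num_training_procs} exceeds the "
            f"{num_visible} visible GPU(s) "
            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')})"
        )
```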
🐛 Describe the bug
Currently, the IGBH example sets the number of spawned processes based solely on an argparse variable. For multi-GPU training this is not intuitive: the example calls make it look as though the script infers the number of processes from `CUDA_VISIBLE_DEVICES`, but in fact the value comes only from the argparse input, which defaults to 2.
https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/dist_train_rgnn.py#L254
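A stripped-down illustration of the pattern being reported (hypothetical worker, not the actual script):

```python
import argparse

import torch.multiprocessing as mp


def run_training_proc(local_rank: int, num_procs: int) -> None:
    # Placeholder worker; the real script builds the distributed trainer here.
    print(f"worker {local_rank}/{num_procs} started")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_training_procs', type=int, default=2)
    args = parser.parse_args()
    # The spawn count comes only from the argparse value; CUDA_VISIBLE_DEVICES
    # is never consulted, so a mismatch surfaces only as a CUDA error later.
    mp.spawn(run_training_proc,
             args=(args.num_training_procs,),
             nprocs=args.num_training_procs,
             join=True)
```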
Environment