alibaba / graphlearn-for-pytorch

A GPU-accelerated graph learning library for PyTorch, facilitating the scaling of GNN training and inference.
Apache License 2.0

Examples have fixed default number of processes #22

Closed · kkranen closed this 1 year ago

kkranen commented 1 year ago

🐛 Describe the bug

Currently, the IGBH example sets the number of spawned processes based solely on an argparse argument. For multi-GPU training this is unintuitive: from the example invocations it looks as though the script infers the number of processes from CUDA_VISIBLE_DEVICES, but in fact the value comes only from the argparse input, which defaults to 2.

https://github.com/alibaba/graphlearn-for-pytorch/blob/main/examples/igbh/dist_train_rgnn.py#L254
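For reference, the pattern being described looks roughly like the following. This is a simplified, hypothetical reconstruction, not the actual code behind the link; the argument name follows the discussion in this thread:

```python
import argparse
import torch.multiprocessing as mp

def run(rank, args):
    print(f'training process {rank} started')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_training_procs', type=int, default=2,
                        help='Number of training processes per node')
    args = parser.parse_args()
    # The spawn count comes only from the parsed argument; CUDA_VISIBLE_DEVICES
    # is never consulted, so the default of 2 applies regardless of GPU count.
    mp.spawn(run, args=(args,), nprocs=args.num_training_procs)
```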

Environment

kkranen commented 1 year ago

A proposed solution would be to leave the default as -1 and, once the distributed process group is initialized, infer at runtime which GPUs are visible and should therefore be used.
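A minimal sketch of that inference, using a hypothetical `resolve_num_procs` helper (note that `torch.cuda.device_count()` already respects CUDA_VISIBLE_DEVICES):

```python
import os
import torch

def resolve_num_procs(num_training_procs: int) -> int:
    """Resolve the sentinel default -1 to the number of visible GPUs."""
    if num_training_procs == -1:
        visible = os.environ.get('CUDA_VISIBLE_DEVICES')
        if visible is not None:
            # Count the devices explicitly listed in the environment variable.
            return len(visible.split(','))
        # Fall back to however many GPUs PyTorch can see.
        return torch.cuda.device_count()
    return num_training_procs
```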

baoleai commented 1 year ago

The number of training processes on each node is set by num_training_procs, which applies to both CPU and GPU training. For GPU training, CUDA_VISIBLE_DEVICES controls which devices your application uses; in this case, the number of devices listed in CUDA_VISIBLE_DEVICES should match num_training_procs.
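Illustratively, a two-GPU launch under this convention would export CUDA_VISIBLE_DEVICES=0,1 and set num_training_procs to 2; PyTorch then reports exactly two devices:

```python
import torch

# With CUDA_VISIBLE_DEVICES=0,1 exported before launch, PyTorch sees
# exactly two devices, matching num_training_procs=2.
print(torch.cuda.device_count())  # -> 2
```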

kkranen commented 1 year ago

Yes; however, there is no warning or automatic adjustment when CUDA_VISIBLE_DEVICES exposes fewer devices than num_training_procs, which results in an error.
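For context, the kind of guard being asked for might look like this. This is a sketch only, with a hypothetical `check_device_count` helper; see #24 for the actual fix:

```python
import os
import torch

def check_device_count(num_training_procs: int) -> None:
    """Fail fast if more training processes are requested than visible GPUs."""
    n_visible = torch.cuda.device_count()  # respects CUDA_VISIBLE_DEVICES
    if num_training_procs > n_visible:
        raise ValueError(
            f'num_training_procs={num_training_procs} exceeds the {n_visible} '
            f'visible GPU(s) (CUDA_VISIBLE_DEVICES='
            f'{os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")})'
        )
```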

baoleai commented 1 year ago

The check has been added in #24.