Describe the bug
When trying to train a model in a slurm environment, all subprocesses except the one with local_rank==0 fail with RuntimeError: CUDA error: invalid device ordinal.
To see if it might be an issue with an incorrectly set local_rank, I printed it before and after setting the device like this:
if args.distributed:
    print(f"Trying to set device to {local_rank}")
    torch.cuda.set_device(local_rank)
    print(f"Device set to {local_rank}")
which gave the expected output, with local_rank ranging from 0 to 7; however, only the process with local_rank==0 continued successfully.
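For reference, a minimal diagnostic sketch that could confirm how many GPUs each task actually sees (assuming local_rank comes from SLURM_LOCALID, which is how I understand it is derived here; the exact source of local_rank in the training script may differ):

import os
import torch

# Assumption: local_rank is taken from Slurm's per-node task index.
local_rank = int(os.environ.get("SLURM_LOCALID", 0))

# If CUDA_VISIBLE_DEVICES lists fewer devices than there are local ranks,
# torch.cuda.set_device(local_rank) raises "invalid device ordinal" for the
# higher ranks, which matches the failure described above.
print(f"local_rank={local_rank}, "
      f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}, "
      f"visible device count={torch.cuda.device_count()}")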
As a result, the whole training fails to proceed.
Other neural networks (usually trained using pytorch-lightning) train fine in the slurm environment using multiple GPUs, as does training the MACE model on a single GPU.
EDIT: I believe this is resolved for me. The mistake was on my side: I specified --gpus-per-task instead of --gpus-per-node. However, I think this should either be mentioned in the documentation or be handled by the code.
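To illustrate what "handled by the code" could look like, here is a minimal sketch of a guard check; the function name and error message are hypothetical and not part of MACE:

import torch

def set_local_device(local_rank: int) -> None:
    # Hypothetical guard: fail with a clear message instead of the opaque
    # "invalid device ordinal" when the job exposes fewer GPUs than there
    # are local ranks (e.g. --gpus-per-task=1 instead of --gpus-per-node).
    visible = torch.cuda.device_count()
    if local_rank >= visible:
        raise RuntimeError(
            f"local_rank {local_rank} requested, but only {visible} GPU(s) are "
            "visible to this task; check CUDA_VISIBLE_DEVICES and the "
            "--gpus-per-* flags in your slurm script"
        )
    torch.cuda.set_device(local_rank)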