ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Distributed Training: invalid device ordinal #435

Closed: dbitterlich closed this issue 1 week ago

dbitterlich commented 1 month ago

Describe the bug
When trying to train a model in a slurm environment, all subprocesses except the one with local_rank==0 fail with RuntimeError: CUDA error: invalid device ordinal.

To check whether local_rank might be set incorrectly, I printed it before and after setting the device, like this:

    if args.distributed:
        print(f"Trying to set device to {local_rank}")
        torch.cuda.set_device(local_rank)
        print(f"Device set to {local_rank}")

which gave the expected output, with local_rank values ranging from 0 to 7; however, only the process with local_rank==0 continued successfully.

As a result, the whole training fails to proceed.

Other neural networks (usually trained with pytorch-lightning) train fine on multiple GPUs in the same slurm environment, as does training the MACE model on a single GPU.
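To narrow this down, a standalone check of what each task can actually see is helpful. The snippet below is a minimal sketch, not part of the mace code; it assumes a SLURM launcher where SLURM_LOCALID carries the local rank. It prints the visible GPU count per task, which makes the mismatch behind "invalid device ordinal" obvious:

    import os

    import torch

    # Standalone diagnostic (hypothetical, not part of mace): compare the local
    # rank assigned to this task with the GPUs the task can actually see.
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))  # assumption: SLURM launcher
    visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")
    print(f"local_rank={local_rank} "
          f"device_count={torch.cuda.device_count()} "
          f"CUDA_VISIBLE_DEVICES={visible}")

    # With --gpus-per-task=1, SLURM typically restricts each task to a single
    # visible device (ordinal 0), so set_device fails for every local_rank > 0.
    if local_rank >= torch.cuda.device_count():
        raise RuntimeError(
            f"local_rank {local_rank} is not a visible CUDA device "
            f"(only {torch.cuda.device_count()} visible)"
        )
    torch.cuda.set_device(local_rank)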

slurm script used:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8
#SBATCH --ntasks=8
#SBATCH --mem=100G
#SBATCH --cpus-per-task=8
#SBATCH --time=48:00:00
#SBATCH --gpus-per-task=1
#SBATCH --job-name=mace_distr

export NUMEXPR_MAX_THREADS=$SLURM_CPUS_PER_TASK

srun mace_run_train --distributed # ... other training parameters omitted

EDIT: I believe this is resolved for me. The mistake was on my side: I specified --gpus-per-task instead of --gpus-per-node. However, I think this should either be mentioned in the documentation or be handled by the code.
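For reference, a sketch of the corrected submission script under the setup described above (one node with 8 GPUs; resource numbers carried over from the original script). With --gpus-per-node all eight GPUs are visible to every task on the node, so torch.cuda.set_device(local_rank) can address ordinals 0 to 7:

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --ntasks=8
    #SBATCH --mem=100G
    #SBATCH --cpus-per-task=8
    #SBATCH --time=48:00:00
    #SBATCH --gpus-per-node=8        # instead of --gpus-per-task=1
    #SBATCH --job-name=mace_distr

    export NUMEXPR_MAX_THREADS=$SLURM_CPUS_PER_TASK

    srun mace_run_train --distributed # ... other training parameters omitted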