CederGroupHub / chgnet

Pretrained universal neural network potential for charge-informed atomistic modeling https://chgnet.lbl.gov
https://doi.org/10.1038/s42256-023-00716-3

[Bug]: issue with cuda device allocation on HPC #145

Closed ajhoffman1229 closed 6 months ago

ajhoffman1229 commented 6 months ago

Email (Optional)

ajhoff29@mit.edu

Version

v0.3.4

What happened?

I am trying to run an optimization with CHGNet on a local compute cluster, but the calculations fail ~90% of the time. I get errors like the one in the log output below when CHGNet.load() tries to move the model to a CUDA device. On this HPC cluster it seems users can't request a specific CUDA device, but CHGNet.load() tries to do exactly that inside the load function (lines 709-718 in model/model.py, reproduced in the code snippet below). I have not had this issue with the MACE calculator, which simply uses device = "cuda".

Could the model = model.to(device) statement be wrapped in a try/except that catches the RuntimeError when this type of allocation is not permitted? Thanks so much!

Code snippet

# Determine the device to use
if use_device == "mps" and torch.backends.mps.is_available():
    device = "mps"
else:
    device = use_device or ("cuda" if torch.cuda.is_available() else "cpu")
    if device == "cuda":
        device = f"cuda:{cuda_devices_sorted_by_free_mem()[-1]}"
# Move the model to the specified device
model = model.to(device)
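For illustration, a minimal sketch of the kind of guard I have in mind (just a sketch on my end, assuming a fallback to the generic "cuda" device, or to CPU, is acceptable when a specific ordinal can't be selected):

# Sketch only, not a patch: fall back to a generic device when the
# chosen ordinal (e.g. "cuda:1") is not allowed on this node.
try:
    model = model.to(device)
except RuntimeError:
    # e.g. "CUDA error: invalid device ordinal" on clusters that
    # restrict which GPU a process may address directly
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)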

Log output

Traceback (most recent call last):
  File "/home/gridsan/ahoffman/htvs/djangochem/asecalcs/ase_calc_opt.py", line 87, in <module>
    opt = AseOpt.from_file(args.paramsfile)
  File "/home/gridsan/ahoffman/htvs/djangochem/asecalcs/ase_calc.py", line 78, in from_file
    return cls(jobspec)
  File "/home/gridsan/ahoffman/htvs/djangochem/asecalcs/ase_calc.py", line 70, in __init__
    self.calculator = self.get_calculator()
  File "/home/gridsan/ahoffman/htvs/djangochem/asecalcs/ase_calc.py", line 147, in get_calculator
    potential = CHGNet.load()
  File "/home/gridsan/ahoffman/.conda/envs/htvs/lib/python3.9/site-packages/chgnet/model/model.py", line 718, in load
    model = model.to(device)
  File "/home/gridsan/ahoffman/.conda/envs/htvs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1160, in to
    return self._apply(convert)
  File "/home/gridsan/ahoffman/.conda/envs/htvs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/gridsan/ahoffman/.conda/envs/htvs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 810, in _apply
    module._apply(fn)
  File "/home/gridsan/ahoffman/.conda/envs/htvs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 833, in _apply
    param_applied = fn(param)
  File "/home/gridsan/ahoffman/.conda/envs/htvs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1158, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

BowenD-UCB commented 6 months ago

Thanks for the report! Can you print the device that gets passed in at model.py line 718 (model = model.to(device))?

Can you try if potential = CHGNet.load(use_device='cuda') gives the same error?
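As a quick sanity check (plain PyTorch, nothing CHGNet-specific), could you also print how many devices your process can actually address versus the ordinal being requested?

import torch

# Number of GPUs this process is allowed to address
print("visible CUDA devices:", torch.cuda.device_count())
# An "invalid device ordinal" means model.to("cuda:N") used an N
# at or above this count.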

ajhoffman1229 commented 6 months ago

Including use_device='cuda' when loading the model doesn't resolve the issue. The device being passed to model.to() is something like cuda:1, depending on the output of cuda_devices_sorted_by_free_mem.

I think the issue on the HPC I'm using is that users don't have permission to select a specific CUDA device on the node they're allocated, so picking the one with the most free memory won't work (although it works great on a local machine with GPUs).
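In the meantime, a stopgap that seems to sidestep the sorting, based on the snippet above (just a sketch of what I plan to use, not a proposed fix): load the model on CPU, then move it to whichever GPU the scheduler actually assigned.

from chgnet.model.model import CHGNet

# use_device="cpu" skips the cuda_devices_sorted_by_free_mem branch,
# so no explicit ordinal is ever requested
model = CHGNet.load(use_device="cpu")
# let PyTorch pick the default (scheduler-assigned) CUDA device
model = model.to("cuda")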

Thanks!

BowenD-UCB commented 6 months ago

This should be fixed in: https://github.com/CederGroupHub/chgnet/commit/dd0dd076279b4c9d33b0cbddec9fd94f67ab4645

ajhoffman1229 commented 6 months ago

Excellent, thanks. I'll also work with the admins on the local cluster to see if there is a particular way I should identify the GPU a job should be sent to. In the meantime, I'll close this issue; I really appreciate your promptness in addressing it!

ignaciomigliaro commented 6 months ago

I am encountering the same issue, but now for the trainer. I believe cuda_devices_sorted_by_free_mem is the culprit, but the PR that fixed this for model.py did not address the trainer. Could you please apply the same fix to remove the sorting of CUDA devices there? Thanks!

BowenD-UCB commented 6 months ago

@ignaciomigliaro Solved in https://github.com/CederGroupHub/chgnet/commit/0b90f0cd6a6cef03229ece4797dc3a7b4b6ca51b

mstapelberg commented 1 month ago

Hello, I'm having similar issues on my local cluster. I'm trying to use a CHGNet potential to run NEB calculations with ASE. I think the problem comes from creating multiple calculators, which gives me the error below (when running the job with Slurm). Happy to create a new issue if that's preferable:

Traceback (most recent call last):
  File "/home/myless/Packages/structure_maker/Scripts/job_script.py", line 125, in <module>
    main(base_directory)
  File "/home/myless/Packages/structure_maker/Scripts/job_script.py", line 120, in main
    create_and_run_neb_files(base_directory, job_path, relax=True, vac_calculator=vac_calculator, neb_calculator=neb_calculator)
  File "/home/myless/Packages/structure_maker/Scripts/job_script.py", line 67, in create_and_run_neb_files
    barrier.neb_run(num_images=5,
  File "/home/myless/Packages/structure_maker/Modules/ModNEB_Barrier.py", line 326, in neb_run
    neb_calc = CHGNetCalculator(CHGNet.from_file(potential, use_device=neb_device))
  File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/chgnet/model/dynamics.py", line 92, in __init__
    self.model = (model or CHGNet.load()).to(self.device)
  File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

srun: error: gpu-rtx6000-02: task 0: Exited with exit code 1

When using the following test script:

import os
import torch
from ase.io import read
from chgnet.model.model import CHGNet
from chgnet.model.dynamics import CHGNetCalculator
import sys
sys.path.append('/home/myless/Packages/structure_maker/Modules')
from ModNEB_Barrier import NEB_Barrier

def create_and_run_neb_files_single_directory(base_directory, job_path, relax=True, num_images=5, vac_calculator=None, neb_calculator=None):
    num_failed = 0
    vac_sites = {}

    # Get all files in the base directory
    files = os.listdir(base_directory)

    # Organize files into start and end structures for each vac site
    for file in files:
        if file.startswith('structure_') and file.endswith('.vasp'):
            parts = file.split('_')
            vac_site = parts[6]
            if vac_site not in vac_sites:
                vac_sites[vac_site] = {'start': None, 'end': []}
            if 'start' in file:
                vac_sites[vac_site]['start'] = file
            elif 'end' in file:
                vac_sites[vac_site]['end'].append(file)

    # Process each vac site
    for vac_site, files in vac_sites.items():
        start_file = files['start']
        end_files = files['end']

        if start_file is None or not end_files:
            print(f"Skipping vac_site_{vac_site} due to missing start or end files.")
            continue

        # Read the start structure
        start_structure = read(os.path.join(base_directory, start_file))

        for end_file in end_files:
            # Read the end structure
            end_structure = read(os.path.join(base_directory, end_file))
            neb_dir = os.path.join(job_path, f'neb_vac_site_{vac_site}_to_{end_file.split("_")[-1].split(".")[0]}')
            os.makedirs(neb_dir, exist_ok=True)

            if os.path.exists(os.path.join(neb_dir, 'results.json')):
                print(f"NEB interpolation for vac_site_{vac_site} to {end_file} already completed.")
                continue

            # Define the NEB barrier
            barrier = NEB_Barrier(start=start_structure,
                                  end=end_structure,
                                  vasp_energies=[0, 0],
                                  composition=start_file.split('_')[3],
                                  structure_number=int(start_file.split('_')[1]),
                                  defect_number=int(vac_site),
                                  direction=end_file.split("_")[-1].split(".")[0],
                                  root_path=neb_dir)

            try:
                # Run NEB calculations
                barrier.neb_run(num_images=num_images,
                                potential=neb_calculator,
                                vac_potential=vac_calculator,
                                run_relax=relax,
                                num_steps=200)
                print(f"Successfully completed NEB for vac_site_{vac_site} to {end_file}.")
            except Exception as e:
                print(f"Failed to run NEB for vac_site_{vac_site} to {end_file}: {e}")
                num_failed += 1

    print(f"NEB interpolation completed with {num_failed} failures.")

def main(base_directory):
    # Define paths for job output and potential files
    job_path = os.path.abspath(base_directory)  # Using the base directory for output
    vac_pot_path = '/home/myless/Packages/structure_maker/Potentials/Vacancy_Train_Results/bestF_epoch89_e2_f28_s55_mNA.pth.tar'
    neb_pot_path = '/home/myless/Packages/structure_maker/Potentials/Jan_26_100_Train_Results/bestF_epoch75_e3_f23_s23_mNA.pth.tar'

    # Check CUDA availability
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please ensure you are running on a machine with GPUs.")

    # Use single GPU
    device = torch.device('cuda:0')
    print(f"Running on device: {device}")

    # Initialize calculators
    vac_calculator = CHGNet.from_file(vac_pot_path)
    neb_calculator = CHGNet.from_file(neb_pot_path)

    # Run NEB calculations
    create_and_run_neb_files_single_directory(base_directory, job_path, relax=True, vac_calculator=vac_calculator, neb_calculator=neb_calculator)

if __name__ == '__main__':
    base_directory = os.path.abspath(sys.argv[1])
    main(base_directory)

I get the following output:

Running on device: cuda:0
CHGNet v0.3.0 initialized with 412,525 parameters
CHGNet v0.3.0 initialized with 412,525 parameters
Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_47.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_51.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_69.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_7.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_85.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

NEB interpolation completed with 5 failures.
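Until I can upgrade to a chgnet version with the fix above, a workaround sketch I'm trying is to pin everything to the single visible GPU explicitly (assuming CHGNetCalculator accepts a use_device keyword the same way CHGNet.from_file does in my module above; the path below is a placeholder):

from chgnet.model.model import CHGNet
from chgnet.model.dynamics import CHGNetCalculator

# placeholder path, stands in for the .pth.tar files in my script
vac_pot_path = 'path/to/vacancy_potential.pth.tar'

# load the fine-tuned weights directly onto cuda:0 and hand the
# calculator the same device string, so dynamics.py does not try
# to pick a GPU by free memory on its own
# (use_device on CHGNetCalculator is an assumption on my part)
vac_model = CHGNet.from_file(vac_pot_path, use_device='cuda:0')
vac_calculator = CHGNetCalculator(vac_model, use_device='cuda:0')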