Closed ajhoffman1229 closed 6 months ago
Thanks for the report!
Can you print the device that is being selected at `model.py` line 718, just before
`model = model.to(device)`?
Can you also try whether `potential = CHGNet.load(use_device='cuda')` gives the same error?
Including `use_device='cuda'` when loading the model doesn't seem to resolve the issue. The device being passed into the `model.to()` call is something like `cuda:1`, depending on the output of `cuda_devices_sorted_by_free_mem`.
I think the issue on the HPC I'm using is that users don't have permissions to select the specific cuda device on the node they're allocated, so trying to select the one with the most available memory won't work (although it works great on a local device with GPUs).
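In other words, the ordinal chosen by the free-memory sort can be one the job is not allowed to address. A guard along these lines (a hypothetical helper for illustration, not chgnet API) clamps such a request to what the process can actually see:

```python
def resolve_device(requested: str, visible_count: int) -> str:
    """Map a requested device string onto what this process can see.

    `visible_count` would come from torch.cuda.device_count() in real
    use. On clusters where the scheduler masks GPUs, visible ordinals
    always restart at 0, so 'cuda:1' is invalid when only one GPU is
    exposed to the job.
    """
    if not requested.startswith("cuda"):
        return requested
    if visible_count == 0:
        return "cpu"
    idx = int(requested.split(":", 1)[1]) if ":" in requested else 0
    # Fall back to the first visible GPU instead of raising
    # "invalid device ordinal".
    return requested if idx < visible_count else "cuda:0"
```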
Thanks!
This should be fixed in: https://github.com/CederGroupHub/chgnet/commit/dd0dd076279b4c9d33b0cbddec9fd94f67ab4645
Excellent, thanks. I'll also work with the admins on the local cluster to see if there is a particular way I should identify the GPU where a job should be sent. In the meantime, I'll close this issue, I really appreciate your promptness in addressing it!
I am encountering the same issue, but now for the trainer. I believe `cuda_devices_sorted_by_free_mem` is the culprit; the PR that solved the problem for `model.py` did not address the trainer. Could you please make the same change there to remove the sorting of CUDA devices? Thanks!
@ignaciomigliaro Solved in https://github.com/CederGroupHub/chgnet/commit/0b90f0cd6a6cef03229ece4797dc3a7b4b6ca51b
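For anyone hitting this on an older release: a cluster-side workaround (an assumption about the setup, not part of the linked fix) is to mask GPUs before torch or chgnet is imported, so any internal free-memory sorting can only ever see ordinal 0:

```python
import os

# Expose only the first GPU to this process. The index "0" is an
# example; under Slurm the scheduler typically sets this variable
# for the allocation already, hence setdefault rather than overwrite.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

# ...import torch / chgnet only after the mask is in place.
```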
Hello, I'm having similar issues on my local cluster. I'm trying to use a CHGNet potential to conduct NEB calculations with ASE. I think there's an issue with creating multiple calculators that's giving me this error (when running a job with slurm). Happy to create a new issue if that's preferable:
Traceback (most recent call last):
File "/home/myless/Packages/structure_maker/Scripts/job_script.py", line 125, in <module>
main(base_directory)
File "/home/myless/Packages/structure_maker/Scripts/job_script.py", line 120, in main
create_and_run_neb_files(base_directory, job_path, relax=True, vac_calculator=vac_calculator, neb_calculator=neb_calculator)
File "/home/myless/Packages/structure_maker/Scripts/job_script.py", line 67, in create_and_run_neb_files
barrier.neb_run(num_images=5,
File "/home/myless/Packages/structure_maker/Modules/ModNEB_Barrier.py", line 326, in neb_run
neb_calc = CHGNetCalculator(CHGNet.from_file(potential, use_device=neb_device))
File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/chgnet/model/dynamics.py", line 92, in __init__
self.model = (model or CHGNet.load()).to(self.device)
File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/home/myless/.mambaforge/envs/chgnet-11.7/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
srun: error: gpu-rtx6000-02: task 0: Exited with exit code 1
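"invalid device ordinal" here means the process asked for a CUDA index that is not visible to it (under Slurm GPU masking, visible devices are renumbered from 0). A tiny pre-flight check (a hypothetical helper, pure Python) turns that into a readable error before `.to()` is ever reached:

```python
def check_ordinal(device: str, visible_count: int) -> None:
    """Fail fast with a clear message instead of the opaque
    'CUDA error: invalid device ordinal'. In real use,
    `visible_count` would come from torch.cuda.device_count().
    """
    if device.startswith("cuda:"):
        idx = int(device.split(":", 1)[1])
        if idx >= visible_count:
            raise ValueError(
                f"Requested {device}, but only {visible_count} GPU(s) are "
                "visible to this process; ordinals restart at 0 when the "
                "scheduler masks devices via CUDA_VISIBLE_DEVICES."
            )
```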
When using the following test script:
```python
import os
import sys

import torch
from ase.io import read
from chgnet.model.model import CHGNet
from chgnet.model.dynamics import CHGNetCalculator

sys.path.append('/home/myless/Packages/structure_maker/Modules')
from ModNEB_Barrier import NEB_Barrier


def create_and_run_neb_files_single_directory(base_directory, job_path, relax=True,
                                              num_images=5, vac_calculator=None,
                                              neb_calculator=None):
    num_failed = 0
    vac_sites = {}
    # Get all files in the base directory
    files = os.listdir(base_directory)
    # Organize files into start and end structures for each vac site
    for file in files:
        if file.startswith('structure_') and file.endswith('.vasp'):
            parts = file.split('_')
            vac_site = parts[6]
            if vac_site not in vac_sites:
                vac_sites[vac_site] = {'start': None, 'end': []}
            if 'start' in file:
                vac_sites[vac_site]['start'] = file
            elif 'end' in file:
                vac_sites[vac_site]['end'].append(file)
    # Process each vac site
    for vac_site, files in vac_sites.items():
        start_file = files['start']
        end_files = files['end']
        if start_file is None or not end_files:
            print(f"Skipping vac_site_{vac_site} due to missing start or end files.")
            continue
        # Read the start structure
        start_structure = read(os.path.join(base_directory, start_file))
        for end_file in end_files:
            # Read the end structure
            end_structure = read(os.path.join(base_directory, end_file))
            neb_dir = os.path.join(job_path, f'neb_vac_site_{vac_site}_to_{end_file.split("_")[-1].split(".")[0]}')
            os.makedirs(neb_dir, exist_ok=True)
            if os.path.exists(os.path.join(neb_dir, 'results.json')):
                print(f"NEB interpolation for vac_site_{vac_site} to {end_file} already completed.")
                continue
            # Define the NEB barrier
            barrier = NEB_Barrier(start=start_structure,
                                  end=end_structure,
                                  vasp_energies=[0, 0],
                                  composition=start_file.split('_')[3],
                                  structure_number=int(start_file.split('_')[1]),
                                  defect_number=int(vac_site),
                                  direction=end_file.split("_")[-1].split(".")[0],
                                  root_path=neb_dir)
            try:
                # Run NEB calculations
                barrier.neb_run(num_images=num_images,
                                potential=neb_calculator,
                                vac_potential=vac_calculator,
                                run_relax=relax,
                                num_steps=200)
                print(f"Successfully completed NEB for vac_site_{vac_site} to {end_file}.")
            except Exception as e:
                print(f"Failed to run NEB for vac_site_{vac_site} to {end_file}: {e}")
                num_failed += 1
    print(f"NEB interpolation completed with {num_failed} failures.")


def main(base_directory):
    # Define paths for job output and potential files
    job_path = os.path.abspath(base_directory)  # Using the base directory for output
    vac_pot_path = '/home/myless/Packages/structure_maker/Potentials/Vacancy_Train_Results/bestF_epoch89_e2_f28_s55_mNA.pth.tar'
    neb_pot_path = '/home/myless/Packages/structure_maker/Potentials/Jan_26_100_Train_Results/bestF_epoch75_e3_f23_s23_mNA.pth.tar'
    # Check CUDA availability
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA is not available. Please ensure you are running on a machine with GPUs.")
    # Use single GPU
    device = torch.device('cuda:0')
    print(f"Running on device: {device}")
    # Initialize calculators
    vac_calculator = CHGNet.from_file(vac_pot_path)
    neb_calculator = CHGNet.from_file(neb_pot_path)
    # Run NEB calculations
    create_and_run_neb_files_single_directory(base_directory, job_path, relax=True,
                                              vac_calculator=vac_calculator,
                                              neb_calculator=neb_calculator)


if __name__ == '__main__':
    base_directory = os.path.abspath(sys.argv[1])
    main(base_directory)
```
I get the following error output:
Running on device: cuda:0
CHGNet v0.3.0 initialized with 412,525 parameters
CHGNet v0.3.0 initialized with 412,525 parameters
Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_47.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_51.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_69.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_7.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Number of CUDA devices available: 1
Failed to run NEB for vac_site_10 to structure_0_comp_Ti22V80Cr23_vac_site_10_end_site_85.vasp: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
NEB interpolation completed with 5 failures.
Email (Optional): ajhoff29@mit.edu
Version: v0.3.4
Which OS(es) are you using?
What happened?
I am trying to run an optimization with CHGNet on a local compute cluster, but the calculations are failing ~90% of the time. I am getting errors like the one in the log output below when `CHGNet.load()` tries to send the model to a CUDA device. It seems that on the local HPC cluster I can't request a specific CUDA device, but `CHGNet.load()` is trying to do that anyway within the load function (lines 709-718 in `model/model.py`, reproduced below in the code snippet). I have not had this issue with the MACE calculator, which simply uses `device = "cuda"`.

Could the `model = model.to(device)` statement be wrapped in a try/except that catches the `RuntimeError` if this type of allocation is not permitted? Thanks so much!

Code snippet
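A minimal sketch of the wrapper suggested here (the helper name and fallback policy are illustrative only, not chgnet code):

```python
def to_device_with_fallback(model, device: str):
    """Sketch of the proposed try/except around model.to(device).

    `model` is anything with a torch-style .to(device) method. If the
    specific ordinal (e.g. 'cuda:1') is not permitted, retry on the
    generic 'cuda' device (letting the driver pick the current one),
    or on 'cpu' for non-CUDA requests.
    """
    try:
        return model.to(device)
    except RuntimeError:
        # e.g. "CUDA error: invalid device ordinal" on clusters
        # that restrict which GPU a job may address
        fallback = "cuda" if device.startswith("cuda") else "cpu"
        return model.to(fallback)
```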
Log output