OpenFreeEnergy / openfe

The Open Free Energy toolkit
https://docs.openfree.energy
MIT License

CalledProcessError: 9 #854

Open raitis-b opened 1 month ago

raitis-b commented 1 month ago

Hi,

I tried to run the OpenFE tutorial on my laptop and everything worked fine, but when I ran it on our cluster I hit an issue. On a GPU node it gave an error that has been mentioned before (GPU in 'Exclusive_Process' mode (or Prohibited), one context is allowed per device. This may prevent some openmmtools features from working. GPU must be in 'Default' compute mode). While we fix that, I wanted to run it without the GPU, but this led to another error:

$ openfe quickrun transformations/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent.json -o results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node.json -d results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node

Loading file...
Planning simulations for this edge...
Starting the simulations for this edge...
Done with all simulations! Analyzing the results....
Here is the result:
dG = None ± None

Error: The protocol unit 'lig_ejm_31 to lig_ejm_46 repeat 2 generation 0' failed with the error message: CalledProcessError: 9

Details provided in output.

The only output is the .json file that is attached.

Cheers, Raitis

easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_no_gpu.json

mikemhenry commented 1 month ago

@raitis-b

Thank you for the bug report! Looking at the JSON file and cleaning it up a bit (I just used Firefox to view it; it does a decent job of rendering these JSON files), it looks like

Traceback (most recent call last):
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/gufe/protocols/protocolunit.py", line 320, in execute
    outputs = self._execute(context, **inputs)
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 1127, in _execute
    log_system_probe(logging.INFO, paths=[ctx.scratch])
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 502, in log_system_probe
    sysinfo = _probe_system(pl_paths)['system information']
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 471, in _probe_system
    gpu_info = _get_gpu_info()
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 340, in _get_gpu_info
    nvidia_smi_output = subprocess.check_output(
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.

the nvidia-smi command failed. Could you run nvidia-smi on the same machine/node where you ran the simulation and report back what it does? Code 9 is SIGKILL, so I think that command got killed by some other process.

Regardless, we want to make sure this command doesn't prevent a simulation from running, so we need to enhance our error handling of it.
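As a rough illustration of that kind of error handling (a minimal sketch, not the actual openfe fix): wrap the nvidia-smi call so that any failure is logged and treated as "no GPU information available" instead of aborting the protocol unit. The helper name `probe_gpu_info_safely` and its `cmd` parameter are hypothetical; the query string is copied from the traceback above.

```python
import logging
import subprocess
import sys

logger = logging.getLogger(__name__)

# Command copied verbatim from the traceback in this issue.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,"
    "utilization.memory,memory.total,driver_version,",
    "--format=csv",
]


def probe_gpu_info_safely(cmd=None):
    """Return the command's stdout as text, or None if it could not be run.

    Hypothetical helper for illustration only; it is not part of openfe.
    """
    cmd = NVIDIA_SMI_CMD if cmd is None else cmd
    try:
        return subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True)
    except OSError as err:
        # Covers a missing or non-executable binary (e.g. FileNotFoundError).
        logger.debug("Could not launch %s (%s); skipping GPU probe", cmd[0], err)
    except subprocess.CalledProcessError as err:
        # The command ran but failed, as in this issue (exit status 9).
        logger.warning(
            "%s exited with status %s; skipping GPU probe. Output: %s",
            cmd[0], err.returncode, (err.output or "").strip(),
        )
    return None


# Simulate the failure from this issue with a child process that exits with 9.
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(9)"]
print(probe_gpu_info_safely(failing_cmd))  # None
```

With a wrapper like this, a caller can skip the GPU section of the system log whenever the result is `None`, so a CPU-only run is not blocked by the probe.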

raitis-b commented 1 month ago

When I am not asking for the GPU in the queuing script and want to run only on the CPU, the nvidia-smi output is: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
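
That message fits the status in the traceback. As far as I know, the nvidia-smi documentation lists return code 9 as "NVIDIA driver is not loaded", and Python's subprocess module reports a process killed by a signal with a negative `returncode` rather than as "non-zero exit status 9". A minimal POSIX-only sketch of the difference, using child Python processes as stand-ins for nvidia-smi:

```python
import signal
import subprocess
import sys

# Child that exits on its own with status 9, like the call in the traceback.
exited = subprocess.run([sys.executable, "-c", "import sys; sys.exit(9)"])

# Child that kills itself with SIGKILL (signal 9); this part is POSIX only.
killed = subprocess.run(
    [sys.executable, "-c",
     "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
)

print(exited.returncode)                      # 9
print(killed.returncode == -signal.SIGKILL)   # True on POSIX
```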