raitis-b opened this issue 1 month ago
@raitis-b
Thank you for the bug report! Looking at the JSON file and cleaning it up a bit (I just used Firefox to view it; it does a decent job of rendering these JSON files), it looks like:
Traceback (most recent call last):
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/gufe/protocols/protocolunit.py", line 320, in execute
    outputs = self._execute(context, **inputs)
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/protocols/openmm_rfe/equil_rfe_methods.py", line 1127, in _execute
    log_system_probe(logging.INFO, paths=[ctx.scratch])
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 502, in log_system_probe
    sysinfo = _probe_system(pl_paths)['system information']
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 471, in _probe_system
    gpu_info = _get_gpu_info()
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/site-packages/openfe/utils/system_probe.py", line 340, in _get_gpu_info
    nvidia_smi_output = subprocess.check_output(
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/mnt/home/bobrovs/software/miniforge3/envs/openfe_env/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['nvidia-smi', '--query-gpu=gpu_uuid,gpu_name,compute_mode,pstate,temperature.gpu,utilization.memory,memory.total,driver_version,', '--format=csv']' returned non-zero exit status 9.
the nvidia-smi command failed. Could you run nvidia-smi on the same machine/node where you ran the simulation and report back what it does? Code 9 is SIGKILL, so I think that command got killed by some other process.
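A quick way to see how Python's subprocess module reports these two cases (a generic illustration, not OpenFE code; POSIX behaviour assumed):

```python
import subprocess

# In Python's subprocess module, a positive returncode is the program's own
# exit status, while a negative returncode means the process was terminated
# by that signal number (POSIX behaviour).
plain = subprocess.run(["sh", "-c", "exit 9"])
print(plain.returncode)   # 9: the program itself exited with status 9

killed = subprocess.run(["sh", "-c", "kill -9 $$"])
print(killed.returncode)  # -9: killed by SIGKILL
```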
Regardless, we want to make sure this command doesn't prevent a simulation from running, so we need to improve our error handling around it.
When I am not asking for the GPU in the queuing script and want to run only on the CPU, the nvidia-smi output is:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Hi,
I tried to run the OpenFE tutorial on my laptop and everything worked just fine, but when I tried to run it on our cluster I ran into an issue. On a GPU node it gave an error that has been mentioned before (GPU in 'Exclusive_Process' mode (or Prohibited): only one context is allowed per device. This may prevent some openmmtools features from working; the GPU must be in 'Default' compute mode). While we fix this issue, I wanted to run without the GPU, but this led to another error:
$ openfe quickrun transformations/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent.json -o results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node.json -d results/easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_login_node
Loading file...
Planning simulations for this edge...
Starting the simulations for this edge...
Done with all simulations!
Analyzing the results....
Here is the result:
dG = None ± None
Error: The protocol unit 'lig_ejm_31 to lig_ejm_46 repeat 2 generation 0' failed with the error message: CalledProcessError: 9
Details provided in output.
The only output is the .json file, which is attached.
Cheers, Raitis
easy_rbfe_lig_ejm_31_solvent_lig_ejm_46_solvent_no_gpu.json