Closed dlukauskis closed 5 years ago
I think the latest OpenMM installed by default is built against the latest CUDA 10.1. If you have 10.0, try installing the OpenMM built against that version with
conda install -c omnia/label/cuda100 OpenMM
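To pick the matching build without guessing, the toolkit version can be read from the output of `nvcc --version`. The following is a minimal sketch, not part of YANK or OpenMM: the helper names are hypothetical, and the label pattern `cuda<major><minor>` is an assumption extrapolated from the single `cuda100` label mentioned above.

```python
import re


def omnia_cuda_label(nvcc_output):
    """Return a label such as 'cuda100' parsed from `nvcc --version` text.

    Returns None when no 'release X.Y' string is found.
    The label pattern is an assumption; only 'cuda100' is confirmed above.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    major, minor = match.groups()
    return f"cuda{major}{minor}"


def conda_install_command(label):
    """Build the conda command as an argument list (hypothetical helper)."""
    return ["conda", "install", "-c", f"omnia/label/{label}", "openmm"]


if __name__ == "__main__":
    # Sample text in the style of `nvcc --version` output for CUDA 10.0.
    sample = "Cuda compilation tools, release 10.0, V10.0.130"
    print(" ".join(conda_install_command(omnia_cuda_label(sample))))
```

It is worth checking that the resulting label actually exists on the channel before running the command.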
That did indeed fix the above issue, however when I try to run the yank guest-host example, after minimisation I get:
2019-10-18 15:29:15,977: ********************************************************************************
2019-10-18 15:29:15,977: Iteration 1/500
2019-10-18 15:29:15,977: ********************************************************************************
2019-10-18 15:29:15,977: Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x7f828fa85158>
2019-10-18 15:29:15,977: Mixing replicas...
2019-10-18 15:29:15,999: Mixing of replicas took 0.022s
2019-10-18 15:29:15,999: Accepted 643720/663552 attempted swaps (97.0%)
2019-10-18 15:29:15,999: Propagating all replicas...
2019-10-18 15:29:15,999: Running _propagate_replica serially.
Traceback (most recent call last):
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/cache.py", line 430, in get_context
context = self._lru[context_id]
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/cache.py", line 147, in __getitem__
entry = self._data.pop(key)
KeyError: (-6942422706742036311, 6341839556506253280)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/dom/anaconda3/envs/yank/bin/yank", line 11, in <module>
load_entry_point('yank==0.24.1', 'console_scripts', 'yank')()
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/cli.py", line 73, in main
dispatched = getattr(commands, command).dispatch(command_args)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/commands/script.py", line 148, in dispatch
yaml_builder.run_experiments(write_status=write_status)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py", line 799, in run_experiments
completed[exp_index] = self._run_experiment(exp, write_status=write_status)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py", line 3158, in _run_experiment
built_experiment.run(n_iterations=switch_experiment_interval)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py", line 476, in run
alchemical_phase.run(n_iterations=iterations_to_run)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/yank.py", line 1209, in run
self._sampler.run(n_iterations=n_iterations)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/multistate/multistatesampler.py", line 679, in run
self._propagate_replicas()
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/utils.py", line 87, in _wrapper
return func(*args, **kwargs)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/multistate/multistatesampler.py", line 1195, in _propagate_replicas
send_results_to=0)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/mpi.py", line 512, in distribute
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/mpi.py", line 512, in <listcomp>
all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/multistate/multistatesampler.py", line 1223, in _propagate_replica
mcmc_move.apply(thermodynamic_state, sampler_state)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/mcmc.py", line 371, in apply
move.apply(thermodynamic_state, sampler_state)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/mcmc.py", line 1114, in apply
super(LangevinDynamicsMove, self).apply(thermodynamic_state, sampler_state)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/mcmc.py", line 655, in apply
context, integrator = context_cache.get_context(thermodynamic_state, integrator)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/cache.py", line 432, in get_context
context = thermodynamic_state.create_context(integrator, self._platform)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/states.py", line 1098, in create_context
return openmm.Context(system, integrator, platform)
File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/simtk/openmm/openmm.py", line 11125, in __init__
this = _openmm.new_Context(*args)
Exception: No compatible CUDA device is available
2019-10-18 15:29:16,486: Single node: executing <bound method MultiStateReporter.close of <yank.multistate.multistatereporter.MultiStateReporter object at 0x7f828cf783c8>>
The GPUs are in exclusive mode, is that the issue here? The yank command was preceded by export CUDA_VISIBLE_DEVICES=3 to make sure it only uses one GPU and runs everything in series.
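For reference, the same restriction can be applied from a Python launcher instead of the shell. This is a minimal sketch using only the standard library; `run_on_gpu` is a hypothetical helper, and the probe command below merely echoes the variable rather than running yank.

```python
import os
import subprocess
import sys


def run_on_gpu(command, gpu_index):
    """Run `command` with CUDA_VISIBLE_DEVICES restricted to one GPU.

    The parent environment is copied, so only the child process is affected.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real command: print the variable the child sees.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
    result = run_on_gpu(probe, 3)
    print(result.stdout.strip())
```

Inside the child process the selected card is renumbered as device 0, so any device index passed to the simulation code refers to that single visible GPU.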
Hi @dlukauskis,
The GPUs are in exclusive mode, is that the issue here?
Yes, shared mode is necessary for efficiency. There may be a workaround for exclusive mode, but it will slow things down a lot, so it's better to check whether you can switch to shared mode first.
@andrrizzi thanks, I'll see if we can try switching to shared mode. Out of curiosity, why was OpenMM designed for GPUs in shared mode? Why not exclusive processes?
OpenMM works with both, but in YANK we create multiple Contexts on the same GPU to speed things up. This is what causes the error, as in exclusive mode the NVIDIA driver forbids you to create multiple Contexts on the same GPU, which is instead possible in shared mode.
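To see which mode each card is in before launching a run, the compute mode can be queried through `nvidia-smi`. The sketch below is an illustration rather than anything YANK does itself: the helper names are hypothetical, and the mode strings (`Default`, `Exclusive_Process`) are assumed to match what the installed driver prints.

```python
import subprocess

# nvidia-smi query for GPU index and compute mode, one CSV row per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,compute_mode",
         "--format=csv,noheader"]


def parse_compute_modes(csv_text):
    """Map GPU index to compute mode from 'index, mode' CSV rows."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def exclusive_gpus(modes):
    """Return the sorted indices of GPUs whose mode starts with 'Exclusive'."""
    return sorted(i for i, m in modes.items() if m.startswith("Exclusive"))


def query_compute_modes():
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True,
                         check=True).stdout
    return parse_compute_modes(out)


if __name__ == "__main__":
    # Sample rows in the assumed output format, so no GPU is needed here.
    sample = "0, Default\n1, Exclusive_Process\n"
    print(exclusive_gpus(parse_compute_modes(sample)))
```

Any GPU reported as exclusive would need to be switched to the default (shared) mode by an administrator, typically with `nvidia-smi -c DEFAULT`, before several Contexts can share it.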
I see. We have switched to shared mode and it works perfectly. Thanks to you both, @andrrizzi and @jchodera!
I'm trying to use Yank on our local gpu-machine. I've installed Yank via conda and if I try to run yank selftest, I get the following:
My guess is this has something to do with detecting CUDA libraries, however I've made sure I include these into my .bashrc:
The machine has 4 GTX 1080 cards, Ubuntu 18.04 and no queue system installed. The NVIDIA driver version is 410.78 and the CUDA version is 10.0.