choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.
http://getyank.org
MIT License

Undetected CUDA libraries #1186

Closed dlukauskis closed 5 years ago

dlukauskis commented 5 years ago

I'm trying to use YANK on our local GPU machine. I've installed YANK via conda, and when I run yank selftest, I get the following:

/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py:1170: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  """)

YANK Selftest
-------------
Yank Version 0.24.1 

Available OpenMM platforms:
    0 Reference
    1 CPU
    2 OpenCL

************************************************

Warning! There were OpenMM Platform Load Errors!
************************************************
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMCUDA.so: libcufft.so.10: cannot open shared object file: No such file or directory
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMRPMDCUDA.so: libOpenMMCUDA.so: cannot open shared object file: No such file or directory
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMDrudeCUDA.so: libOpenMMCUDA.so: cannot open shared object file: No such file or directory
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMAmoebaCUDA.so: libOpenMMCUDA.so: cannot open shared object file: No such file or directory
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMCudaCompiler.so: libnvrtc.so.10.1: cannot open shared object file: No such file or directory
************************************************
************************************************

Valid OpenEye install not found
Not required, but please check install if you expected it

Checking GPU Computed Mode (if present)...
Found 4 NVIDIA GPUs in the following modes: [Exclusive_Process, Exclusive_Process, Exclusive_Process, Exclusive_Process]
These should all be in shared/Default mode for YANK to use them
YANK Selftest complete.
Thank you for using YANK!

My guess is that this has something to do with detecting the CUDA libraries. However, I've made sure to include these lines in my .bashrc:

export CUDA_HOME=/usr/local/cuda-10.0
export PATH=/usr/local/cuda-10.0/bin:/usr/local/cuda-10.0/NsightCompute-2019.1${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:$LD_LIBRARY_PATH

export OPENMM_CUDA_COMPILER=$CUDA_HOME/bin/nvcc

The machine has 4 GTX 1080 cards, runs Ubuntu 18.04, and has no queue system installed. The NVIDIA driver version is 410.78 and the CUDA version is 10.0.
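
The load errors in the selftest output name the shared library each plugin failed to find, which can be compared against the CUDA toolkit on LD_LIBRARY_PATH. Here, libnvrtc.so.10.1 carries a 10.1 version suffix, while the environment points at /usr/local/cuda-10.0. The following is a minimal sketch of pulling those names out of the output. The function name missing_libraries is hypothetical, and the regular expression assumes the message format shown above.

```python
import re

# Matches lines of the form shown in the selftest output:
#   Error loading library <plugin path>: <library>: cannot open shared object file: ...
_LOAD_ERROR = re.compile(
    r"Error loading library (?P<plugin>\S+): (?P<missing>\S+): cannot open shared object file"
)


def missing_libraries(selftest_output):
    """Map each plugin file name to the shared library it failed to load.

    Hypothetical helper: it only parses text and does not call OpenMM.
    """
    missing = {}
    for line in selftest_output.splitlines():
        match = _LOAD_ERROR.search(line)
        if match:
            # Keep only the file name of the plugin, not the full path.
            plugin = match.group("plugin").rsplit("/", 1)[-1]
            missing[plugin] = match.group("missing")
    return missing


# Two of the error lines from the selftest output above.
sample = """\
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMCUDA.so: libcufft.so.10: cannot open shared object file: No such file or directory
Error loading library /home/dom/anaconda3/envs/yank/lib/plugins/libOpenMMCudaCompiler.so: libnvrtc.so.10.1: cannot open shared object file: No such file or directory
"""
result = missing_libraries(sample)
for plugin, library in result.items():
    print(plugin, "needs", library)
```

A plugin that reports another OpenMM plugin as missing (for example libOpenMMRPMDCUDA.so reporting libOpenMMCUDA.so) is most likely a knock-on failure, so the entries naming CUDA toolkit libraries are the ones to look at first.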

jchodera commented 5 years ago

I think the latest OpenMM installed by default is built against the latest CUDA 10.1. If you have 10.0, try installing the OpenMM built against that version with

conda install -c omnia/label/cuda100 OpenMM

dlukauskis commented 5 years ago

That did indeed fix the above issue. However, when I try to run the YANK guest-host example, I get the following after minimisation:

2019-10-18 15:29:15,977: ********************************************************************************
2019-10-18 15:29:15,977: Iteration 1/500
2019-10-18 15:29:15,977: ********************************************************************************
2019-10-18 15:29:15,977: Single node: executing <function ReplicaExchangeSampler._mix_replicas at 0x7f828fa85158>
2019-10-18 15:29:15,977: Mixing replicas...
2019-10-18 15:29:15,999: Mixing of replicas took    0.022s
2019-10-18 15:29:15,999: Accepted 643720/663552 attempted swaps (97.0%)
2019-10-18 15:29:15,999: Propagating all replicas...
2019-10-18 15:29:15,999: Running _propagate_replica serially.
Traceback (most recent call last):
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/cache.py", line 430, in get_context
    context = self._lru[context_id]
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/cache.py", line 147, in __getitem__
    entry = self._data.pop(key)
KeyError: (-6942422706742036311, 6341839556506253280)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dom/anaconda3/envs/yank/bin/yank", line 11, in <module>
    load_entry_point('yank==0.24.1', 'console_scripts', 'yank')()
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/cli.py", line 73, in main
    dispatched = getattr(commands, command).dispatch(command_args)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/commands/script.py", line 148, in dispatch
    yaml_builder.run_experiments(write_status=write_status)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py", line 799, in run_experiments
    completed[exp_index] = self._run_experiment(exp, write_status=write_status)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py", line 3158, in _run_experiment
    built_experiment.run(n_iterations=switch_experiment_interval)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/experiment.py", line 476, in run
    alchemical_phase.run(n_iterations=iterations_to_run)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/yank.py", line 1209, in run
    self._sampler.run(n_iterations=n_iterations)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/multistate/multistatesampler.py", line 679, in run
    self._propagate_replicas()
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/utils.py", line 87, in _wrapper
    return func(*args, **kwargs)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/multistate/multistatesampler.py", line 1195, in _propagate_replicas
    send_results_to=0)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/mpi.py", line 512, in distribute
    all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/mpi.py", line 512, in <listcomp>
    all_results = [task(job_args, *other_args, **kwargs) for job_args in distributed_args]
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/yank/multistate/multistatesampler.py", line 1223, in _propagate_replica
    mcmc_move.apply(thermodynamic_state, sampler_state)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/mcmc.py", line 371, in apply
    move.apply(thermodynamic_state, sampler_state)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/mcmc.py", line 1114, in apply
    super(LangevinDynamicsMove, self).apply(thermodynamic_state, sampler_state)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/mcmc.py", line 655, in apply
    context, integrator = context_cache.get_context(thermodynamic_state, integrator)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/cache.py", line 432, in get_context
    context = thermodynamic_state.create_context(integrator, self._platform)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/openmmtools/states.py", line 1098, in create_context
    return openmm.Context(system, integrator, platform)
  File "/home/dom/anaconda3/envs/yank/lib/python3.7/site-packages/simtk/openmm/openmm.py", line 11125, in __init__
    this = _openmm.new_Context(*args)
Exception: No compatible CUDA device is available
2019-10-18 15:29:16,486: Single node: executing <bound method MultiStateReporter.close of <yank.multistate.multistatereporter.MultiStateReporter object at 0x7f828cf783c8>>

The GPUs are in exclusive mode, is that the issue here? The yank command was preceded by export CUDA_VISIBLE_DEVICES=3 to make sure it uses only one GPU and runs everything serially.

andrrizzi commented 5 years ago

Hi @dlukauskis,

The GPUs are in exclusive mode, is that the issue here?

Yes, shared mode is necessary for efficiency. There may be a workaround for exclusive mode, but it will slow things down a lot, so it's better to check first whether you can switch to shared mode.

dlukauskis commented 5 years ago

@andrrizzi thanks, I'll see if we can try switching to shared mode. Out of curiosity, why was OpenMM designed for GPUs in shared mode? Why not exclusive processes?

andrrizzi commented 5 years ago

OpenMM works with both, but in YANK we create multiple Contexts on the same GPU to speed things up. This is what causes the error: in exclusive mode the NVIDIA driver forbids creating multiple Contexts on the same GPU, whereas shared mode allows it.
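
A quick way to see which GPUs would hit this is to ask nvidia-smi for each card's compute mode. The sketch below assumes nvidia-smi is on the PATH and prints one "index, mode" row per GPU for the query shown. The helper names are hypothetical, and the sample rows are made up for illustration rather than taken from the machine in this issue.

```python
import subprocess


def parse_compute_modes(csv_text):
    """Parse 'index, compute_mode' rows into a {gpu_index: mode} dict."""
    modes = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, mode = (field.strip() for field in line.split(",", 1))
        modes[int(index)] = mode
    return modes


def non_default_gpus(modes):
    """Return the indices of GPUs that are not in shared (Default) mode."""
    return sorted(index for index, mode in modes.items() if mode != "Default")


def query_compute_modes():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_compute_modes(completed.stdout)


# Illustrative rows only; on a real machine call query_compute_modes() instead.
sample = "0, Exclusive_Process\n1, Exclusive_Process\n2, Default\n3, Exclusive_Process\n"
modes = parse_compute_modes(sample)
print("GPUs not in Default mode:", non_default_gpus(modes))
```

Switching a card back to shared mode is done with nvidia-smi -c DEFAULT (optionally with -i and the GPU index), which usually requires administrator rights.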

dlukauskis commented 5 years ago

I see. We have switched to shared mode and it works perfectly. Thanks to you both, @andrrizzi and @jchodera!