choderalab / ensembler

Automated omics-scale protein modeling and simulation setup.
http://ensembler.readthedocs.io/
GNU General Public License v2.0

Having trouble with implicit refinement #66

Closed: steven-albanese closed this issue 8 years ago

steven-albanese commented 8 years ago

Trying to run the implicit refinement in parallel on the cluster using the following script:

#!/bin/bash                                                                                                                               
#  Batch script for mpirun job on cbio cluster.                                                                                           
#                                                                                                                                         
#                                                                                                                                         
# walltime : maximum wall clock time (hh:mm:ss)                                                                                           
#PBS -l walltime=24:00:00                                                                                                                 
#                                                                                                                                         
# join stdout and stderr                                                                                                                  
#PBS -j oe                                                                                                                                
#                                                                                                                                         
# spool output immediately                                                                                                                
#PBS -k oe                                                                                                                                
#                                                                                                                                         
# specify queue                                                                                                                           
#PBS -q gpu                                                                                                                               
#                                                                                                                                         
# nodes: number of 8-core nodes                                                                                                           
#   ppn: how many cores per node to use (1 through 8)                                                                                     
#       (you are always charged for the entire node)                                                                                      
#PBS -l nodes=4:ppn=4:gpus=4:shared                                                                                                       
#                                                                                                                                         
# export all my environment variables to the job                                                                                          
##PBS -V                                                                                                                                  
#                                                                                                                                         
# job name (default = name of script file)                                                                                                
#PBS -N implicit-refinement                                                                                                                
#                                                                                                                                          
# specify email for notifications
#PBS -M steven.albanese@choderalab.org                                                                                                     

cd /cbio/jclab/home/albaness/ensembler/BRAF

module load cuda/6.5

build_mpirun_configfile --mpitype conda ensembler refine_implicit

mpirun -configfile configfile

It's modified from @sonyahanson's script in dansu-dansu. Only a small subset of the processes seem to be having trouble. The log is shown below.

Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
Auto-selected OpenMM platform: CUDA
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4KSQ_B in implicit solvent for 100.0 ps (MPI rank: 2, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4MNE_G in implicit solvent for 100.0 ps (MPI rank: 7, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4MNE_C in implicit solvent for 100.0 ps (MPI rank: 5, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4MBJ_A in implicit solvent for 100.0 ps (MPI rank: 3, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4KSQ_A in implicit solvent for 100.0 ps (MPI rank: 1, GPU ID: 0)
-------------------------------------------------------------------------
/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 7 hostname gpu-2-15.local gpuid 0 =
Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

Traceback (most recent call last):
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 110, in simulate_implicit_md
    modeller.addHydrogens(forcefield, pH=ph, variants=reference_variants)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/app/modeller.py", line 853, in addHydrogens
    context = Context(system, VerletIntegrator(0.0))
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15050, in __init__
    this = _openmm.new_Context(*args)
Exception: Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

= ERROR end: MPI rank 7 hostname gpu-2-15.local gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 5 hostname gpu-2-15.local gpuid 0 =
Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

Traceback (most recent call last):
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 110, in simulate_implicit_md
    modeller.addHydrogens(forcefield, pH=ph, variants=reference_variants)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/app/modeller.py", line 853, in addHydrogens
    context = Context(system, VerletIntegrator(0.0))
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15050, in __init__
    this = _openmm.new_Context(*args)
Exception: Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

= ERROR end: MPI rank 5 hostname gpu-2-15.local gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 2 hostname gpu-2-12.local gpuid 0 =
Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

Traceback (most recent call last):
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 110, in simulate_implicit_md
    modeller.addHydrogens(forcefield, pH=ph, variants=reference_variants)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/app/modeller.py", line 853, in addHydrogens
    context = Context(system, VerletIntegrator(0.0))
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15050, in __init__
    this = _openmm.new_Context(*args)
Exception: Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

= ERROR end: MPI rank 2 hostname gpu-2-12.local gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_5CT7_B in implicit solvent for 100.0 ps (MPI rank: 2, GPU ID: 0)
-------------------------------------------------------------------------
/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 1 hostname gpu-2-12.local gpuid 0 =
Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

Traceback (most recent call last):
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 110, in simulate_implicit_md
    modeller.addHydrogens(forcefield, pH=ph, variants=reference_variants)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/app/modeller.py", line 853, in addHydrogens
    context = Context(system, VerletIntegrator(0.0))
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15050, in __init__
    this = _openmm.new_Context(*args)
Exception: Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

= ERROR end: MPI rank 1 hostname gpu-2-12.local gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 3 hostname gpu-2-12.local gpuid 0 =
Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

Traceback (most recent call last):
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 110, in simulate_implicit_md
    modeller.addHydrogens(forcefield, pH=ph, variants=reference_variants)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/app/modeller.py", line 853, in addHydrogens
    context = Context(system, VerletIntegrator(0.0))
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 15050, in __init__
    this = _openmm.new_Context(*args)
Exception: Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

= ERROR end: MPI rank 3 hostname gpu-2-12.local gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py:300: UserWarning: = ERROR start: MPI rank 2 hostname gpu-2-12.local gpuid 0 =
Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

Traceback (most recent call last):
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 288, in refine_implicit_md
    simulate_implicit_md()
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/ensembler/refinement.py", line 110, in simulate_implicit_md
    modeller.addHydrogens(forcefield, pH=ph, variants=reference_variants)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/app/modeller.py", line 857, in addHydrogens
    LocalEnergyMinimizer.minimize(context, 1.0, 50)
  File "/cbio/jclab/home/albaness/miniconda/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 12223, in minimize
    return _openmm.LocalEnergyMinimizer_minimize(*args)
Exception: Error launching CUDA compiler: 256
<built-in>:0:0: fatal error: when writing output to : Bad file descriptor
compilation terminated.

= ERROR end: MPI rank 2 hostname gpu-2-12.local gpuid 0
  mpistate.rank, socket.gethostname(), gpuid, e, trbk
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3II5_A in implicit solvent for 100.0 ps (MPI rank: 12, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3D4Q_A in implicit solvent for 100.0 ps (MPI rank: 8, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_1UWH_A in implicit solvent for 100.0 ps (MPI rank: 0, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_2FB8_A in implicit solvent for 100.0 ps (MPI rank: 4, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3PPJ_A in implicit solvent for 100.0 ps (MPI rank: 0, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3PRF_A in implicit solvent for 100.0 ps (MPI rank: 4, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3PSB_A in implicit solvent for 100.0 ps (MPI rank: 8, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3Q4C_A in implicit solvent for 100.0 ps (MPI rank: 12, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3SKC_A in implicit solvent for 100.0 ps (MPI rank: 0, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_3TV6_A in implicit solvent for 100.0 ps (MPI rank: 4, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4E26_A in implicit solvent for 100.0 ps (MPI rank: 8, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4EHE_A in implicit solvent for 100.0 ps (MPI rank: 12, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4FC0_A in implicit solvent for 100.0 ps (MPI rank: 0, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4G9C_A in implicit solvent for 100.0 ps (MPI rank: 4, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4KSP_B in implicit solvent for 100.0 ps (MPI rank: 0, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4MBJ_B in implicit solvent for 100.0 ps (MPI rank: 4, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4H58_A in implicit solvent for 100.0 ps (MPI rank: 8, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4JVG_C in implicit solvent for 100.0 ps (MPI rank: 12, GPU ID: 0)
-------------------------------------------------------------------------
-------------------------------------------------------------------------
Simulating BRAF_HUMAN_D0 => BRAF_HUMAN_D0_4PP7_B in implicit solvent for 100.0 ps (MPI rank: 12, GPU ID: 0)
-------------------------------------------------------------------------
Done.
Compute mode is already set to DEFAULT for GPU 0000:84:00.0.
All done.
Compute mode is already set to DEFAULT for GPU 0000:83:00.0.
All done.
Compute mode is already set to DEFAULT for GPU 0000:04:00.0.
All done.
Compute mode is already set to DEFAULT for GPU 0000:03:00.0.
All done.

Is this a problem with my clusterutils setup? It looks like the CUDA errors are only happening on certain nodes.
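
If it helps narrow things down, the failing step (building an OpenMM Context on the CUDA platform) can be reproduced outside of ensembler with a small standalone script like the one below and run directly on a suspect node (e.g. ssh gpu-2-12 python check_cuda_context.py). This is just a sketch of mine; the file name and the one-particle test system are arbitrary, not ensembler commands.

# check_cuda_context.py -- illustrative sketch only, not part of ensembler.
import socket

import simtk.openmm as mm
from simtk import unit

try:
    # A trivial system with one particle and no forces is enough to trigger
    # CUDA kernel compilation, which is where the "Error launching CUDA
    # compiler" exception above is raised.
    system = mm.System()
    system.addParticle(1.0 * unit.amu)
    integrator = mm.VerletIntegrator(1.0 * unit.femtoseconds)
    platform = mm.Platform.getPlatformByName('CUDA')
    context = mm.Context(system, integrator, platform)
    print('%s: CUDA context created OK' % socket.gethostname())
    del context
except Exception as e:
    print('%s: CUDA context creation FAILED: %s' % (socket.gethostname(), e))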

jchodera commented 8 years ago

First, check if you can run nvcc on those nodes:

ssh gpu-2-12 nvcc --version

If not, your module environment may not have CUDA 7.0 uniformly selected. My ~/.modulerc is:

[chodera@mskcc-ln1 ~albaness]$ cat ~/.modulerc
#%Module
module remove gcc
module add cmake
module add cuda/7.0
module add gcc
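
To check every node in the allocation at once, something along these lines should also work. This is a rough sketch of mine, assuming the job's $PBS_NODEFILE is available and that your login environment loads the CUDA module on the compute nodes.

# Illustrative only: run `nvcc --version` on every node in the current PBS allocation.
import os
import subprocess

with open(os.environ['PBS_NODEFILE']) as node_file:
    nodes = sorted(set(line.strip() for line in node_file if line.strip()))

for node in nodes:
    print('=== %s ===' % node)
    # Non-zero return code if nvcc is missing or ssh fails; report and continue.
    rc = subprocess.call(['ssh', node, 'which nvcc && nvcc --version'])
    if rc != 0:
        print('nvcc not found (or ssh failed) on %s' % node)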
jchodera commented 8 years ago

This could also be a red herring: some other error may be causing files to be closed early, which is then reported as a "Bad file descriptor" error.

steven-albanese commented 8 years ago

I think it was a problem with my module environment, since I hadn't configured that at all. With your .modulerc, it seems to be working at the moment. Thanks!

steven-albanese commented 8 years ago

Yeah, I just ran the explicit refinement step with no problems at all!