choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.
http://getyank.org
MIT License
179 stars 70 forks source link

Startup time seems to be absurdly long #681

Closed jchodera closed 7 years ago

jchodera commented 7 years ago

It's going to really slow down our ability to debug YANK (as well as throughput even for fast calculations) if the startup time is so absurdly long.

I suggest we use some sort of timing/logging facility to record how much time is spent in each step so that we can get some hard data on what takes so long. A small amount of targeted optimization may reduce a lot of pain, though there are likely a lot of steps that could potentially be trivially parallelized as well.

jchodera commented 7 years ago

I'm up to an hour already running the yank-examples/examples/binding/abl-imatinib/ example and it hasn't even started yet:

Running simulation via MPI...
2017-05-21 10:45:16,221: Setting CUDA platform to use precision model 'mixed'.
2017-05-21 10:45:16,248: Node 1/20: executing <function YamlBuilder._check_resume at 0x7f1689dbcd90>
2017-05-21 10:45:16,251: Node 1/20: waiting for barrier after <function YamlBuilder._check_resume at 0x7f1689dbcd90>
2017-05-21 10:45:16,360: Node 1/20: executing <function YamlBuilder._setup_experiments at 0x7f1689dbf0d0>
2017-05-21 10:45:16,360: Setting up the systems for Abl, STI0 and pme
2017-05-21 10:47:03,504: 4-[(4-methylpiperazin-4-ium-1-yl)methyl]-~{N}-[4-methyl-3-[[4-(3-pyridyl)pyrimidin-2-yl]amino]phenyl]benzamide
2017-05-21 10:48:24,317: Setting up the systems for Abl, STI1 and pme
2017-05-21 10:49:26,951: 4-[(4-methylpiperazin-1-ium-1-yl)methyl]-~{N}-[4-methyl-3-[[4-(3-pyridyl)pyrimidin-2-yl]amino]phenyl]benzamide
2017-05-21 10:51:15,080: Setting up the systems for Abl, STI2 and pme
2017-05-21 10:52:43,480: 4-[(4-methylpiperazin-1-yl)methyl]-~{N}-[4-methyl-3-[[4-(3-pyridyl)pyrimidin-2-yl]amino]phenyl]benzamide
2017-05-21 10:53:41,521: Node 1/20: waiting for barrier after <function YamlBuilder._setup_experiments at 0x7f1689dbf0d0>
2017-05-21 10:53:41,524: Node 1/20: executing <function YamlBuilder._safe_makedirs at 0x7f1689dbf2f0>
2017-05-21 10:53:41,529: Node 1/20: waiting for barrier after <function YamlBuilder._safe_makedirs at 0x7f1689dbf2f0>
2017-05-21 10:53:41,719: Node 1/20: executing <bound method YamlBuilder._generate_yaml of <yank.yamlbuild.YamlBuilder object at 0x7f16baafe4a8>>
2017-05-21 10:53:41,779: DSL string for the ligand: "resname MOL"
2017-05-21 10:53:41,779: DSL string for the solvent: "auto"
2017-05-21 10:53:41,781: Reading phase complex
2017-05-21 10:53:41,782: prmtop: explicit/setup/systems/Abl-STI_STI0/complex.prmtop
2017-05-21 10:53:41,782: inpcrd: explicit/setup/systems/Abl-STI_STI0/complex.inpcrd
2017-05-21 10:53:54,097: ligand_atoms net charge: 1
2017-05-21 10:53:54,652: Ions net charges: [('Cl-', -1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1)]
2017-05-21 10:53:54,654: Reading phase solvent
2017-05-21 10:53:54,654: prmtop: explicit/setup/systems/Abl-STI_STI0/solvent.prmtop
2017-05-21 10:53:54,654: inpcrd: explicit/setup/systems/Abl-STI_STI0/solvent.inpcrd
2017-05-21 10:53:56,094: solute_atoms net charge: 1
2017-05-21 10:53:56,141: Ions net charges: [('Cl-', -1)]
2017-05-21 10:53:56,141: Node 1/20: executing <function YamlBuilder._save_analysis_script at 0x7f1689dbf158>
2017-05-21 10:53:56,146: Initial checkpoint file automatically chosen as explicit/experiments/AblSTISTI0/complex_checkpoint.nc
2017-05-21 10:54:40,754: Creating receptor-ligand restraints...
2017-05-21 10:54:40,754: There are undefined restraint parameters. Trying automatic parametrization.
2017-05-21 10:54:40,755: Automatically selecting restraint atoms and parameters:
2017-05-21 10:54:40,755: Attempt 0 / 100 at automatically selecting atoms and restraint parameters...
2017-05-21 10:54:42,301: restraint parameters:
                     K_r : 20.0 kcal/(A**2 mol)
                   r_aA0 : 0.37732 nm
                K_thetaA : 20.0 kcal/(mol rad**2)
                theta_A0 : 1.0874 rad
                K_thetaB : 20.0 kcal/(mol rad**2)
                theta_B0 : 1.62612 rad
                  K_phiA : 20.0 kcal/(mol rad**2)
                  phi_A0 : -0.83044 rad
                  K_phiB : 20.0 kcal/(mol rad**2)
                  phi_B0 : -1.03741 rad
                  K_phiC : 20.0 kcal/(mol rad**2)
                  phi_C0 : 2.21118 rad

2017-05-21 10:54:49,985: Creating alchemically-modified states...
2017-05-21 10:55:23,526: Create alchemically modified system took   29.723s
2017-05-21 10:57:04,733: WARNING - yank.restraints - The provided restrained atoms do not guarantee a robust calculation of the standard state correction. Switching to the numerical scheme.
2017-05-21 10:59:07,372: WARNING - yank.restraints - The provided restrained atoms do not guarantee a robust calculation of the standard state correction. Switching to the numerical scheme.
2017-05-21 11:38:11,900: Creating expanded cutoff states...
2017-05-21 11:38:36,266: Setting cutoff for fully interacting system to maximum allowed 16.0 A
2017-05-21 11:39:51,985: Setting cutoff for fully interacting system to maximum allowed 16.0 A
2017-05-21 11:40:03,948: Creating sampler object...
2017-05-21 11:40:03,954: Node 1/20: executing <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x7f1673303828>>
2017-05-21 11:40:03,963: Node 1/20: waiting for broadcast of <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x7f1673303828>>
jchodera commented 7 years ago

Any idea what it was doing for the 40 minutes between 10:59:07 and 11:38:11?

jchodera commented 7 years ago

I don't think yank is swapping to disk here. Each process is only using ~4.4GB:

11164 chodera   20   0 5897m 4.4g 2676 R 100.4  1.8  71:57.97 yank                                                                                                                                                                                                                                                                                    
11166 chodera   20   0 5630m 4.4g 2576 R 100.4  1.7  79:25.98 yank                                                                                                                                                                                                                                                                                    
11167 chodera   20   0 5629m 4.4g 2576 R 100.4  1.8  77:47.10 yank                                                                                                                                                                                                                                                                                    
11165 chodera   20   0 3344m 2.3g 2716 R 98.5  0.9  79:15.47 yank   

Some of the nodes have osprey running on them using a ton of memory (>80GB), and that may be swapping somewhat, however. But I'm really not sure what's going on between these two timestamps:

2017-05-21 10:59:07,372: WARNING - yank.restraints - The provided restrained atoms do not guarantee a robust calculation of the standard state correction. Switching to the numerical scheme.
2017-05-21 11:38:11,900: Creating expanded cutoff states...
Lnaden commented 7 years ago

You could run cprofile to see what happens. Does this also happen in serial mode? If not, then it might be the mpi barriers

jchodera commented 7 years ago

I bet there are a number of frameworks for logging and aggregating timing events for individual functions, but there are some cloud-based services we might consider too if we want to collect instrumented data from larger deployments to continue to improve yank timings.

Some possibilities:

jchodera commented 7 years ago

You could run cprofile to see what happens. Does this also happen in serial mode? If not, then it might be the mpi barriers

The MPI barriers waiting for ... what exactly?

jchodera commented 7 years ago

I think this is the problem: We have no idea what we're waiting on. It could be one of the nodes is being bogged down with some other user's process, but we have no idea what is going on there at the moment. Can we add some instrumentation to allow us to figure this out?

jchodera commented 7 years ago

Finally got some more logging output:

2017-05-21 11:40:03,948: Creating sampler object...
2017-05-21 11:40:03,954: Node 1/20: executing <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x7f1673303828>>
2017-05-21 11:40:03,963: Node 1/20: waiting for broadcast of <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x7f1673303828>>
2017-05-21 12:34:42,677: Node 1/20: executing <function ReplicaExchange._initialize_reporter at 0x7f16939a9ae8>
2017-05-21 12:35:07,605: Serialized state thermodynamic_states/0 is  39974315B | 39037.417KB | 38.122MB
jchodera commented 7 years ago

Hm. I wonder if the issue is that YANK is running on nodes that @rafwiewiora is running osprey on. Those jobs are using a huge amount of virtual memory, and using the gpu queue on hal may mean that we skip by the memory limitations of the queue configuration. Checking.

jchodera commented 7 years ago

I think that was the problem. I'm running on lilac now and things progress much more quickly (though still quite slowly overall). It would be great to instrument the code and see which steps are slow.

jchodera commented 7 years ago

Running on one node of lilac produces this:

2017-05-21 13:35:01,016: Node 1/4: waiting for barrier after <function YamlBuilder._setup_experiments at 0x2b616a346a60>
2017-05-21 13:35:01,017: Node 1/4: executing <function YamlBuilder._safe_makedirs at 0x2b616a346c80>
2017-05-21 13:35:01,020: Node 1/4: waiting for barrier after <function YamlBuilder._safe_makedirs at 0x2b616a346c80>
2017-05-21 13:35:01,025: Node 1/4: executing <bound method YamlBuilder._generate_yaml of <yank.yamlbuild.YamlBuilder object at 0x2b61405ee080>>
2017-05-21 13:35:01,067: DSL string for the ligand: "resname MOL"
2017-05-21 13:35:01,067: DSL string for the solvent: "auto"
2017-05-21 13:35:01,068: Reading phase complex
2017-05-21 13:35:01,069: prmtop: explicit/setup/systems/Abl-STI_STI0/complex.prmtop
2017-05-21 13:35:01,069: inpcrd: explicit/setup/systems/Abl-STI_STI0/complex.inpcrd
2017-05-21 13:35:11,801: ligand_atoms net charge: 1
2017-05-21 13:35:12,440: Ions net charges: [('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Cl-', -1)]
2017-05-21 13:35:12,441: Reading phase solvent
2017-05-21 13:35:12,441: prmtop: explicit/setup/systems/Abl-STI_STI0/solvent.prmtop
2017-05-21 13:35:12,441: inpcrd: explicit/setup/systems/Abl-STI_STI0/solvent.inpcrd
2017-05-21 13:35:13,683: solute_atoms net charge: 1
2017-05-21 13:35:13,728: Ions net charges: [('Cl-', -1)]
2017-05-21 13:35:13,729: Node 1/4: executing <function YamlBuilder._save_analysis_script at 0x2b616a346ae8>
2017-05-21 13:35:13,733: Initial checkpoint file automatically chosen as explicit/experiments/AblSTISTI0/complex_checkpoint.nc
2017-05-21 13:35:52,333: Creating receptor-ligand restraints...
2017-05-21 13:35:52,334: There are undefined restraint parameters. Trying automatic parametrization.
2017-05-21 13:35:52,334: Automatically selecting restraint atoms and parameters:
2017-05-21 13:35:52,334: Attempt 0 / 100 at automatically selecting atoms and restraint parameters...
2017-05-21 13:35:53,996: restraint parameters:
                  phi_B0 : -0.84833 rad
                theta_B0 : 1.7485 rad
                   r_aA0 : 0.398983 nm
                  phi_C0 : -1.11721 rad
                theta_A0 : 2.19225 rad
                  K_phiC : 20.0 kcal/(mol rad**2)
                     K_r : 20.0 kcal/(A**2 mol)
                  K_phiB : 20.0 kcal/(mol rad**2)
                  phi_A0 : -0.893556 rad
                K_thetaB : 20.0 kcal/(mol rad**2)
                  K_phiA : 20.0 kcal/(mol rad**2)
                K_thetaA : 20.0 kcal/(mol rad**2)

2017-05-21 13:36:00,428: Creating alchemically-modified states...
2017-05-21 13:36:29,808: Create alchemically modified system took   26.161s
2017-05-21 13:38:38,261: WARNING - yank.restraints - The provided restrained atoms do not guarantee a robust calculation of the standard state correction. Switching to the numerical scheme.
2017-05-21 14:10:52,568: Creating expanded cutoff states...
2017-05-21 14:11:12,494: Setting cutoff for fully interacting system to maximum allowed 16.0 A
2017-05-21 14:12:11,684: Setting cutoff for fully interacting system to maximum allowed 16.0 A
2017-05-21 14:12:21,046: Creating sampler object...
2017-05-21 14:12:21,052: Node 1/4: executing <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x2b618126ab70>>
2017-05-21 14:12:21,056: Node 1/4: waiting for broadcast of <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x2b618126ab70>>
2017-05-21 14:55:23,521: Node 1/4: executing <function ReplicaExchange._initialize_reporter at 0x2b6164d57158>
2017-05-21 14:55:43,418: Serialized state thermodynamic_states/0 is  39974317B | 39037.419KB | 38.122MB
2017-05-21 15:14:45,222: Serialized state unsampled_states/0 is  14592701B | 14250.685KB | 13.917MB
2017-05-21 15:15:50,344: Serialized state unsampled_states/1 is  39974242B | 39037.346KB | 38.122MB
2017-05-21 15:17:54,011: Storing thermodynamic states took 1350.474s
2017-05-21 15:17:54,012: Storing general ReplicaExchange options...
2017-05-21 15:19:00,330: Node 1/4: executing <function ReplicaExchange._report_iteration at 0x2b6164d57400>
2017-05-21 15:19:00,461: Storing sampler states took    0.129s
2017-05-21 15:19:02,045: Writing iteration information to storage took    1.713s
2017-05-21 15:19:02,045: Node 1/4: waiting for barrier after <function ReplicaExchange._initialize_reporter at 0x2b6164d57158>
2017-05-21 15:19:10,651: Minimizing 1 sampler states in the reference thermodynamic state
2017-05-21 15:19:10,652: Node 1/4: execute _minimize_sampler_state(0)
2017-05-21 15:19:42,383: Sampler state 1/52: initial energy -60689.291kT
2017-05-21 15:23:58,034: Sampler state 1/52: final energy -313887.466kT
2017-05-21 15:23:58,034: Node 1/4: allgather results of _minimize_sampler_state
2017-05-21 15:23:58,043: Node 1/4: executing <bound method Reporter.write_sampler_states of <yank.repex.Reporter object at 0x2b618126ab70>>
2017-05-21 15:23:58,165: Storing sampler states took    0.122s
2017-05-21 15:23:58,165: Minimizing all replicas...
2017-05-21 15:23:58,166: Node 1/4: execute _minimize_replica(0)
2017-05-21 15:24:26,436: Replica 1/52: initial energy -313928.888kT
2017-05-21 15:26:25,413: Replica 1/52: final energy -314226.784kT
2017-05-21 15:26:25,413: Node 1/4: execute _minimize_replica(4)
2017-05-21 15:26:25,468: Replica 5/52: initial energy -313921.869kT
2017-05-21 15:28:47,875: Replica 5/52: final energy -314305.295kT
2017-05-21 15:28:47,875: Node 1/4: execute _minimize_replica(8)
2017-05-21 15:28:47,932: Replica 9/52: initial energy -313893.795kT
2017-05-21 15:31:13,193: Replica 9/52: final energy -314291.759kT
2017-05-21 15:31:13,193: Node 1/4: execute _minimize_replica(12)
2017-05-21 15:31:13,256: Replica 13/52: initial energy -313731.166kT
2017-05-21 15:32:53,956: Replica 13/52: final energy -314002.794kT
2017-05-21 15:32:53,957: Node 1/4: execute _minimize_replica(16)
2017-05-21 15:32:54,011: Replica 17/52: initial energy -313568.538kT
2017-05-21 15:35:35,300: Replica 17/52: final energy -314059.045kT
2017-05-21 15:35:35,301: Node 1/4: execute _minimize_replica(20)
2017-05-21 15:35:35,357: Replica 21/52: initial energy -313405.909kT
2017-05-21 15:38:19,179: Replica 21/52: final energy -313936.109kT
2017-05-21 15:38:19,179: Node 1/4: execute _minimize_replica(24)
2017-05-21 15:38:19,237: Replica 25/52: initial energy -313243.281kT
2017-05-21 15:40:34,518: Replica 25/52: final energy -313711.292kT
2017-05-21 15:40:34,519: Node 1/4: execute _minimize_replica(28)
2017-05-21 15:40:34,576: Replica 29/52: initial energy -313080.653kT
2017-05-21 15:42:30,057: Replica 29/52: final energy -313570.321kT
2017-05-21 15:42:30,057: Node 1/4: execute _minimize_replica(32)
2017-05-21 15:42:32,446: Replica 33/52: initial energy -313075.459kT
2017-05-21 15:45:16,345: Replica 33/52: final energy -313665.582kT
2017-05-21 15:45:16,346: Node 1/4: execute _minimize_replica(36)
2017-05-21 15:45:18,937: Replica 37/52: initial energy -313055.172kT
2017-05-21 15:47:12,438: Replica 37/52: final energy -313514.038kT
2017-05-21 15:47:12,440: Node 1/4: execute _minimize_replica(40)
2017-05-21 15:47:14,883: Replica 41/52: initial energy -313028.199kT
...

It takes 104 minutes to get to get from the start:

2017-05-21 13:35:01,016: Node 1/4: waiting for barrier after <function YamlBuilder._setup_experiments at 0x2b616a346a60>

too the point where minimization starts

2017-05-21 15:19:10,651: Minimizing 1 sampler states in the reference thermodynamic state

After that, it seems to be ~3 min/replica to minimize.

The biggest gap in time seems to be between these log statements:

2017-05-21 14:12:21,056: Node 1/4: waiting for broadcast of <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x2b618126ab70>>
2017-05-21 14:55:23,521: Node 1/4: executing <function ReplicaExchange._initialize_reporter at 0x2b6164d57158>

What could be going on that takes 43 minutes here?

Something seems very wrong.

jchodera commented 7 years ago

I finally managed to get to the propagation stage on lilac! It took 250 minutes from start to first Propagating all replicas...:

-bash-4.2$ cat abl-imatinib-explicit.356506.log
Sender: LSF System <lsfadmin@ls13>
Subject: Job 356506: <abl-imatinib-explicit> in cluster <lila> Exited

Job <abl-imatinib-explicit> was submitted from host <lilac> by user <chodera> in cluster <lila> at Sun May 21 17:49:54 2017.
Job was executed on host(s) <2*ls13>, in queue <general>, as user <chodera> in cluster <lila> at Sun May 21 17:49:54 2017.
                            <2*ls15>
</home/chodera> was used as the home directory.
</home/chodera/github/yank-examples/examples/binding/abl-imatinib> was used as the working directory.
Started at Sun May 21 17:49:54 2017.
Terminated at Sun May 21 20:49:16 2017.
Results reported at Sun May 21 20:49:16 2017.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -W 24:00
#BSUB -o abl-imatinib-explicit.%J.log
#BSUB -env "all, LSB_START_JOB_MPS=Y"
#BSUB -n 4 -R "rusage[ngpus_excl_p=1,mem=8]"
#BSUB -m ls-gpu
#BSUB -J "abl-imatinib-explicit"

# Run the simulation with verbose output:
#source activate yank16
export PREFIX="explicit"
build_mpirun_configfile  "yank script --yaml=$PREFIX.yaml"
mpiexec.hydra -f hostfile -configfile configfile

------------------------------------------------------------

Exited with exit code 255.

Resource usage summary:

    CPU time :                                   42001.46 sec.
    Max Memory :                                 22 GB
    Average Memory :                             15.57 GB
    Total Requested Memory :                     32.00 GB
    Delta Memory :                               10.00 GB
    Max Swap :                                   -
    Max Processes :                              17
    Max Threads :                                106
    Run time :                                   10763 sec.
    Turnaround time :                            10762 sec.

The output (if any) follows:

2017-05-21 17:50:00,374: Setting CUDA platform to use precision model 'mixed'.
2017-05-21 17:50:00,385: Node 1/4: executing <function YamlBuilder._check_resume at 0x2b1387c517b8>
2017-05-21 17:50:00,387: Node 1/4: waiting for barrier after <function YamlBuilder._check_resume at 0x2b1387c517b8>
2017-05-21 17:50:00,394: Node 1/4: executing <function YamlBuilder._setup_experiments at 0x2b1387c51a60>
2017-05-21 17:50:00,395: Setting up the systems for Abl, STI_0 and pme
2017-05-21 17:50:33,641: 4-[(4-methylpiperazin-4-ium-1-yl)methyl]-~{N}-[4-methyl-3-[[4-(3-pyridyl)pyrimidin-2-yl]amino]phenyl]benzamide
2017-05-21 17:51:11,949: Setting up the systems for Abl, STI_1 and pme
2017-05-21 17:51:49,095: 4-[(4-methylpiperazin-1-ium-1-yl)methyl]-~{N}-[4-methyl-3-[[4-(3-pyridyl)pyrimidin-2-yl]amino]phenyl]benzamide
2017-05-21 17:52:26,412: Setting up the systems for Abl, STI_2 and pme
2017-05-21 17:52:58,559: 4-[(4-methylpiperazin-1-yl)methyl]-~{N}-[4-methyl-3-[[4-(3-pyridyl)pyrimidin-2-yl]amino]phenyl]benzamide
2017-05-21 17:53:31,540: Node 1/4: waiting for barrier after <function YamlBuilder._setup_experiments at 0x2b1387c51a60>
2017-05-21 17:53:31,541: Node 1/4: executing <function YamlBuilder._safe_makedirs at 0x2b1387c51c80>
2017-05-21 17:53:31,542: Node 1/4: waiting for barrier after <function YamlBuilder._safe_makedirs at 0x2b1387c51c80>
2017-05-21 17:53:31,546: Node 1/4: executing <bound method YamlBuilder._generate_yaml of <yank.yamlbuild.YamlBuilder object at 0x2b135b59c518>>
2017-05-21 17:53:31,568: DSL string for the ligand: "resname MOL"
2017-05-21 17:53:31,569: DSL string for the solvent: "auto"
2017-05-21 17:53:31,573: Reading phase complex
2017-05-21 17:53:31,574: prmtop: explicit/setup/systems/Abl-STI_STI0/complex.prmtop
2017-05-21 17:53:31,574: inpcrd: explicit/setup/systems/Abl-STI_STI0/complex.inpcrd
2017-05-21 17:53:37,440: ligand_atoms net charge: 1
2017-05-21 17:53:37,818: Ions net charges: [('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Na+', 1), ('Cl-', -1)]
2017-05-21 17:53:37,818: Reading phase solvent
2017-05-21 17:53:37,819: prmtop: explicit/setup/systems/Abl-STI_STI0/solvent.prmtop
2017-05-21 17:53:37,819: inpcrd: explicit/setup/systems/Abl-STI_STI0/solvent.inpcrd
2017-05-21 17:53:38,461: solute_atoms net charge: 1
2017-05-21 17:53:38,483: Ions net charges: [('Cl-', -1)]
2017-05-21 17:53:38,483: Node 1/4: executing <function YamlBuilder._save_analysis_script at 0x2b1387c51ae8>
2017-05-21 17:53:38,488: Initial checkpoint file automatically chosen as explicit/experiments/AblSTISTI0/complex_checkpoint.nc
2017-05-21 17:54:00,637: Creating receptor-ligand restraints...
2017-05-21 17:54:00,638: There are undefined restraint parameters. Trying automatic parametrization.
2017-05-21 17:54:00,638: Automatically selecting restraint atoms and parameters:
2017-05-21 17:54:00,638: Attempt 0 / 100 at automatically selecting atoms and restraint parameters...
2017-05-21 17:54:01,617: restraint parameters:
                  phi_C0 : -1.15517 rad
                  K_phiB : 20.0 kcal/(mol rad**2)
                   r_aA0 : 0.388129 nm
                  phi_A0 : -1.70351 rad
                theta_A0 : 0.716615 rad
                  K_phiC : 20.0 kcal/(mol rad**2)
                theta_B0 : 1.0231 rad
                K_thetaB : 20.0 kcal/(mol rad**2)
                     K_r : 20.0 kcal/(A**2 mol)
                  K_phiA : 20.0 kcal/(mol rad**2)
                  phi_B0 : 1.30322 rad
                K_thetaA : 20.0 kcal/(mol rad**2)

2017-05-21 17:54:05,124: Creating alchemically-modified states...
2017-05-21 17:54:20,086: Create alchemically modified system took   13.224s
2017-05-21 18:15:03,971: Creating expanded cutoff states...
2017-05-21 18:15:16,951: Setting cutoff for fully interacting system to maximum allowed 16.0 A
2017-05-21 18:15:53,455: Setting cutoff for fully interacting system to maximum allowed 16.0 A
2017-05-21 18:15:59,035: Creating sampler object...
2017-05-21 18:15:59,041: Node 1/4: executing <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x2b139e8377f0>>
2017-05-21 18:15:59,045: Node 1/4: waiting for broadcast of <bound method Reporter.storage_exists of <yank.repex.Reporter object at 0x2b139e8377f0>>
2017-05-21 18:42:14,371: Node 1/4: executing <function ReplicaExchange._initialize_reporter at 0x2b1382452158>
2017-05-21 18:42:26,552: Serialized state thermodynamic_states/0 is  39974314B | 39037.416KB | 38.122MB
2017-05-21 18:53:56,284: Serialized state unsampled_states/0 is  14592698B | 14250.682KB | 13.917MB
2017-05-21 18:54:31,041: Serialized state unsampled_states/1 is  39974239B | 39037.343KB | 38.122MB
2017-05-21 18:55:32,457: Storing thermodynamic states took  798.055s
2017-05-21 18:55:32,458: Storing general ReplicaExchange options...
2017-05-21 18:56:06,254: Node 1/4: executing <function ReplicaExchange._report_iteration at 0x2b1382452400>
2017-05-21 18:56:06,346: Storing sampler states took    0.092s
2017-05-21 18:56:07,430: Writing iteration information to storage took    1.175s
2017-05-21 18:56:07,430: Node 1/4: waiting for barrier after <function ReplicaExchange._initialize_reporter at 0x2b1382452158>
2017-05-21 19:16:12,751: Minimizing 1 sampler states in the reference thermodynamic state
2017-05-21 19:16:12,752: Node 1/4: execute _minimize_sampler_state(0)
2017-05-21 19:16:35,451: Sampler state 1/52: initial energy -60689.291kT
2017-05-21 19:23:16,351: Sampler state 1/52: final energy -313895.538kT
2017-05-21 19:23:16,352: Node 1/4: allgather results of _minimize_sampler_state
2017-05-21 19:23:16,362: Node 1/4: executing <bound method Reporter.write_sampler_states of <yank.repex.Reporter object at 0x2b139e8377f0>>
2017-05-21 19:23:16,441: Storing sampler states took    0.078s
2017-05-21 19:23:16,441: Minimizing all replicas...
2017-05-21 19:23:16,441: Node 1/4: execute _minimize_replica(0)
2017-05-21 19:23:34,897: Replica 1/52: initial energy -313944.711kT
2017-05-21 19:25:10,704: Replica 1/52: final energy -314121.707kT
2017-05-21 19:25:10,704: Node 1/4: execute _minimize_replica(4)
2017-05-21 19:25:10,738: Replica 5/52: initial energy -313939.364kT
2017-05-21 19:26:26,983: Replica 5/52: final energy -314080.254kT
2017-05-21 19:26:26,983: Node 1/4: execute _minimize_replica(8)
2017-05-21 19:26:27,014: Replica 9/52: initial energy -313917.984kT
2017-05-21 19:27:58,373: Replica 9/52: final energy -314107.115kT
2017-05-21 19:27:58,374: Node 1/4: execute _minimize_replica(12)
2017-05-21 19:27:58,405: Replica 13/52: initial energy -313755.406kT
2017-05-21 19:29:27,956: Replica 13/52: final energy -313954.692kT
2017-05-21 19:29:27,956: Node 1/4: execute _minimize_replica(16)
2017-05-21 19:29:28,007: Replica 17/52: initial energy -313592.829kT
2017-05-21 19:31:27,093: Replica 17/52: final energy -313888.378kT
2017-05-21 19:31:27,094: Node 1/4: execute _minimize_replica(20)
2017-05-21 19:31:27,125: Replica 21/52: initial energy -313430.252kT
2017-05-21 19:33:24,442: Replica 21/52: final energy -313744.089kT
2017-05-21 19:33:24,442: Node 1/4: execute _minimize_replica(24)
2017-05-21 19:33:24,474: Replica 25/52: initial energy -313267.674kT
2017-05-21 19:34:40,992: Replica 25/52: final energy -313525.440kT
2017-05-21 19:34:40,993: Node 1/4: execute _minimize_replica(28)
2017-05-21 19:34:41,025: Replica 29/52: initial energy -313105.097kT
2017-05-21 19:36:24,504: Replica 29/52: final energy -313480.575kT
2017-05-21 19:36:24,504: Node 1/4: execute _minimize_replica(32)
2017-05-21 19:36:26,026: Replica 33/52: initial energy -313103.285kT
2017-05-21 19:37:58,593: Replica 33/52: final energy -313431.210kT
2017-05-21 19:37:58,593: Node 1/4: execute _minimize_replica(36)
2017-05-21 19:38:00,198: Replica 37/52: initial energy -313084.447kT
2017-05-21 19:40:46,853: Replica 37/52: final energy -313629.620kT
2017-05-21 19:40:46,853: Node 1/4: execute _minimize_replica(40)
2017-05-21 19:40:48,395: Replica 41/52: initial energy -313058.127kT
2017-05-21 19:43:38,576: Replica 41/52: final energy -313612.709kT
2017-05-21 19:43:38,576: Node 1/4: execute _minimize_replica(44)
2017-05-21 19:43:40,115: Replica 45/52: initial energy -313028.252kT
2017-05-21 19:46:54,772: Replica 45/52: final energy -313600.721kT
2017-05-21 19:46:54,772: Node 1/4: execute _minimize_replica(48)
2017-05-21 19:46:56,361: Replica 49/52: initial energy -313004.682kT
2017-05-21 19:49:00,335: Replica 49/52: final energy -313347.556kT
2017-05-21 19:49:00,335: Node 1/4: gather results of _minimize_replica
2017-05-21 19:53:02,550: Node 1/4: executing <bound method Reporter.write_sampler_states of <yank.repex.Reporter object at 0x2b139e8377f0>>
2017-05-21 19:53:02,631: Storing sampler states took    0.081s
2017-05-21 19:53:02,632: Minimizing all replicas took 1786.191s
2017-05-21 19:53:02,646: Node 1/4: execute _compute_replica_energies(0)
2017-05-21 19:54:04,007: Node 1/4: execute _compute_replica_energies(4)
2017-05-21 19:54:41,496: Node 1/4: execute _compute_replica_energies(8)
2017-05-21 19:55:17,735: Node 1/4: execute _compute_replica_energies(12)
2017-05-21 19:55:55,275: Node 1/4: execute _compute_replica_energies(16)
2017-05-21 19:56:32,061: Node 1/4: execute _compute_replica_energies(20)
2017-05-21 19:57:08,573: Node 1/4: execute _compute_replica_energies(24)
2017-05-21 19:57:45,171: Node 1/4: execute _compute_replica_energies(28)
2017-05-21 19:58:21,138: Node 1/4: execute _compute_replica_energies(32)
2017-05-21 19:58:57,826: Node 1/4: execute _compute_replica_energies(36)
2017-05-21 19:59:34,558: Node 1/4: execute _compute_replica_energies(40)
2017-05-21 20:00:11,184: Node 1/4: execute _compute_replica_energies(44)
2017-05-21 20:00:47,933: Node 1/4: execute _compute_replica_energies(48)
2017-05-21 20:01:25,495: Node 1/4: gather results of _compute_replica_energies
2017-05-21 20:20:17,481: Computing energy matrix took 1634.835s
2017-05-21 20:20:17,481: Node 1/4: executing <bound method Reporter.write_energies of <yank.repex.Reporter object at 0x2b139e8377f0>>
2017-05-21 20:20:17,482: Iteration 1/100
2017-05-21 20:20:17,482: Node 1/4: executing <function ReplicaExchange._mix_replicas at 0x2b1382452950>
2017-05-21 20:20:17,483: Mixing replicas...
2017-05-21 20:20:17,853: Mixing of replicas took    0.371s
2017-05-21 20:20:17,854: Accepted 943768/14623232 attempted swaps (6.5%)
2017-05-21 20:20:17,854: Node 1/4: waiting for broadcast of <function ReplicaExchange._mix_replicas at 0x2b1382452950>
2017-05-21 20:20:17,854: Propagating all replicas...

It ran for a little while, but died shortly thereafter

2017-05-21 20:47:50,613: Node 1/4: execute _propagate_replica(48)
2017-05-21 20:47:52,302: Applying MCDisplacementMove took    1.690s
2017-05-21 20:47:53,957: Applying MCRotationMove took    1.654s
2017-05-21 20:47:54,154: LangevinDynamicsMove: Context obtained, platform is CUDA
2017-05-21 20:48:22,890: LangevinDynamicsMove: Context request took    0.197s
2017-05-21 20:48:22,890: LangevinDynamicsMove: update sampler state took    0.001s
2017-05-21 20:48:22,890: LangevinDynamicsMove: step(2500) took   28.550s
2017-05-21 20:48:22,890: Node 1/4: gather results of _propagate_replica
2017-05-21 20:48:22,990: Node 1/4: execute _get_replica_move_statistics(0)
2017-05-21 20:48:22,990: Node 1/4: execute _get_replica_move_statistics(4)
2017-05-21 20:48:22,990: Node 1/4: execute _get_replica_move_statistics(8)
2017-05-21 20:48:22,990: Node 1/4: execute _get_replica_move_statistics(12)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(16)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(20)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(24)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(28)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(32)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(36)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(40)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(44)
2017-05-21 20:48:22,991: Node 1/4: execute _get_replica_move_statistics(48)
2017-05-21 20:48:22,992: Node 1/4: allgather results of _get_replica_move_statistics
2017-05-21 20:48:22,992: Propagating all replicas took  425.249s
2017-05-21 20:48:22,992: Node 1/4: execute _compute_replica_energies(0)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 47947 RUNNING AT ls15
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@ls13] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:0@ls13] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@ls13] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ls13] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@ls13] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@ls13] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@ls13] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

It in fact dumped a core:

-rw------- 1 chodera jclab 5459243008 May 21 20:48 core.47947
jchodera commented 7 years ago

I restarted the job to try to resume, but while the CPU is throttled to 100%, there's nothing on the GPU 40 minutes after restarting it:

-bash-4.2$ nvidia-smi
Sun May 21 22:36:22 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48                 Driver Version: 367.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 0000:02:00.0     Off |                  N/A |
|  0%   37C    P8     9W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 1080    Off  | 0000:03:00.0     Off |                  N/A |
|  0%   41C    P8    11W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 1080    Off  | 0000:83:00.0     Off |                  N/A |
|  0%   36C    P8     9W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 1080    Off  | 0000:84:00.0     Off |                  N/A |
|  0%   42C    P8     9W / 180W |      2MiB /  8113MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Output doesn't seem to be spooled immediately, so I have no idea what it's actually doing right now. I think this makes it all the more important to spool logs elsewhere.

jchodera commented 7 years ago

Restarting it on lilac ran for a few more iterations and then core dumped again:

2017-05-22 01:15:03,680: Computing energy matrix took  834.888s
2017-05-22 01:15:03,681: Node 1/4: executing <function ReplicaExchange._report_iteration at 0x2b21516be400>
2017-05-22 01:15:03,681: Iteration 6 not on the Checkpoint Interval of 10. Sampler State not written.
2017-05-22 01:15:03,681: Storing sampler states took    0.000s
2017-05-22 01:15:03,918: Writing iteration information to storage took    0.237s
2017-05-22 01:15:03,918: Iteration took 1260.025s.
2017-05-22 01:15:03,918: Estimated completion in 1 day, 8:52:22.106335, at Fri Jan  2 05:58:15 1970 (consuming total wall clock time 1 day, 10:58:15.857803).
2017-05-22 01:15:03,918: Iteration 7/100
2017-05-22 01:15:03,919: Node 1/4: executing <function ReplicaExchange._mix_replicas at 0x2b21516be950>
2017-05-22 01:15:03,919: Mixing replicas...
2017-05-22 01:15:04,273: Mixing of replicas took    0.354s
2017-05-22 01:15:04,273: Accepted 819046/14623232 attempted swaps (5.6%)
2017-05-22 01:15:04,273: Node 1/4: waiting for broadcast of <function ReplicaExchange._mix_replicas at 0x2b21516be950>
2017-05-22 01:15:04,273: Propagating all replicas...
2017-05-22 01:15:04,273: Node 1/4: execute _propagate_replica(0)
2017-05-22 01:15:05,896: Applying MCDisplacementMove took    1.622s
2017-05-22 01:15:06,065: Applying MCRotationMove took    0.170s
2017-05-22 01:15:16,381: LangevinDynamicsMove: Context obtained, platform is CUDA
2017-05-22 01:15:46,936: LangevinDynamicsMove: Context request took   10.315s
2017-05-22 01:15:46,937: LangevinDynamicsMove: step(2500) took   30.392s
2017-05-22 01:15:46,937: LangevinDynamicsMove: update sampler state took    0.001s
2017-05-22 01:15:46,937: Node 1/4: execute _propagate_replica(4)

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 58050 RUNNING AT ls13
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@ls02] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:1@ls02] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:1@ls02] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

At the head of the spool output file is

Sender: LSF System <lsfadmin@ls13>
Subject: Job 356507: <abl-imatinib-explicit> in cluster <lila> Exited

Job <abl-imatinib-explicit> was submitted from host <lilac> by user <chodera> in cluster <lila> at Sun May 21 21:54:55 2017.
Job was executed on host(s) <2*ls13>, in queue <general>, as user <chodera> in cluster <lila> at Sun May 21 21:54:55 2017.
                            <2*ls02>
</home/chodera> was used as the home directory.
</home/chodera/github/yank-examples/examples/binding/abl-imatinib> was used as the working directory.
Started at Sun May 21 21:54:55 2017.
Terminated at Mon May 22 01:16:26 2017.
Results reported at Mon May 22 01:16:26 2017.

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
#!/bin/bash
#BSUB -W 24:00
#BSUB -o abl-imatinib-explicit.%J.log
#BSUB -env "all, LSB_START_JOB_MPS=Y"
#BSUB -n 4 -R "rusage[ngpus_excl_p=1,mem=16]"
#BSUB -m ls-gpu
#BSUB -J "abl-imatinib-explicit"

# Run the simulation with verbose output:
#source activate yank16
export PREFIX="explicit"
build_mpirun_configfile  "yank script --yaml=$PREFIX.yaml"
mpiexec.hydra -f hostfile -configfile configfile

------------------------------------------------------------

Exited with exit code 139.

Resource usage summary:

    CPU time :                                   48135.70 sec.
    Max Memory :                                 13 GB
    Average Memory :                             10.81 GB
    Total Requested Memory :                     64.00 GB
    Delta Memory :                               51.00 GB
    Max Swap :                                   -
    Max Processes :                              20
    Max Threads :                                106
    Run time :                                   12091 sec.
    Turnaround time :                            12091 sec.

Any idea what's going on?

For reference, I'm in /home/chodera/github/yank-examples/examples/binding/abl-imatinib

jchodera commented 7 years ago

I'm using the May 12 dev build of openmm and the latest masters of openmmtools and yank:

>>> print(simtk.openmm.version.version)
7.2.0.dev-54ac989
andrrizzi commented 7 years ago

The latest openmmtools doesn't have the memory optimizations yet. That PR still needs review, can you try with branch optimizations? The startup was fine when I used that one with FreeSolv.

jchodera commented 7 years ago

Sure, I can give it a try, but it doesn't appear that memory consumption is the limiting factor according to the LSF readouts.

Are you running tests on lilac too?

andrrizzi commented 7 years ago

Both right after creating the alchemically modified system and checking that the storage exists, the code does a lot of deepcopy of the ThermodynamicStates. With the memory optimization the time and memory required to do that should be O(1).

andrrizzi commented 7 years ago

I run on hal.

jchodera commented 7 years ago

Ok, will give it a try!

Let's make sure to get you on lilac too. It looks underutilized at the moment.

andrrizzi commented 7 years ago

This should have been largely addressed in #684 . Feel free to reopen if still a problem, or open a new issue for specific idea.