Acellera / htmd

HTMD: Programming Environment for Molecular Discovery
https://software.acellera.com/docs/latest/htmd/index.html

Running adaptive sampling using AMBER on a cluster using SLURM or PBS #255

Closed eric-jm-lang closed 7 years ago

eric-jm-lang commented 7 years ago

Hello, I am very interested in using htmd to run some adaptive sampling simulations. However, the examples I have seen on adaptive sampling seem to deal only with ACEMD on a local GPU cluster. I would like to know if it is possible to run adaptive sampling using Amber on GPUs (i.e. pmemd.cuda) on a cluster that relies on either PBS or SLURM to manage the queue. If yes, could you please let me know what I should specify in my scripts to be able to run this kind of adaptive sampling? Many thanks in advance, Eric

jeiros commented 7 years ago

Also, I did change the line in adaptiverun.py to accept prmtop files, but since I upgraded to the new htmd version this morning, that change was overwritten. I'll play around with changing it once I get my hands on the 1.7.13 release.

stefdoerr commented 7 years ago

Ah yes, that ions-escaping thing is horrible, but it's not our fault. VMD (whose atom selection syntax we use) is just weird like that.

Yes, try with rst7; just remember to also read it correctly in your input file. 1.7.13 is out, so give it a try whenever you find time :) Thanks!

jeiros commented 7 years ago

Again, conda behaving weirdly:

$ conda upgrade htmd
Fetching package metadata ...............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /home/je714/anaconda3/envs/htmd-py35:
#
htmd                      1.7.11                   py35_0    acellera

If I make a new env with python 3.6 and the full anaconda installation: conda create --name htmd-py36 anaconda -y and then look for HTMD

$ anaconda search htmd
Using Anaconda API: https://api.anaconda.org
Run 'anaconda show <USER/PACKAGE>' to get more details:
Packages:
     Name                      |  Version | Package Types   | Platforms
     ------------------------- |   ------ | --------------- | ---------------
     acellera-basic/htmd       |   1.0.10 | conda           | linux-64, win-64, osx-64
     acellera-basic/htmd-data  |   0.0.33 | conda           | linux-64, win-64, osx-64
     acellera-basic/htmdbabel  |   2.3.95 | conda           | linux-64, osx-64
     acellera/HTMD             |   1.7.13 | conda           | linux-64, win-64, osx-64
                                          : High Throughput Molecular Dynamics

The new release is there. But:

$ conda install -c acellera htmd=1.7.13
Fetching package metadata ...............

PackageNotFoundError: Package not found: '' Package missing in current linux-64 channels:
  - htmd 1.7.13*

You can search for packages on anaconda.org with

    anaconda search -t conda htmd

conda install seems to still be picking up the 1.7.11 release. I'll give it a bit of time to see if it detects the new release (?)

stefdoerr commented 7 years ago

Maybe try conda uninstall htmd --force and install again? Seems to help sometimes.

Works fine on my fresh py36 miniconda install.

jeiros commented 7 years ago

That still picks up the 1.7.11 version

(htmd-py35) je714@titanx2:~$ conda uninstall htmd --force
Fetching package metadata ...............

Package plan for package removal in environment /home/je714/anaconda3/envs/htmd-py35:

The following packages will be REMOVED:

    htmd: 1.7.11-py35_0 acellera

Proceed ([y]/n)? y

(htmd-py35) je714@titanx2:~$ conda install htmd
Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /home/je714/anaconda3/envs/htmd-py35:

The following NEW packages will be INSTALLED:

    htmd: 1.7.11-py35_0 acellera

Proceed ([y]/n)? y
stefdoerr commented 7 years ago

But it's trying to pull 3.5, which is weird. You can also see here that both 3.5 and 3.6 versions are available for 1.7.13: https://anaconda.org/acellera/HTMD/files

I am sorry, I can't help beyond suggesting a fresh install :disappointed: Conda is just annoying sometimes...

jeiros commented 7 years ago

Yes it's weird that anaconda search htmd finds the new version but conda install -c acellera htmd=1.7.13 doesn't do anything. I'll give it some time 😕

Note: Solved it by doing:

$ wget https://anaconda.org/acellera/HTMD/1.7.13/download/linux-64/htmd-1.7.13-py35_0.tar.bz2
$ conda install --offline htmd-1.7.13-py35_0.tar.bz2

not the prettiest, but I think it worked:

$ conda list | grep htmd
htmd                      1.7.13                   py35_0    file:///home/je714
j3mdamas commented 7 years ago

I think it's just because you're on OSX and we did not make sure the OSX build passed. I am going to restart the OSX build and I'll let you know when it's available.


jeiros commented 7 years ago

I'm running this on a Linux machine since that's where I have the GPUs

stefdoerr commented 7 years ago

I have a suspicion this might be related to the automatic dependency generation. I will take a look at it tomorrow.

mj-harvey commented 7 years ago

Were you starting with a fresh htmd-py35 environment? If not, make a new one and try again. There shouldn't be any need to use 3.5 anymore; our release for 3.6 is out.

jeiros commented 7 years ago

Hi, I managed to get the 1.7.13 version for a python 3.6 environment.

Starting from zero, creating the input files:

ProdTest = Production()
ProdTest.amber.nstlim = 2500
ProdTest.amber.ntx = 2
ProdTest.amber.irest = 0
ProdTest.amber.parmfile = 'structure.prmtop'
ProdTest.amber.coordinates = 'structure.ncrst'
ProdTest.amber.dt = 0.004
ProdTest.amber.ntpr = 500
ProdTest.amber.ntwr = 500
ProdTest.amber.ntwx = 250

ProdTest.write('./', './ready')

This gives the following:

$ ll ready/
total 63012
-rw-rw-r-- 1 je714 je714      122 Mar 14 10:25 MD.sh
-rw-rw-r-- 1 je714 je714      230 Mar 14 10:25 Production.in
-rw-r--r-- 1 je714 je714 14509680 Mar 14 10:25 structure.ncrst
-rw-r--r-- 1 je714 je714 50002440 Mar 14 10:25 structure.prmtop
$  cat ready/MD.sh
ENGINE -O -i Production.in -o Production.out -p structure.prmtop -c structure.ncrst -x Production.nc -r Production_new.rst

Using @stefdoerr's commands:

app = htmd.LocalGPUQueue()
app.devices = [0,]

adapt = htmd.AdaptiveMD()
adapt.nmin = 1
adapt.nmax = 3
adapt.nepochs = 10
adapt.updateperiod = 100
adapt.projection = htmd.projections.metricdistance.MetricDistance(sel1='resname LIG and name C6 C10 C19', sel2='name CA')
adapt.app = app
adapt.filtersel = 'not water and not resname "Na\+" "Cl\-"'
adapt.generatorspath = './ready'
adapt.inputpath = './input'
adapt.datapath = './data'
adapt.filteredpath = './filtered'
adapt.coorname = 'structure.ncrst'
adapt.run()

Fails with the following logs & Traceback:

2017-03-14 10:25:47,756 - htmd.adaptive.adaptive - INFO - Processing epoch 0
2017-03-14 10:25:47,758 - htmd.adaptive.adaptive - INFO - Epoch 0, generating first batch
2017-03-14 10:25:47,759 - htmd.adaptive.adaptive - INFO - Generators folder has no subdirectories, using folder itself
2017-03-14 10:25:47,932 - htmd.queues.localqueue - INFO - Using GPU devices 0
2017-03-14 10:25:47,934 - htmd.queues.localqueue - INFO - Queueing /home/je714/try_adaptive/from_manual_build/input/e1s1_ready
2017-03-14 10:25:47,935 - htmd.queues.localqueue - INFO - Running /home/je714/try_adaptive/from_manual_build/input/e1s1_ready on GPU device 0
2017-03-14 10:25:47,935 - htmd.queues.localqueue - INFO - Queueing /home/je714/try_adaptive/from_manual_build/input/e1s2_ready
2017-03-14 10:25:47,939 - htmd.queues.localqueue - INFO - Queueing /home/je714/try_adaptive/from_manual_build/input/e1s3_ready
2017-03-14 10:25:47,945 - htmd.adaptive.adaptive - INFO - Sleeping for 100 seconds.
2017-03-14 10:25:47,948 - htmd.queues.localqueue - INFO - Error in simulation /home/je714/try_adaptive/from_manual_build/input/e1s1_ready. Command '/home/je714/try_adaptive/from_manual_build/input/e1s1_ready/job.sh' returned non-zero exit status 127.
2017-03-14 10:25:47,950 - htmd.queues.localqueue - INFO - Running /home/je714/try_adaptive/from_manual_build/input/e1s2_ready on GPU device 0
2017-03-14 10:25:47,958 - htmd.queues.localqueue - INFO - Error in simulation /home/je714/try_adaptive/from_manual_build/input/e1s2_ready. Command '/home/je714/try_adaptive/from_manual_build/input/e1s2_ready/job.sh' returned non-zero exit status 127.
2017-03-14 10:25:47,960 - htmd.queues.localqueue - INFO - Running /home/je714/try_adaptive/from_manual_build/input/e1s3_ready on GPU device 0
2017-03-14 10:25:47,970 - htmd.queues.localqueue - INFO - Error in simulation /home/je714/try_adaptive/from_manual_build/input/e1s3_ready. Command '/home/je714/try_adaptive/from_manual_build/input/e1s3_ready/job.sh' returned non-zero exit status 127.
2017-03-14 10:27:27,953 - htmd.adaptive.adaptive - INFO - Processing epoch 1
2017-03-14 10:27:27,955 - htmd.adaptive.adaptive - INFO - Retrieving simulations.
2017-03-14 10:27:27,957 - htmd.adaptive.adaptive - INFO - 0 simulations in progress
2017-03-14 10:27:27,959 - htmd.adaptive.adaptiverun - INFO - Postprocessing new data
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-11-828aa70f8ca4> in <module>()
     15 adapt.filteredpath = './filtered'
     16 adapt.coorname = 'structure.ncrst'
---> 17 adapt.run()

/home/je714/anaconda3/envs/htmd-py36/lib/python3.6/site-packages/htmd/adaptive/adaptive.py in run(self)
     97                 # If currently running simulations are lower than nmin start new ones to reach nmax number of sims
     98                 if self._running <= self.nmin and epoch < self.nepochs:
---> 99                     flag = self._algorithm()
    100                     if flag is False:
    101                         self._unsetLock()

/home/je714/anaconda3/envs/htmd-py36/lib/python3.6/site-packages/htmd/adaptive/adaptiverun.py in _algorithm(self)
    123 
    124     def _algorithm(self):
--> 125         data = self._getData(self._getSimlist())
    126         if not self._checkNFrames(data): return False
    127         self._createMSM(data)

/home/je714/anaconda3/envs/htmd-py36/lib/python3.6/site-packages/htmd/adaptive/adaptiverun.py in _getSimlist(self)
    140         logger.info('Postprocessing new data')
    141         sims = simlist(glob(path.join(self.datapath, '*', '')), glob(path.join(self.inputpath, '*', 'structure.pdb')),
--> 142                        glob(path.join(self.inputpath, '*', '')))
    143         if self.filter:
    144             sims = simfilter(sims, self.filteredpath, filtersel=self.filtersel)

/home/je714/anaconda3/envs/htmd-py36/lib/python3.6/site-packages/htmd/simlist.py in simlist(datafolders, molfiles, inputfolders)
    132 
    133     if not datafolders:
--> 134         raise FileNotFoundError('No data folders were given, check your arguments.')
    135     if not molfiles:
    136         raise FileNotFoundError('No molecule files were given, check your arguments.')

FileNotFoundError: No data folders were given, check your arguments.

It's failing to launch the simulations. From what I've seen, the job.sh script is expecting a run.sh script to launch the simulations with the appropriate command, but the one produced by the htmd.protocols.pmemdproduction.Production class is called MD.sh.

$ l input/e1s1_ready/
total 62M
-rw-rw-r-- 1 je714 122 Mar 14 10:25 MD.sh
-rw-rw-r-- 1 je714 230 Mar 14 10:25 Production.in
-rwx------ 1 je714 172 Mar 14 10:25 job.sh
-rw-r--r-- 1 je714 14M Mar 14 10:25 structure.ncrst
-rw-r--r-- 1 je714 48M Mar 14 10:25 structure.prmtop
$ cat input/e1s1_ready/job.sh
#!/bin/bash

export CUDA_VISIBLE_DEVICES=0
cd /home/je714/try_adaptive/from_manual_build/input/e1s1_ready
/home/je714/try_adaptive/from_manual_build/input/e1s1_ready/run.sh

Also, we are not doing

adapt.app = htmd.apps.pmemdlocal.PmemdLocal(
    pmemd='/usr/local/amber/bin/pmemd.cuda_SPFP',
    datadir='./data',
    devices=[0, 1, 2, 3])

but

adapt.app = app

where app is an htmd.LocalGPUQueue object. So MD.sh doesn't know what ENGINE it should use, and it is not overwritten:

$ cat input/e1s1_ready/MD.sh
ENGINE -O -i Production.in -o Production.out -p structure.prmtop -c structure.ncrst -x Production.nc -r Production_new.rst
stefdoerr commented 7 years ago

The pmemdproduction class indeed needs rewriting. For the moment I would suggest you just write a run.sh script manually, which will then be called by LocalGPUQueue via the job.sh script.

To make it clearer: we decided to separate the queuing systems from the simulation software (hence we no longer use Apps, but Queues). They work as follows: the protocols are software-specific (ours are for Acemd; you wrote the one for pmemd) and they write out a run.sh file which should be standalone enough to execute a simulation. Then the queueing classes simply write a job.sh which does some queue-specific work, like hiding all GPUs except one or submitting to SLURM, and then calls your run.sh, which runs the simulation.

So now the problem is that the pmemdproduction module is out of date and needs updating to this new scheme. The engine will need to be passed to the pmemdproduction class so that it writes it into the run.sh script.

On the matter of the error with simlist: the error just tells you that there are no subfolders in your data folder, while you do have some called eXsX in the input folder, and your retrieve method didn't create any data folders either. It might in a way be a minor bug. You can fix it by starting from epoch 1, since I see that you have simulations for epoch 1, and putting them into your data directory.
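Following that split, one interim workaround is to drop a standalone run.sh into each generator folder yourself. Below is a minimal sketch; the pmemd binary path and the file names are assumptions taken from the MD.sh shown above, so adjust them to your setup:

```python
import os
import stat

# Assumed engine path; not something HTMD provides -- point it at your pmemd build.
PMEMD = "/usr/local/amber/bin/pmemd.cuda_SPFP"

# A standalone run.sh with the engine hard-coded in place of the ENGINE placeholder.
RUN_SH = f"""#!/bin/bash
cd "$(dirname "$0")"
{PMEMD} -O -i Production.in -o Production.out \\
    -p structure.prmtop -c structure.ncrst \\
    -x Production.nc -r Production_new.rst
"""

def write_run_script(directory, contents=RUN_SH):
    """Write an executable run.sh into a generator/input directory."""
    path = os.path.join(directory, "run.sh")
    with open(path, "w") as fh:
        fh.write(contents)
    # job.sh executes run.sh directly, so it must carry the exec bit
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path
```

Calling write_run_script on each generator directory before adapt.run() should give job.sh something to execute.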

stefdoerr commented 7 years ago

So to summarize, the only two problems here are: a) we need to add the ENGINE to the PMEMD Production protocol so it is written into the run.sh file, and b) rename MD.sh to run.sh.

Right? I can do that.
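Until the protocol itself is fixed, both points could also be scripted by hand over the existing output. This is a hypothetical helper, not HTMD code; the engine path is an assumption:

```python
import os

def fix_generator_dir(directory, engine="/usr/local/amber/bin/pmemd.cuda_SPFP"):
    """Substitute the ENGINE placeholder in MD.sh and rename the script to
    run.sh, which is what LocalGPUQueue's job.sh expects to call."""
    md_sh = os.path.join(directory, "MD.sh")
    run_sh = os.path.join(directory, "run.sh")
    with open(md_sh) as fh:
        contents = fh.read().replace("ENGINE", engine, 1)
    with open(run_sh, "w") as fh:
        fh.write("#!/bin/bash\n" + contents)
    os.chmod(run_sh, 0o755)  # must be executable for job.sh to call it
    os.remove(md_sh)
    return run_sh
```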

jeiros commented 7 years ago

Thanks for clarifying. I got it to work for the moment by using

app = htmd.apps.pmemdlocal.PmemdLocal(
    pmemd='/usr/local/amber/bin/pmemd.cuda_SPFP',
    datadir='./data',
    devices=[0, 1, 2, 3])

adapt = htmd.AdaptiveMD()
adapt.nmin = 1
adapt.nmax = 3
adapt.nepochs = 10
adapt.updateperiod = 100
adapt.projection = htmd.projections.metricdistance.MetricDistance(sel1='resname LIG and name C6 C10 C19', sel2='name CA')
adapt.app = app
adapt.filtersel = 'not water and not resname "Na\+" "Cl\-"'
adapt.generatorspath = './ready'
adapt.inputpath = './input'
adapt.datapath = './data'
adapt.filteredpath = './filtered'
adapt.coorname = 'structure.ncrst'
adapt.run()

But I'll switch to using the queues.

jeiros commented 7 years ago

Yes that would be it. Don't worry about it, I can play around with it and submit a PR once I think it's working fine with the queues.

To make 'changes' to an htmd installation, here's what I do:

  1. Conda install htmd-deps on a new conda environment
  2. Clone the git repo
  3. export PYTHONPATH to the repo path
  4. cd into the htmd/ repo and run python setup.py install --user

Is that how you go about it?

I am not too sure it's working for me, since I keep getting "HTMD: Logging setup failed" when I import htmd.

stefdoerr commented 7 years ago

No, sorry. The setup.py doesn't work, as far as I know. I just do:

  1. clone repo
  2. prepend it to the PYTHONPATH
  3. Use the normal conda environment

That's it.

The only issue might be the compiled C .so libs, which you might have to copy from the conda installation (run htmd.home() to see where it's installed) into the equivalent git folder.
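That copy step could be sketched like this. Both paths are assumptions: point the source at the directory htmd.home() reports and the destination at your git clone. The helper just mirrors every .so file while preserving the relative layout:

```python
import glob
import os
import shutil

def copy_shared_libs(conda_htmd_home, git_htmd_dir):
    """Copy compiled .so files from a conda HTMD installation into the
    matching locations of a git checkout, preserving relative paths."""
    copied = []
    pattern = os.path.join(conda_htmd_home, "**", "*.so")
    for src in glob.glob(pattern, recursive=True):
        rel = os.path.relpath(src, conda_htmd_home)
        dst = os.path.join(git_htmd_dir, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy2(src, dst)  # copy2 keeps timestamps/permissions
        copied.append(dst)
    return copied
```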

j3mdamas commented 7 years ago

The setup.py is for PyPI packaging, as far as I recall, and it's still under development (#237).

stefdoerr commented 7 years ago

@jeiros I added automatic topology detection now to HTMD in the latest commit https://github.com/Acellera/htmd/commit/3d60ae316922e606d82d919e1272bd2e4312e9d3

So no reason to modify adaptiverun.py anymore to read prmtop files.

stefdoerr commented 7 years ago

@jeiros I am going to close this. Made a new issue for it