How should we feed input to YANK?

jchodera commented 10 years ago

We have to make some decisions about how we tell YANK what we want it to do.

To be specific, we need to tell it:

- What input to use for the ligand, which might be a mol2 file, SDF file, IUPAC or common name, AMBER prmtop/inpcrd pair, etc.
- What input to use for the receptor, which might eventually be a PDB file, a PDB ID, an AMBER prmtop/inpcrd pair, or even another small molecule (as in the host-guest case).
- If anything isn't parameterized, we need to tell YANK how to assign parameters.
- There are some other things we may need to tell YANK about how to set up systems in explicit solvent or build in missing atoms/residues.
- There are some run parameters too, like how many iterations to use, what kind of restraints, etc. Most of this should eventually be fully automated, but there are a few parameters right now.

We have a few options for how to specify this:

- Python scripts that use the Yank module. All parameters are coded in Python.
- Command-line scheme, perhaps using Robert's command-line tool, so we can say something like
  - yank setup to set up a calculation
  - yank run to run/resume a calculation
  - yank info to get some quick info on progress
  - yank analyze to analyze a calculation
- Some sort of input parameter file format, like XML or JSON
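
To make the command-line option concrete, here is a minimal sketch of the four subcommands listed above. It uses Python's standard argparse rather than Robert's tool (which isn't described in this thread), and the --store option is a hypothetical placeholder, not an existing YANK flag.

```python
import argparse

def build_parser():
    # Top-level "yank" command with one subparser per mode from the list above.
    parser = argparse.ArgumentParser(prog="yank")
    subparsers = parser.add_subparsers(dest="command")
    modes = [
        ("setup", "set up a calculation"),
        ("run", "run/resume a calculation"),
        ("info", "get some quick info on progress"),
        ("analyze", "analyze a calculation"),
    ]
    for name, help_text in modes:
        sub = subparsers.add_parser(name, help=help_text)
        # Hypothetical option: the directory holding the calculation's files.
        sub.add_argument("--store", default=".", help="calculation directory (assumed)")
    return parser

# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["run", "--store", "complexes/"])
print(args.command, args.store)
```

Each mode gets its own subparser, so mode-specific flags could be added later without touching the others.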

Thoughts?

jchodera commented 10 years ago

Just bumping this thread.

In talking with @bas-rustenburg, we thought it might be useful to have separate command-line modes, something like

# Set up using internal parameterization stuff (with help from OpenEye tools, OpenMM, pdbfixer, and gaff2xml)
yank setup --receptor receptorspec --ligand ligandspec --forcefield amber99sbildn --ffgen gaff
# Set up from LEaP files
yank ambersetup --prmtop prmtopfile --inpcrd inpcrdfile

The current scheme would effectively become yank ambersetup, while the new scheme (yank setup) would try to parameterize whatever you threw at it (PDB, RCSB ID, mol2, SDF, IUPAC, SMILES, etc.).

jchodera commented 10 years ago

Some further refinements to this idea:

To set up from AMBER LEaP files:

yank import amber --ligand_prmtop ligand.prmtop --receptor_prmtop receptor.prmtop --complex_prmtop complex.prmtop --complex_crd complex.crd [--receptor_crd receptor.crd --ligand_crd ligand.crd]

Other yank import <filetype> forms could be added later to allow import from CHARMM, GROMACS, etc., as OpenMM adds support for them.

jchodera commented 10 years ago

To initialize YANK simulations using the OpenMM app and gaff2xml to build things, I think we want syntax like that in the following use cases:

Set up a YANK calculation given a PDB file and a separate mol2 file containing several ligands, using default parameters and implicit solvent:

# Set up protein (PDB) and ligands (mol2) in implicit solvent
yank setup --receptor receptor.pdb --receptor_forcefield ffxml:ff99sb --ligand ligands.mol2 --ligand_forcefield gaff2xml:gaff:bcc --implicit --destdir complexes/

There's actually a lot of data we might want to cram in there: which parameterization tool to use (ffxml vs. gaff2xml), and which forcefield or charge scheme to use with it (ff99sb.xml, gaff.dat, bcc charges). I do wonder if we really need some sort of dict-like way to specify parameters, like a JSON or XML format for setting things up.
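
As a rough illustration of how much is packed into those colon-separated strings, here is a minimal sketch that unpacks one into a dict. The field names (tool, forcefield, charges) and the function name are my assumptions about what the three positions mean; nothing here is existing YANK code.

```python
import json

def parse_forcefield_spec(spec):
    """Split a spec such as 'gaff2xml:gaff:bcc' into named fields (names assumed)."""
    parts = spec.split(":")
    if not 2 <= len(parts) <= 3 or not all(parts):
        raise ValueError(f"expected 'tool:forcefield[:charges]', got {spec!r}")
    return {
        "tool": parts[0],          # e.g. ffxml or gaff2xml
        "forcefield": parts[1],    # e.g. ff99sb or gaff
        "charges": parts[2] if len(parts) == 3 else None,  # e.g. bcc
    }

# The same information written as the dict-like structure discussed above.
setup = {
    "receptor_forcefield": parse_forcefield_spec("ffxml:ff99sb"),
    "ligand_forcefield": parse_forcefield_spec("gaff2xml:gaff:bcc"),
}
print(json.dumps(setup, indent=2))
```

Once the fields are named, the JSON form is arguably easier to read and extend than the packed command-line string.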

Set up a host-guest calculation, where both files are mol2 files:

# Set up host (mol2) and guests (mol2) in explicit solvent with 10 A buffer region
yank setup --receptor host.mol2 --receptor_forcefield gaff2xml:gaff:bcc --ligand guests.mol2 --ligand_forcefield gaff2xml:gaff:bcc --explicit "10*angstrom" --destdir host-guest/

Set up some of Ouathek's ideas to model complexes in implicit solvent:

# Set up receptor (PDB) with some ChemDraw sketches
yank setup --receptor pdbid:3QCY --receptor_forcefield ffxml:ff99sb --ligand ideas.cdx --ligand_forcefield gaff2xml:gaff:bcc --implicit --destdir ouathek-ideas/
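
The examples above pass file names and prefixed identifiers such as pdbid:3QCY through the same flags. A minimal sketch of how a driver might tell them apart is below; the extension table and the returned labels are assumptions for illustration only.

```python
import os

# Hypothetical mapping from file extension to input kind.
KNOWN_EXTENSIONS = {".pdb": "pdb", ".mol2": "mol2", ".sdf": "sdf", ".cdx": "chemdraw"}

def classify_input(spec):
    """Return (kind, value) for a --receptor or --ligand argument."""
    if spec.lower().startswith("pdbid:"):
        # Prefixed identifier: keep only the part after the first colon.
        return ("pdbid", spec.split(":", 1)[1])
    extension = os.path.splitext(spec)[1].lower()
    if extension in KNOWN_EXTENSIONS:
        return (KNOWN_EXTENSIONS[extension], spec)
    raise ValueError(f"cannot tell what kind of input {spec!r} is")

for argument in ("pdbid:3QCY", "receptor.pdb", "guests.mol2", "ideas.cdx"):
    print(argument, "->", classify_input(argument))
```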

Just some thoughts. I think this needs further refinement.

jchodera commented 10 years ago

Maybe we should start collecting use cases on a wiki page?

kyleabeauchamp commented 10 years ago

We can discuss here, right? It's probably good to keep pinging people periodically; otherwise they won't see the discussion.

jchodera commented 10 years ago

Sure, it's good to discuss here, but we can also compile (cut and paste) into the wiki once we have an idea of what real use cases are like.

kyleabeauchamp commented 10 years ago

yes

Lnaden commented 10 years ago

I think we should be cautious about requiring too many command line flags that all must be specified at once. Too many and it will be very annoying to users, especially if a typo is made.

If we want to accept a large number of input types, we should probably come up with a single input file that can either be written on its own (like an XML file), or at least have yank setup write to a common file so it can be copied, edited, and loaded as needed. This way a user could run commands either all at once or in fragments, which makes the setup easier to maintain.

For instance, if one wants to set up the PDB and mol2 with several ligands, they would run

# Pull in receptor information
yank setup --receptor receptor.pdb --receptor_forcefield ffxml:ff99sb 
# Pull in ligand information
yank setup --ligand ligands.mol2 --ligand_forcefield gaff2xml:gaff:bcc 
# Set other flags
yank setup --implicit --destdir complexes/

and all of this would write to a single, portable XML file. They could also run all of this in one line. Then if they want to run the same simulation but change the ligand forcefield, they would just rerun:

yank setup --ligand_forcefield ffxml:ff99sb

targeting the new forcefield file and changing the entry in the XML file.

I think this would make it easier for users to create and change simulations, and it would also give a common file which could be passed to others wanting to repeat or slightly tweak a yank simulation. One drawback is that yank run would need to validate that the XML is complete and/or fill in defaults for missing keys.
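
A minimal sketch of that incremental workflow, using the standard library's xml.etree.ElementTree, is below. The element names, the required keys, and the default value are all assumptions made up for the example; the point is only that each call merges into the same file and that loading validates and fills in defaults.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical keys and defaults, chosen only for illustration.
REQUIRED_KEYS = ("receptor", "ligand")
DEFAULTS = {"solvent": "implicit"}

def update_setup_file(path, **options):
    """Merge options into the XML file at path, creating the file if needed."""
    if os.path.exists(path):
        tree = ET.parse(path)
        root = tree.getroot()
    else:
        root = ET.Element("yank_setup")
        tree = ET.ElementTree(root)
    for key, value in options.items():
        node = root.find(key)
        if node is None:
            node = ET.SubElement(root, key)
        node.text = str(value)  # a later call overwrites an earlier entry
    tree.write(path)

def load_setup_file(path):
    """What 'yank run' would do: fill in defaults, then check completeness."""
    settings = dict(DEFAULTS)
    settings.update({child.tag: child.text for child in ET.parse(path).getroot()})
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("setup file is missing: " + ", ".join(missing))
    return settings

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "setup.xml")
    update_setup_file(path, receptor="receptor.pdb", receptor_forcefield="ffxml:ff99sb")
    update_setup_file(path, ligand="ligands.mol2", ligand_forcefield="gaff2xml:gaff:bcc")
    update_setup_file(path, ligand_forcefield="ffxml:ff99sb")  # change one entry later
    settings = load_setup_file(path)
    print(settings)
```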

danielparton commented 10 years ago

I also like the idea of using a single editable input file, with command-line flags for certain common options. I suggest command-line flags would take priority if the same fields are specified in the input file (and a note could be printed by the code to indicate this behavior to the user).
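
The precedence rule could look roughly like the sketch below. The function name and the convention that an unset flag arrives as None are assumptions for the example.

```python
def apply_overrides(file_settings, cli_settings):
    """Return the merged settings plus notes for every value the command line replaced."""
    merged = dict(file_settings)
    notes = []
    for key, value in cli_settings.items():
        if value is None:
            continue  # flag not given on the command line; keep the file's value
        if key in merged and merged[key] != value:
            notes.append(
                f"Note: command-line value {key}={value!r} overrides "
                f"{merged[key]!r} from the input file"
            )
        merged[key] = value
    return merged, notes

from_file = {"ligand_forcefield": "gaff2xml:gaff:bcc", "iterations": 1000}
from_cli = {"ligand_forcefield": None, "iterations": 2000}
merged, notes = apply_overrides(from_file, from_cli)
for note in notes:
    print(note)
```

Returning the notes instead of printing inside the function leaves it to the driver to decide where the message goes.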

The other advantage here is that the input file can be referenced by the user (or another user) at a later point.

I find YAML is quite a nice format for user-editable files (http://www.yaml.org/start.html). I definitely prefer it to editing XML, and the syntax is pretty intuitive.

Another option is to use a Python file (e.g. "yank_project_config.py") which can then be imported directly by Yank as a module.

kyleabeauchamp commented 10 years ago

I agree with Danny

jchodera commented 10 years ago

I like the idea of a file too, with command line overrides. If we used JSON (or YAML) and only permitted a few arguments to be overridden (e.g., the number of iterations), then the driver would not be too complex.

On the other hand, we could switch gears and just focus on the Python interfaces right now and have each setup script for a particular application actually be a Python script. That would be harder for novice users, but would give us maximum flexibility for what we need to do now without being locked into the effort of worrying about file and command line parsing...

kyleabeauchamp commented 10 years ago

> On the other hand, we could switch gears and just focus on the Python interfaces right now and have each setup script for a particular application actually be a Python script.

I strongly agree with this idea--i.e. design the "objects" first and let the command line follow.

danielparton commented 10 years ago

+1

jchodera commented 10 years ago

OK, let's switch gears and sketch out the interface to tackle the kinds of use cases we want to deal with.

Some example use cases (an expanded version of the above):

- AMBER inpcrd files for complex, ligand, and receptor, along with a (possibly multi-model) PDB file or inpcrd file for the complex
- PDB file for the receptor (a protein) to be parameterized with an ffxml file, with multiple choices for the ligand to be parameterized with gaff2xml, potentially including generating/expanding conformers or a molecular topology:
  - PDB file with one or more specified ligands (potentially selected out by residue names)
  - mol2 file with one or more ligands
  - SDF file with one or more ligands
  - text file with list of one or more IUPAC names
  - text file with one or more SMILES strings

Much of the file processing could be driver-specific. We essentially need to focus on exactly what kinds of data will go into the YANK-provided classes.
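
As one possible starting point, here is a minimal sketch of the split between a driver and the data it hands over, for the two text-file cases in the list above. LigandEntry and read_text_list are hypothetical names, not YANK classes, and the one-entry-per-line layout is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LigandEntry:
    identifier: str                 # a SMILES string, IUPAC name, or file path
    kind: str                       # "smiles", "iupac", "mol2", "sdf", or "pdb"
    resname: Optional[str] = None   # only meaningful for ligands picked out of a PDB file

def read_text_list(lines, kind) -> List[LigandEntry]:
    """Driver for the text-file cases: one entry per non-blank line (assumed layout)."""
    entries = []
    for line in lines:
        text = line.strip()
        if text:
            entries.append(LigandEntry(identifier=text, kind=kind))
    return entries

# Any iterable of lines works, so an open file could be passed in the same way.
ligands = read_text_list(["c1ccccc1\n", "\n", "CCO\n"], kind="smiles")
for entry in ligands:
    print(entry)
```

The driver only produces plain records; how each record gets parameterized would be decided by whatever consumes them.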

andrrizzi commented 8 years ago

All of this is implemented.

choderalab / yank

How should we feed input to YANK? #42
