Open davidlmobley opened 9 years ago
Relatedly, I need to kick off a larger discussion with Hannes Loeffler and Julien Michel about FESetup and how we can complement efforts and avoid duplication. Hannes has just become aware of ParmEd so that may help with some aspects of it.
@swails may be interested in this as well.
What immediately appeals to me is to use some toolkit (AMBER tleap or GROMACS tools) to get the protein and ligand into separate parameter files (tleap for the ligand) and then ParmEd to combine them.
What's the advantage of this approach over purely AMBER or purely GROMACS? tleap can combine stuff just fine, and then ParmEd can split out the pieces to a Gromacs topology file later. The hybrid approach you described should work, it just seems overly complex to me (apologies if I'm missing some critical detail here).
Another option could be to use gaff2xml with OpenMM's modeller tool. This generates an OpenMM System
object which you can feed to ParmEd to create a Structure
. But it's less flexible than doing tleap or GROMACS setup -> ParmEd Structure
.
I can use gromacs solvation tools to do it, but again this feels archaic. I did notice that openmm has in app a modeller functionality which has addSolvent functionality, but I am new to this ecosystem so I'm not sure if that would be a good way to do it.
Check out PDBFixer (https://github.com/pandegroup/PDBFixer). I think this could satisfy a large number of your requirements. You can then take the resulting structure, write a PDB, and generate parameters using either tleap or GROMACS (or OpenMM) and load those into ParmEd to be passed around as needed.
Or should we build on what we've already done with packmol for pure solvents/mixtures and just use packmol as a general tool for solvation?
I think packmol could be a good way to go. We can wrap for now and in the long-term see if the packmol developers would be interested in collaborating on building a Python API that could be hooked into the conda ecosystem nicely. I'm personally anti-reinventing-the-wheel (hence ParmEd's main goal), so if we can get the packmol devs excited/interested (or maybe just not resistant) to the idea, I think everyone wins.
What code base/object type? Should we be working within ParmEd or OpenMM? If OpenMM, will we be able to get prepared systems out and into some other format if we need? If not, that will impact our user base. If ParmEd or other, can we get prepared systems INTO OpenMM if needed? Will that bypass xml format forcefield stuff? What are the implications of this?
I'm obviously biased here, but I think ParmEd fits the bill. I think the data structures are easier to work with for the types of things you'd commonly want to do, and it provides hooks into all the other programs you'd want to support. Sire is another option, but I think ParmEd may be more hackable (and more suited to doing system/parameter manipulation).
Anything that generates a topology with parameters in ParmEd can immediately be used to generate an OpenMM system. It can also read topology and parameter info from OpenMM and translate to other packages (like an Amber prmtop or Gromacs topology), but that way is a lot less flexible (e.g., once you have NBFIX terms, ParmEd gives up).
What longer-term needs do we have for this workflow and what's the game plan for addressing them? Specifically, there are the usual problems with missing residues/loops in proteins, protonation states of protein residues, etc.
PDBFixer has some functionality for modeling missing loops, but it's a far cry from homology modeling. I'm not aware of any Python-interfacing homology modeling toolkit out there, but I also haven't looked hard.
Relatedly, I need to kick off a larger discussion with Hannes Loeffler and Julien Michel about FESetup and how we can complement efforts and avoid duplication. Hannes has just become aware of ParmEd so that may help with some aspects of it.
I've talked with Hannes a little bit, and he uses ParmEd for specialized tasks in a couple places (he started using it a year ago or so if memory serves: http://archive.ambermd.org/201403/0471.html), but it was a far cry from the generalized tool it is now (I think it only supported Amber files then, and didn't even use the same object model).
@swails - a few comments:
What's the advantage of this approach over purely AMBER or purely GROMACS? tleap can combine stuff just fine, and then ParmEd can split out the pieces to a Gromacs topology file later. The hybrid approach you described should work, it just seems overly complex to me (apologies if I'm missing some critical detail here).
In terms of pure GROMACS - GROMACS setup has very little support for ligands. So we normally use tleap
. However, there are several (probably mostly minor) issues with this. One is that, as part of a Python workflow, one gets stuck writing tleap scripts within python, which seems a little silly/frustrating. We might write a really simple Python function that uses tleap to just set up a protein plus a single ligand, but then someone will say, "OK, but can you add an option so I can use a different protein forcefield?" or "what if my protein has two copies of the ligand bound simultaneously in different positions, can you generalize to handle that?" and it ends up feeling like we're trying to write an API to tleap
in a cumbersome way. I suppose that can be handled by parameterizing the components, putting them into ParmEd and then combining...
But more importantly, what if I want to use a CHARMM or OPLS forcefield? Let's assume for a moment that I can get the ligand parameters somehow (which is trickier). But if I do, how do I proceed? Ideally I'd like to be able to use the same workflow irrespective of forcefield - take a particular protein, predict protonation states etc, build in missing residues, parameterize protein+ligand, construct combined system, solvate, add ions, etc. Choice of forcefield ought to be just an argument somewhere, not something which dictates a particular choice of tools and requires learning a new toolchain.
On most of the rest, I'll mostly wait for the other guys to chime in. Though I should mention that Modeller (UCSF) has a Python interface, and @jchodera's group has been working on a tool called Ensembler which covers some aspects of this. https://github.com/choderalab/ensembler.
I definitely agree with you that I don't want to reinvent the wheel, so if I'm missing something and tleap
(or other tools) are the ideal solution here and nothing new is needed, I'll be delighted.
Good discussion. Minor point: I was looking at the packmol code a while ago. I think it would be pretty straightforward to wrap, either with f2py or cython, if anyone wanted to.
Also, I think some of the packmol functionality has overlap with @ctk3b 's work at https://github.com/iModels/mbuild, so there could be some value in giving some of those algorithms a native python API. On Jun 11, 2015 12:24 PM, "David L. Mobley" notifications@github.com wrote:
@swails https://github.com/swails - a few comments:
What's the advantage of this approach over purely AMBER or purely GROMACS? tleap can combine stuff just fine, and then ParmEd can split out the pieces to a Gromacs topology file later. The hybrid approach you described should work, it just seems overly complex to me (apologies if I'm missing some critical detail here).
In terms of pure GROMACS - GROMACS setup has very little support for ligands. So we normally use tleap. However, there are several (probably mostly minor) issues with this. One is that, as part of a Python workflow, one gets stuck writing tleap scripts within python, which seems a little silly/frustrating. We might write a really simple Python function that uses tleap to just set up a protein plus a single ligand, but then someone will say, "OK, but can you add an option so I can use a different protein forcefield?" or "what if my protein has two copies of the ligand bound simultaneously in different positions, can you generalize to handle that?" and it ends up feeling like we're trying to write an API to tleap in a cumbersome way. I suppose that can be handled by parameterizing the components, putting them into ParmEd and then combining...
But more importantly, what if I want to use a CHARMM or OPLS forcefield? Let's assume for a moment that I can get the ligand parameters somehow (which is trickier). But if I do, how do I proceed? Ideally I'd like to be able to use the same workflow irrespective of forcefield - take a particular protein, predict protonation states etc, build in missing residues, parameterize protein+ligand, construct combined system, solvate, add ions, etc. Choice of forcefield ought to be just an argument somewhere, not something which dictates a particular choice of tools and requires learning a new toolchain.
On most of the rest, I'll mostly wait for the other guys to chime in. Though I should mention that Modeller (UCSF) has a Python interface, and @jchodera https://github.com/jchodera's group has been working on a tool called Ensembler which covers some aspects of this. https://github.com/choderalab/ensembler.
I definitely agree with you that I don't want to reinvent the wheel, so if I'm missing something and tleap (or other tools) are the ideal solution here and nothing new is needed, I'll be delighted.
— Reply to this email directly or view it on GitHub https://github.com/choderalab/yank/issues/248#issuecomment-111252248.
Good discussion. Minor point: I was looking at the packmol code a while ago. I think it would be pretty straightforward to wrap, either with f2py or cython, if anyone wanted to.
I looked at that a bit ago as well. FTR, packmol is GPL, which would necessitate GPL-ing any code that depends on it.
Also, I think some of the packmol functionality has overlap with @ctk3b 's work at https://github.com/iModels/mbuild, so there could be some value in giving some of those algorithms a native python API.
I'll have a look (when I have time, of course).
I can't fully remember but I think the packmol wrapping in mbuild was inspired/partially adopted from something that Kyle Beauchamp had written in openmmtools or one of those packages.
In terms of pure GROMACS - GROMACS setup has very little support for ligands. So we normally use tleap. However, there are several (probably mostly minor) issues with this. One is that, as part of a Python workflow, one gets stuck writing tleap scripts within python, which seems a little silly/frustrating. We might write a really simple Python function that uses tleap to just set up a protein plus a single ligand, but then someone will say, "OK, but can you add an option so I can use a different protein forcefield?" or "what if my protein has two copies of the ligand bound simultaneously in different positions, can you generalize to handle that?" and it ends up feeling like we're trying to write an API to tleap in a cumbersome way. I suppose that can be handled by parameterizing the components, putting them into ParmEd and then combining...
This is something I've thought a lot about. One of my long-range goals for ParmEd is to create a parametrization tool (a bit like OpenMM's Modeler class) that can build a structure from any parameter set it reads (which can come from any file format that's supported). It would abstract away the details of file formats, and provide a single program capable of acting as a setup engine for any program (starting from the residue template and parameter files from any other program). I still want to rely on other tools (preferably through their Python APIs) for a lot of the modeling, like filling in missing residues, loops, parametrizing custom residues, solvating, etc. (I have to pick what to try and be good at).
But this isn't even on the radar yet, and probably won't happen unless I find a permanent position (probably academic) that will allow me to continue to devote time to ParmEd.
I think it is worthwhile brainstorming use cases that we'd eventually like to support, and then figure out what strategies get us the most coverage with the least pain, both initially and for the eventual evolution of the tool to cover all the use cases (and new ones not yet dreamed up).
Practically, the strategies will likely involve one of two tools in the short term: Either (1) the OpenMM ForceField
class, or (2) ParmEd
. They have very different advantages. ForceField
maximizes the ability to "mix and match" forcefields with systems (provided the forcefields have been converted to ffxml
), while ParmEd
supports reading native forcefield definition files but may not allow as much "mix and match" flexibility. In some sense, the choice may depend on how much we need to explore various forcefields.
Related questions for @swails:
ParmEd
assign parameters by atom name, which could cause problems for forcefields that did not utilize standard PDB atom naming conventions?ParmEd
the ability to generate OpenMM ffxml
files?Practically, the strategies will likely involve one of two tools in the short term: Either (1) the OpenMM ForceField class, or (2) ParmEd. They have very different advantages. ForceField maximizes the ability to "mix and match" forcefields with systems (provided the forcefields have been converted to ffxml), while ParmEd supports reading native forcefield definition files but may not allow as much "mix and match" flexibility. In some sense, the choice may depend on how much we need to explore various forcefields.
A couple comments here. I don't really agree that ForceField maximizes the ability to "mix-and-match" force fields. Unless I'm mistaken, you need to actually create the mixed-and-matched XML files somehow, and therein lies all of the work. Peter has scripts that do it for some CHARMM force field files; some Amber force field files; some Tinker force field files (he hasn't written all of those scripts). But they're not what I would call inter-convertible, and they don't play well with one another (because they're not meant to). Or are the ForceField/modeling classes way more flexible than I think they are?
IMO, ParmEd is better-equipped to mix-and-match. Parametrize each "piece" as a separate Structure
object, then combine them using the addition operator. That's all there is to it. For now you have to go through other utilities to actually assign topology and parameters (you can use tleap, gromacs, or even OpenMM's ForceField), and ParmEd will read the Amber prmtop, Gromacs top, CHARMM PSF/par/rtf/str, or OpenMM Topology/System and create a Structure
object that you can combine with one another and translate between the different file formats.
For example, parametrize the protein with OpenMM and the ligand with antechamber/parmchk/tleap/gaff, read them both into ParmEd, combine them with "+", and you're done.
Does ParmEd assign parameters by atom name, which could cause problems for forcefields that did not utilize standard PDB atom naming conventions?
ParmEd does not assign parameters yet in that sense. The only parameter assignment it does is for CHARMM and Gromacs PSF/top files, and it does that based on the atom types defined in the PSF or top and matching it to the same atom type name defined in the CHARMM PAR or Gromacs ITP files... exactly the same way that CHARMM or Gromacs itself would do.
The way I would have ParmEd assign parameters when it can actually do what CHARMM GUI/pdb2gmx/tleap does is the same way those programs do it -- you need to have residue templates that assign atom types matching to the parameter files. There's really no other way I can think of to do that.
Had you considered giving ParmEd the ability to generate OpenMM ffxml files?
It's on the (very long) to-do list. I want ParmEd to be able to read and write them. Prerequisite functionality needed to support that:
tinker
package to use the core classes (rather than their own versions)Once that is done, you should be able to convert any program's parameter files to any others (including ffxml). There are other aspects of ParmEd that need work as well (top of the list is coordinate handling, I think), so it will be awhile before I can get to this.
@jchodera - can you expand on what you mean by this?
Practically, the strategies will likely involve one of two tools in the short term: Either (1) the OpenMM ForceField class, or (2) ParmEd. They have very different advantages. ForceField maximizes the ability to "mix and match" forcefields with systems (provided the forcefields have been converted to ffxml), while ParmEd supports reading native forcefield definition files but may not allow as much "mix and match" flexibility. In some sense, the choice may depend on how much we need to explore various forcefields.
Having not used this aspect yet, I don't see how it would improve mixing and matching (nor am I sure I totally know what you mean by that). Naively it seems like with ParmEd I already have that already - if I set up a protein with AMBER forcefield XX and a nucleic acid with AMBER forcefield YY and a ligand with GAFF and parameterize a cofactor by hand, I can just read the four into ParmEd and combine them using the + operator. OK, yes, I still have to deal with the parameterization of the individual objects with whatever tools... But once that's done, combining allows mixing and matching it seems, and then I can easily go straight into OpenMM by making a System, or dump to GROMACS or AMBER format. How does the ForceField class allow more mixing and matching or make it easier?
@jchodera --
I think it is worthwhile brainstorming use cases that we'd eventually like to support, and then figure out what strategies get us the most coverage with the least pain, both initially and for the eventual evolution of the tool to cover all the use cases (and new ones not yet dreamed up).
But in terms of use cases, here are ones I immediately have in mind:
tleap
before composing the final system
1b. Same as 1 but solvate in arbitrary mixture of solvents/buffers*
For our purposes here i'm willing to assume I already have a complete macromolecule with no missing residues and a protonation state I'm happy with, etc., and likewise for the ligand.
Probably there are others I'm not thinking of right now. What is it you would need?
@andrrizzi has part of this done, but there are still things we plan to address.
Plenty to do yet, plenty of prep is in there now from the YAML building, but there is more that will need to be done going forward.
1.0 touch
Hi everyone, just commenting to see if there is any progress to build this set up? I need a link this workflow.
Related to #247 - currently Yank has systembuilder, which probably ought to be split off into a more general-purpose tool that will be useful aside from just for Yank workflows (i.e. setup of GROMACS or AMBER simulations as well).
At this point, I have in mind particularly the workflow for setting up protein+ligand simulations given "well prepared" structures of both protein + ligand. I think independently or together we've got fairly good workflows (or at least workflows which are under development) for fixing broken proteins/building the proteins we want, and for setting up ligands (including after docking them, if needed). To some degree I see the major hurdle as linking these parts of the process up (combining the protein+ligand and assigning final parameters). More generally we could think of the workflow as all-encompassing (protein setup + ligand placement/parameterization, etc.) but that's a larger problem.
Some things I think we need to address in the short-term are:
addSolvent
functionality, but I am new to this ecosystem so I'm not sure if that would be a good way to do it. Also - what if I want to solvate my protein+ligand in non-water (i.e. water+DMSO, or water+TFE, or...)? Or should we build on what we've already done with packmol for pure solvents/mixtures and just use packmol as a general tool for solvation?Here I'm hoping we'll at least be able to address 1 and 2.