choderalab / yank

An open, extensible Python framework for GPU-accelerated alchemical free energy calculations.
http://getyank.org
MIT License
177 stars 70 forks source link

Plan general setup workflow for protein+ligand simulations #248

Open davidlmobley opened 9 years ago

davidlmobley commented 9 years ago

Related to #247 - currently Yank has systembuilder, which probably ought to be split off into a more general-purpose tool that will be useful aside from just for Yank workflows (i.e. setup of GROMACS or AMBER simulations as well).

At this point, I have in mind particularly the workflow for setting up protein+ligand simulations given "well prepared" structures of both protein + ligand. I think independently or together we've got fairly good workflows (or at least workflows which are under development) for fixing broken proteins/building the proteins we want, and for setting up ligands (including after docking them, if needed). To some degree I see the major hurdle as linking these parts of the process up (combining the protein+ligand and assigning final parameters). More generally we could think of the workflow as all-encompassing (protein setup + ligand placement/parameterization, etc.) but that's a larger problem.

Some things I think we need to address in the short-term are:

  1. What would this workflow do (in the short term)? The thing I most immediately need is "if I have a clean, prepared protein PDB X, and a clean, charged GAFF ligand Y (with frcmod and mol2 files), how should I combine and solvate these?" AMBER's tleap is the normal way, but simply writing wrappers for tleap is starting to feel archaic. What immediately appeals to me is to use some toolkit (AMBER tleap or GROMACS tools) to get the protein and ligand into separate parameter files (tleap for the ligand) and then ParmEd to combine them. It doesn't handle solvation, however, so how should that be done? I can use gromacs solvation tools to do it, but again this feels archaic. I did notice that openmm has in app a modeller functionality which has addSolvent functionality, but I am new to this ecosystem so I'm not sure if that would be a good way to do it. Also - what if I want to solvate my protein+ligand in non-water (i.e. water+DMSO, or water+TFE, or...)? Or should we build on what we've already done with packmol for pure solvents/mixtures and just use packmol as a general tool for solvation?
  2. What code base/object type? Should we be working within ParmEd or OpenMM? If OpenMM, will we be able to get prepared systems out and into some other format if we need? If not, that will impact our user base. If ParmEd or other, can we get prepared systems INTO OpenMM if needed? Will that bypass xml format forcefield stuff? What are the implications of this?
  3. What longer-term needs do we have for this workflow and what's the game plan for addressing them? Specifically, there are the usual problems with missing residues/loops in proteins, protonation states of protein residues, etc. On the ligand side, we might also be interested in general strategies for generating starting placements of ligands when crystal structures aren't available (OEDocking toolkit?) etc.

Here I'm hoping we'll at least be able to address 1 and 2.

davidlmobley commented 9 years ago

Relatedly, I need to kick off a larger discussion with Hannes Loeffler and Julien Michel about FESetup and how we can complement efforts and avoid duplication. Hannes has just become aware of ParmEd so that may help with some aspects of it.

davidlmobley commented 9 years ago

@swails may be interested in this as well.

swails commented 9 years ago

What immediately appeals to me is to use some toolkit (AMBER tleap or GROMACS tools) to get the protein and ligand into separate parameter files (tleap for the ligand) and then ParmEd to combine them.

What's the advantage of this approach over purely AMBER or purely GROMACS? tleap can combine stuff just fine, and then ParmEd can split out the pieces to a Gromacs topology file later. The hybrid approach you described should work, it just seems overly complex to me (apologies if I'm missing some critical detail here).

Another option could be to use gaff2xml with OpenMM's modeller tool. This generates an OpenMM System object which you can feed to ParmEd to create a Structure. But it's less flexible than doing tleap or GROMACS setup -> ParmEd Structure.

I can use gromacs solvation tools to do it, but again this feels archaic. I did notice that openmm has in app a modeller functionality which has addSolvent functionality, but I am new to this ecosystem so I'm not sure if that would be a good way to do it.

Check out PDBFixer (https://github.com/pandegroup/PDBFixer). I think this could satisfy a large number of your requirements. You can then take the resulting structure, write a PDB, and generate parameters using either tleap or GROMACS (or OpenMM) and load those into ParmEd to be passed around as needed.

Or should we build on what we've already done with packmol for pure solvents/mixtures and just use packmol as a general tool for solvation?

I think packmol could be a good way to go. We can wrap for now and in the long-term see if the packmol developers would be interested in collaborating on building a Python API that could be hooked into the conda ecosystem nicely. I'm personally anti-reinventing-the-wheel (hence ParmEd's main goal), so if we can get the packmol devs excited/interested (or maybe just not resistant) to the idea, I think everyone wins.

What code base/object type? Should we be working within ParmEd or OpenMM? If OpenMM, will we be able to get prepared systems out and into some other format if we need? If not, that will impact our user base. If ParmEd or other, can we get prepared systems INTO OpenMM if needed? Will that bypass xml format forcefield stuff? What are the implications of this?

I'm obviously biased here, but I think ParmEd fits the bill. I think the data structures are easier to work with for the types of things you'd commonly want to do, and it provides hooks into all the other programs you'd want to support. Sire is another option, but I think ParmEd may be more hackable (and more suited to doing system/parameter manipulation).

Anything that generates a topology with parameters in ParmEd can immediately be used to generate an OpenMM system. It can also read topology and parameter info from OpenMM and translate to other packages (like an Amber prmtop or Gromacs topology), but that way is a lot less flexible (e.g., once you have NBFIX terms, ParmEd gives up).

What longer-term needs do we have for this workflow and what's the game plan for addressing them? Specifically, there are the usual problems with missing residues/loops in proteins, protonation states of protein residues, etc.

PDBFixer has some functionality for modeling missing loops, but it's a far cry from homology modeling. I'm not aware of any Python-interfacing homology modeling toolkit out there, but I also haven't looked hard.

Relatedly, I need to kick off a larger discussion with Hannes Loeffler and Julien Michel about FESetup and how we can complement efforts and avoid duplication. Hannes has just become aware of ParmEd so that may help with some aspects of it.

I've talked with Hannes a little bit, and he uses ParmEd for specialized tasks in a couple places (he started using it a year ago or so if memory serves: http://archive.ambermd.org/201403/0471.html), but it was a far cry from the generalized tool it is now (I think it only supported Amber files then, and didn't even use the same object model).

davidlmobley commented 9 years ago

@swails - a few comments:

What's the advantage of this approach over purely AMBER or purely GROMACS? tleap can combine stuff just fine, and then ParmEd can split out the pieces to a Gromacs topology file later. The hybrid approach you described should work, it just seems overly complex to me (apologies if I'm missing some critical detail here).

In terms of pure GROMACS - GROMACS setup has very little support for ligands. So we normally use tleap. However, there are several (probably mostly minor) issues with this. One is that, as part of a Python workflow, one gets stuck writing tleap scripts within python, which seems a little silly/frustrating. We might write a really simple Python function that uses tleap to just set up a protein plus a single ligand, but then someone will say, "OK, but can you add an option so I can use a different protein forcefield?" or "what if my protein has two copies of the ligand bound simultaneously in different positions, can you generalize to handle that?" and it ends up feeling like we're trying to write an API to tleap in a cumbersome way. I suppose that can be handled by parameterizing the components, putting them into ParmEd and then combining...

But more importantly, what if I want to use a CHARMM or OPLS forcefield? Let's assume for a moment that I can get the ligand parameters somehow (which is trickier). But if I do, how do I proceed? Ideally I'd like to be able to use the same workflow irrespective of forcefield - take a particular protein, predict protonation states etc, build in missing residues, parameterize protein+ligand, construct combined system, solvate, add ions, etc. Choice of forcefield ought to be just an argument somewhere, not something which dictates a particular choice of tools and requires learning a new toolchain.

On most of the rest, I'll mostly wait for the other guys to chime in. Though I should mention that Modeller (UCSF) has a Python interface, and @jchodera's group has been working on a tool called Ensembler which covers some aspects of this. https://github.com/choderalab/ensembler.

I definitely agree with you that I don't want to reinvent the wheel, so if I'm missing something and tleap (or other tools) are the ideal solution here and nothing new is needed, I'll be delighted.

rmcgibbo commented 9 years ago

Good discussion. Minor point: I was looking at the packmol code a while ago. I think it would be pretty straightforward to wrap, either with f2py or cython, if anyone wanted to.

Also, I think some of the packmol functionality has overlap with @ctk3b 's work at https://github.com/iModels/mbuild, so there could be some value in giving some of those algorithms a native python API. On Jun 11, 2015 12:24 PM, "David L. Mobley" notifications@github.com wrote:

@swails https://github.com/swails - a few comments:

What's the advantage of this approach over purely AMBER or purely GROMACS? tleap can combine stuff just fine, and then ParmEd can split out the pieces to a Gromacs topology file later. The hybrid approach you described should work, it just seems overly complex to me (apologies if I'm missing some critical detail here).

In terms of pure GROMACS - GROMACS setup has very little support for ligands. So we normally use tleap. However, there are several (probably mostly minor) issues with this. One is that, as part of a Python workflow, one gets stuck writing tleap scripts within python, which seems a little silly/frustrating. We might write a really simple Python function that uses tleap to just set up a protein plus a single ligand, but then someone will say, "OK, but can you add an option so I can use a different protein forcefield?" or "what if my protein has two copies of the ligand bound simultaneously in different positions, can you generalize to handle that?" and it ends up feeling like we're trying to write an API to tleap in a cumbersome way. I suppose that can be handled by parameterizing the components, putting them into ParmEd and then combining...

But more importantly, what if I want to use a CHARMM or OPLS forcefield? Let's assume for a moment that I can get the ligand parameters somehow (which is trickier). But if I do, how do I proceed? Ideally I'd like to be able to use the same workflow irrespective of forcefield - take a particular protein, predict protonation states etc, build in missing residues, parameterize protein+ligand, construct combined system, solvate, add ions, etc. Choice of forcefield ought to be just an argument somewhere, not something which dictates a particular choice of tools and requires learning a new toolchain.

On most of the rest, I'll mostly wait for the other guys to chime in. Though I should mention that Modeller (UCSF) has a Python interface, and @jchodera https://github.com/jchodera's group has been working on a tool called Ensembler which covers some aspects of this. https://github.com/choderalab/ensembler.

I definitely agree with you that I don't want to reinvent the wheel, so if I'm missing something and tleap (or other tools) are the ideal solution here and nothing new is needed, I'll be delighted.

— Reply to this email directly or view it on GitHub https://github.com/choderalab/yank/issues/248#issuecomment-111252248.

swails commented 9 years ago

Good discussion. Minor point: I was looking at the packmol code a while ago. I think it would be pretty straightforward to wrap, either with f2py or cython, if anyone wanted to.

I looked at that a bit ago as well. FTR, packmol is GPL, which would necessitate GPL-ing any code that depends on it.

Also, I think some of the packmol functionality has overlap with @ctk3b 's work at https://github.com/iModels/mbuild, so there could be some value in giving some of those algorithms a native python API.

I'll have a look (when I have time, of course).

ctk3b commented 9 years ago

I can't fully remember but I think the packmol wrapping in mbuild was inspired/partially adopted from something that Kyle Beauchamp had written in openmmtools or one of those packages.

ctk3b commented 9 years ago

https://github.com/choderalab/openmoltools/blob/master/openmoltools/packmol.py

swails commented 9 years ago

In terms of pure GROMACS - GROMACS setup has very little support for ligands. So we normally use tleap. However, there are several (probably mostly minor) issues with this. One is that, as part of a Python workflow, one gets stuck writing tleap scripts within python, which seems a little silly/frustrating. We might write a really simple Python function that uses tleap to just set up a protein plus a single ligand, but then someone will say, "OK, but can you add an option so I can use a different protein forcefield?" or "what if my protein has two copies of the ligand bound simultaneously in different positions, can you generalize to handle that?" and it ends up feeling like we're trying to write an API to tleap in a cumbersome way. I suppose that can be handled by parameterizing the components, putting them into ParmEd and then combining...

This is something I've thought a lot about. One of my long-range goals for ParmEd is to create a parametrization tool (a bit like OpenMM's Modeler class) that can build a structure from any parameter set it reads (which can come from any file format that's supported). It would abstract away the details of file formats, and provide a single program capable of acting as a setup engine for any program (starting from the residue template and parameter files from any other program). I still want to rely on other tools (preferably through their Python APIs) for a lot of the modeling, like filling in missing residues, loops, parametrizing custom residues, solvating, etc. (I have to pick what to try and be good at).

But this isn't even on the radar yet, and probably won't happen unless I find a permanent position (probably academic) that will allow me to continue to devote time to ParmEd.

jchodera commented 9 years ago

I think it is worthwhile brainstorming use cases that we'd eventually like to support, and then figure out what strategies get us the most coverage with the least pain, both initially and for the eventual evolution of the tool to cover all the use cases (and new ones not yet dreamed up).

Practically, the strategies will likely involve one of two tools in the short term: Either (1) the OpenMM ForceField class, or (2) ParmEd. They have very different advantages. ForceField maximizes the ability to "mix and match" forcefields with systems (provided the forcefields have been converted to ffxml), while ParmEd supports reading native forcefield definition files but may not allow as much "mix and match" flexibility. In some sense, the choice may depend on how much we need to explore various forcefields.

Related questions for @swails:

swails commented 9 years ago

Practically, the strategies will likely involve one of two tools in the short term: Either (1) the OpenMM ForceField class, or (2) ParmEd. They have very different advantages. ForceField maximizes the ability to "mix and match" forcefields with systems (provided the forcefields have been converted to ffxml), while ParmEd supports reading native forcefield definition files but may not allow as much "mix and match" flexibility. In some sense, the choice may depend on how much we need to explore various forcefields.

A couple comments here. I don't really agree that ForceField maximizes the ability to "mix-and-match" force fields. Unless I'm mistaken, you need to actually create the mixed-and-matched XML files somehow, and therein lies all of the work. Peter has scripts that do it for some CHARMM force field files; some Amber force field files; some Tinker force field files (he hasn't written all of those scripts). But they're not what I would call inter-convertible, and they don't play well with one another (because they're not meant to). Or are the ForceField/modeling classes way more flexible than I think they are?

IMO, ParmEd is better-equipped to mix-and-match. Parametrize each "piece" as a separate Structure object, then combine them using the addition operator. That's all there is to it. For now you have to go through other utilities to actually assign topology and parameters (you can use tleap, gromacs, or even OpenMM's ForceField), and ParmEd will read the Amber prmtop, Gromacs top, CHARMM PSF/par/rtf/str, or OpenMM Topology/System and create a Structure object that you can combine with one another and translate between the different file formats.

For example, parametrize the protein with OpenMM and the ligand with antechamber/parmchk/tleap/gaff, read them both into ParmEd, combine them with "+", and you're done.

Does ParmEd assign parameters by atom name, which could cause problems for forcefields that did not utilize standard PDB atom naming conventions?

ParmEd does not assign parameters yet in that sense. The only parameter assignment it does is for CHARMM and Gromacs PSF/top files, and it does that based on the atom types defined in the PSF or top and matching it to the same atom type name defined in the CHARMM PAR or Gromacs ITP files... exactly the same way that CHARMM or Gromacs itself would do.

The way I would have ParmEd assign parameters when it can actually do what CHARMM GUI/pdb2gmx/tleap does is the same way those programs do it -- you need to have residue templates that assign atom types matching to the parameter files. There's really no other way I can think of to do that.

Had you considered giving ParmEd the ability to generate OpenMM ffxml files?

It's on the (very long) to-do list. I want ParmEd to be able to read and write them. Prerequisite functionality needed to support that:

Once that is done, you should be able to convert any program's parameter files to any others (including ffxml). There are other aspects of ParmEd that need work as well (top of the list is coordinate handling, I think), so it will be awhile before I can get to this.

davidlmobley commented 9 years ago

@jchodera - can you expand on what you mean by this?

Practically, the strategies will likely involve one of two tools in the short term: Either (1) the OpenMM ForceField class, or (2) ParmEd. They have very different advantages. ForceField maximizes the ability to "mix and match" forcefields with systems (provided the forcefields have been converted to ffxml), while ParmEd supports reading native forcefield definition files but may not allow as much "mix and match" flexibility. In some sense, the choice may depend on how much we need to explore various forcefields.

Having not used this aspect yet, I don't see how it would improve mixing and matching (nor am I sure I totally know what you mean by that). Naively it seems like with ParmEd I already have that already - if I set up a protein with AMBER forcefield XX and a nucleic acid with AMBER forcefield YY and a ligand with GAFF and parameterize a cofactor by hand, I can just read the four into ParmEd and combine them using the + operator. OK, yes, I still have to deal with the parameterization of the individual objects with whatever tools... But once that's done, combining allows mixing and matching it seems, and then I can easily go straight into OpenMM by making a System, or dump to GROMACS or AMBER format. How does the ForceField class allow more mixing and matching or make it easier?

davidlmobley commented 9 years ago

@jchodera --

I think it is worthwhile brainstorming use cases that we'd eventually like to support, and then figure out what strategies get us the most coverage with the least pain, both initially and for the eventual evolution of the tool to cover all the use cases (and new ones not yet dreamed up).

But in terms of use cases, here are ones I immediately have in mind:

  1. Parameterize a GAFF ligand and possibly cofactors and (separately) a macromolecule (protein/nucleic acids)* and combine into a final system then solvate in water (plus neutralizing ions and specified ionic strength) for simulation in GROMACS, AMBER, or OpenMM 1a. Same as 1 but add a covalently bonded cofactor - i.e. a heme bound to the protein (and possibly also to the ligand). Tricky and probably requires an expert to do the setup in something like tleap before composing the final system 1b. Same as 1 but solvate in arbitrary mixture of solvents/buffers
  2. (for other people) Place protein+ligand system in membrane for simulation (i.e. 1 but in a membrane with a water environment)
  3. Same as 1 but in droplets
  4. Same as 1 but in the presence of an interface (membrane interface or liquid-vapor interface or perhaps even (for other people) a solid interface such as graphite )

* For our purposes here i'm willing to assume I already have a complete macromolecule with no missing residues and a protonation state I'm happy with, etc., and likewise for the ligand.

Probably there are others I'm not thinking of right now. What is it you would need?

jchodera commented 8 years ago

@andrrizzi has part of this done, but there are still things we plan to address.

Lnaden commented 7 years ago

Plenty to do yet, plenty of prep is in there now from the YAML building, but there is more that will need to be done going forward.

1.0 touch