choderalab / protons

OpenMM testbed for constant-pH methodologies.
http://protons.readthedocs.io/
MIT License
21 stars 13 forks source link

Wisdoms, Part 1; the carboxylates #1

Closed swails closed 3 years ago

swails commented 10 years ago

I've skimmed through the source code here (mostly looking at class methods and docstrings) and I'm providing unsolicited comments based on my experience with Mongan's method and its implementation in Amber. I cleaned up and fixed the implementation in sander and implemented the method in pmemd and pmemd.cuda, so my experience is reasonably comprehensive. I'll share what I've learned so far in a series of posts. Some of these will appear in an upcoming publication (that's on my Ph.D. advisor's desk...)

The first is primarily relevant to carboxylate residues, but applies to some extent to every titratable residue. The good results reported by Mongan and myself [DOI 10.1021/ct300512h] for carboxylate residues is largely accidental and result from the dynamics adopting 'bad' conformations. When I tried using a hybrid GB/explicit method that samples conformations in explicit solvent and protonation states in GB, I found that the computed pKas of the carboxylate residues were systematically low (in several cases by more than 3-4 pK units). I was using the same GB model for protonation state sampling as I did for the all-implicit method, so the only difference with respect to the protonation state sampling was the difference in conformational state sampling between implicit and explicit calculations. The only way the implicit solvent calculations got reasonable carboxylate pKas was by adopting structures that resulted in more exposed carboxylates. After too many months of experimentation---and 3 weeks before my defense---I realized that the reason the carboxylate residues were being so badly underestimated was that the effective GB radii on the carboxylate functional group were about 0.1 A too large on average. It was pretty obvious in retrospect -- the ghost hydrogens--2 attached to each carboxylate oxygen--are responsible. The effective radius of the oxygens in the 'deprotonated' AS4 model compound is about 0.1 A larger than that of ASP. When I decreased AS4 and GL4 carboxylate oxygen radii by 0.1 A (and recomputed the reference energies), the RMSE of my computed pKas to experiment went from ~2.2 pK units to 0.7 - 0.8 pK units.

The best way to handle this would be to exclude 'inactive' protons from the effective radius calculation. I'll modify the OpenMM implementations of the Amber GB models and drop them in this repo. This would require recomputing the reference energies, but it may be worthwhile. I only had time to verify that uniformly reducing the oxygen radii by 0.1 A yielded good results, so I'm not sure what (if any) improvement to expect compared to computing the 'correct' effective radii. Regardless, it is important to account for this for carboxylate residues at least.

jchodera commented 10 years ago

I've skimmed through the source code here (mostly looking at class methods and docstrings) and I'm providing unsolicited comments based on my experience with Mongan's method and its implementation in Amber. I cleaned up and fixed the implementation in sander and implemented the method in pmemd and pmemd.cuda, so my experience is reasonably comprehensive. I'll share what I've learned so far in a series of posts.

Thank you for sharing this wisdom! It is very much appreciated, and we would love you to be involved in OpenMM-related constant-pH work if you had interest!

jchodera commented 10 years ago

The best way to handle this would be to exclude 'inactive' protons from the effective radius calculation. I'll modify the OpenMM implementations of the Amber GB models and drop them in this repo.

I currently have an "alchemically modified GB" implementation via CustomGBForce where an alchemical lambda parameter controls the contribution of each atom to the computation of effective Born radii and energy computations, so it's possible we can use this to test, but I agree that it would be much better and faster if the standard GB implementations excluded "ghost" hydrogens as you suggest.

This would require recomputing the reference energies, but it may be worthwhile.

I think we need to include a module to automate the reference energy calculation, since I am concerned this may need to be recomputed even when small simulation parameter changes are made. We can verify or refute this hypothesis once we have the automated tool.

swails commented 10 years ago

On Wed, Nov 6, 2013 at 4:55 PM, John Chodera notifications@github.comwrote:

The best way to handle this would be to exclude 'inactive' protons from the effective radius calculation. I'll modify the OpenMM implementations of the Amber GB models and drop them in this repo.

I currently have an "alchemically modified GB" implementation via CustomGBForce where an alchemical lambda parameter controls the contribution of each atom to the computation of effective Born radii and energy computations, so it's possible we can use this to test, but I agree that it would be much better and faster if the standard GB implementations excluded "ghost" hydrogens as you suggest.

I actually put together a kludgey sample in my fork that might make things easier. 2 lines of code should make AmberPrmtopFile use the new "exclude-ghost-atom" versions of the GB forces (except GBSAOBC2Force) instead of the default ones that ship with OpenMM.

This would require recomputing the reference energies, but it may be worthwhile.

I think we need to include a module to automate the reference energy calculation, since I am concerned this may need to be recomputed even when small simulation parameter changes are made. We can verify or refute this hypothesis once we have the automated tool.

I have done this so many times that I have written Python scripts to carry out TI to estimate the reference energy followed by pH-REMD to adjust the reference energy to the value that will give the correct model compound pKa.

They all use sander (or pmemd or pmemd.cuda) though... I could certainly share with you the scripts and/or the underlying workflow if you're interested.

Jason M. Swails BioMaPS, Rutgers University Postdoctoral Researcher

jchodera commented 10 years ago

I have done this so many times that I have written Python scripts to carry out TI to estimate the reference energy followed by pH-REMD to adjust the reference energy to the value that will give the correct model compound pKa. They all use sander (or pmemd or pmemd.cuda) though... I could certainly share with you the scripts and/or the underlying workflow if you're interested.

That could be helpful! We may use alchemical (Hamiltonian) replica exchange + MBAR instead of TI (since it's easier to control the statistical error) and d/dlambda is harder to compute, but TI may actually have significant advantages in this case. Nonequilibrium methods (which generally are not faster, but would also be useful in tuning switching rates for explicit solvent) may also be useful.

I'm curious why you needed a two-stage workflow to first compute the reference free energy and then adjust to get the correct reference pKa. Shouldn't an accurate reference free energy guarantee the correct reference pKa in the pH-REMD calculation?

swails commented 10 years ago

On Wed, Nov 6, 2013 at 6:38 PM, John Chodera notifications@github.comwrote:

I have done this so many times that I have written Python scripts to carry out TI to estimate the reference energy followed by pH-REMD to adjust the reference energy to the value that will give the correct model compound pKa. They all use sander (or pmemd or pmemd.cuda) though... I could certainly share with you the scripts and/or the underlying workflow if you're interested.

That could be helpful! We may use alchemical (Hamiltonian) replica exchange + MBAR instead of TI (since it's easier to control the statistical error) and d/dlambda is harder to compute, but TI may actually have significant advantages in this case. Nonequilibrium methods (which generally are not faster, but would also be useful in tuning switching rates for explicit solvent) may also be useful.

I set up a github repo with all of my scripts (most of them will be of no interest, but ConstpH_TI.py and Ref*Titrate.py contains the relevant workflows).

I use ConstpH_TI.py to compute the free energy (done in a single step if you have scipy available) using TI in Amber. Then I use Ref(Rem)Titrate.py to perform a titration curve with sander (optionally using pH-REMD, which I highly recommend) to validate the reference energy computed with ConstpH_TI.py. (The reference energy from ConstpH_TI.py must be added to cpinutils/residues.py in the AmberTools distribution).

I'm curious why you needed a two-stage workflow to first compute the

reference free energy and then adjust to get the correct reference pKa. Shouldn't an accurate reference free energy guarantee the correct reference pKa in the pH-REMD calculation?

Not for the carboxylates. You need to correct for the entropy involved with having 2 degenerate syn-hydrogens and 2 degenerate anti-hydrogens. For the other residues, it's a way of validating the reference energy. In my experience, the results of the TI are good enough.

Jason M. Swails BioMaPS, Rutgers University Postdoctoral Researcher

jchodera commented 10 years ago

Not for the carboxylates. You need to correct for the entropy involved with having 2 degenerate syn-hydrogens and 2 degenerate anti-hydrogens.

Is the entropic contribution analytically computable (e.g. kT ln N, where N is the degeneracy)?

swails commented 10 years ago

On Wed, 2013-11-06 at 20:40 -0800, John Chodera wrote:

    Not for the carboxylates. You need to correct for the entropy
    involved with having 2 degenerate syn-hydrogens and 2
    degenerate anti-hydrogens.

Is the entropic contribution analytically computable (e.g. kT ln N, where N is the degeneracy)?

Assuming you know the degeneracy this works. The syn- and anti- protons are obviously not degenerate, so it's not a simple kT ln 4 or kT ln 2. I think the energy difference between syn- and anti- is small enough that the anti- protons contribute a noticeable amount (at least for AS4).

I've never considered this to be important, though. What's important is that when you titrate at pH==pKa, the protonation fraction is 0.5 and the relative populations of cis- to anti- are correct. As a result, I just get a good initial guess for the reference energy and then run a titration to optimize it (and verify the relative protonation populations are correct).

Jason M. Swails BioMaPS, Rutgers University Postdoctoral Researcher

jchodera commented 10 years ago

I've never considered this to be important, though. What's important is that when you titrate at pH==pKa, the protonation fraction is 0.5 and the relative populations of cis- to anti- are correct.

This should be mathematically equivalent to computing the correct free energy between the titration states for the molecular mechanics forcefield, so it's odd that this would be unimportant. Operationally, however, I do see there could be an advantage in a two-stage procedure where the first stage estimates the free energy difference and the second stage adjusts to ensure 0.5 fractions. I think, however, the whole procedure could be simplified if we focused on getting the correct free energy differences in the first stage.

Excellent point about the degeneracies (O1 and O2 protonation) and near-degeneracies (syn and anti). I imagine that both of these problems stem from the torsional barriers involving the oxygen and hydrogen. Were these torsion barriers reduced or eliminated in some way (for example, during intermediate alchemical states), convergence of the correct free energy between the protomers would be accelerated and the correct free energy would be obtained.

What do you think about simply removing the torsion barrier during intermediate alchemical states when the free energy difference between the protomers is computed, and then verifying this scheme works using constant-pH REMD (simply as a validation of the method)?

swails commented 10 years ago

On Thu, Nov 7, 2013 at 3:27 PM, John Chodera notifications@github.comwrote:

I've never considered this to be important, though. What's important is that when you titrate at pH==pKa, the protonation fraction is 0.5 and the relative populations of cis- to anti- are correct.

This should be mathematically equivalent to computing the correct free energy between the titration states for the molecular mechanics forcefield, so it's odd that this would be unimportant.

I agree completely. Assuming that your implementation of CpHMD and your FE calculation are correct, these are equivalent. You can validate it with residues that do not require an entropy correction, like cysteine, lysine, and tyrosine. (Basically, set the reference energy equal to the TI- or H-REMD-computed free energy and make sure that you get a titration curve that gives a pKa of 0). For CYS, LYS and TYR no correction should be needed (and none is). For AS4 and GL4, you need to correct the free energy you compute from the TI stage by the number and relative energies of distinct states (syn- and anti-) along with their degeneracies (2). Getting the correction is what I claim is unimportant. The correction will be fairly small (between kT ln(2) and kT ln(4), formally), so just calculate the titration curve after getting the free energy of charging a syn- proton and shift the free energy by the amount required to get a pKa of 0. Since this is equivalent to getting the relative free energies of every state and you should always validate the reference energy by titrating the model compound, anyway, what's the point of making the first step super complicated to get it 'right'? I've never bothered because I never needed to.

Operationally, however, I do see there could be an advantage in a two-stage

procedure where the first stage estimates the free energy difference and the second stage adjusts to ensure 0.5 fractions. I think, however, the whole procedure could be simplified if we focused on getting the correct free energy differences in the first stage.

In my experience, I had found the 2-step procedure to be simpler to implement than getting a correction-free reference energy for the carboxylate residues. Each residue with tautomeric states will require a different procedure -- case in point HIS vs. ASP/GLU. It may be easier to implement stage 1 the way you describe than I thought it would be, but I never needed to do it.

Excellent point about the degeneracies (O1 and O2 protonation) and

near-degeneracies (syn and anti). I imagine that both of these problems stem from the torsional barriers involving the oxygen and hydrogen. Were these torsion barriers reduced or eliminated in some way (for example, during intermediate alchemical states), convergence of the correct free energy between the protomers would be accelerated and the correct free energy would be obtained.

The syn- and anti- protons never swap in AS4 and GL4. There's an improper torsion placed on them to keep them in their relative positions. The different syn- and anti- states are sampled via MC protonation state changes. You would have to compute this relative FE difference by doing 2 different alchemical changes: appear a proton in the syn- position and appear a proton in the anti- position. (And you obviously need to compute the FE using AS4, not ASH).

What do you think about simply removing the torsion barrier during

intermediate alchemical s

tates when the free energy difference between the protomers is computed, and then verifying this scheme works using constant-pH REMD (simply as a validation of the method)?

Sounds reasonable if it's easy enough to implement and test, I suppose. Like I said, I haven't given much thought about relative protomer energies in carboxylate residues since I decided it would simply be easier to correct the reference energy (in a single step) using a titration of the model compound (which converges very quickly).

Jason M. Swails BioMaPS, Rutgers University Postdoctoral Researcher

jchodera commented 10 years ago

For AS4 and GL4, you need to correct the free energy you compute from the TI stage by the number and relative energies of distinct states (syn- and anti-) along with their degeneracies (2). The syn- and anti- protons never swap in AS4 and GL4. There's an improper torsion placed on them to keep them in their relative positions. The different syn- and anti- states are sampled via MC protonation state changes. You would have to compute this relative FE difference by doing 2 different alchemical changes: appear a proton in the syn- position and appear a proton in the anti- position. (And you obviously need to compute the FE using AS4, not ASH).

Ah, I see! The issue here is that syn and anti are really different chemical species as far as AMBER is concerned, since they can never interconvert. So then yes, we would need to compute the free energy differences between each of these protomeric species.

The sampling among these species by MC is probably much more efficient than waiting for interconversion anyway.

Sounds reasonable if it's easy enough to implement and test, I suppose. Like I said, I haven't given much thought about relative protomer energies in carboxylate residues since I decided it would simply be easier to correct the reference energy (in a single step) using a titration of the model compound (which converges very quickly).

Good point. These different calculation schemes are not quite equivalent, and one may be much more efficient. The free energy of interest, however, can be obtained from either kind of calculation.

Let's turn the question around, then: What would you think of setting the relative free energy difference to something very crude (say zero) and then running a constant pH REMD calculation across a wide range of pH to estimate the correct protomer free energy differences? The problem could be that the pH range one would need to simulate may be quite large to ensure all protomers are sampled sufficiently, and the spacing of the pH might be difficult to determine a priori.

So maybe your two-stage procedure is the simplest after all!

I'll work on implementation of the weekend. Thanks again for all of the excellent advice!

swails commented 10 years ago

On Thu, Nov 7, 2013 at 5:52 PM, John Chodera notifications@github.comwrote:

For AS4 and GL4, you need to correct the free energy you compute from the TI stage by the number and relative energies of distinct states (syn- and anti-) along with their degeneracies (2). The syn- and anti- protons never swap in AS4 and GL4. There's an improper torsion placed on them to keep them in their relative positions. The different syn- and anti- states are sampled via MC protonation state changes. You would have to compute this relative FE difference by doing 2 different alchemical changes: appear a proton in the syn- position and appear a proton in the anti- position. (And you obviously need to compute the FE using AS4, not ASH).

Ah, I see! The issue here is that syn and anti are really different chemical species as far as AMBER is concerned, since they can never interconvert. So then yes, we would need to compute the free energy differences between each of these protomeric species.

The sampling among these species by MC is probably much more efficient than waiting for interconversion anyway.

It is. I've run the simulations (comparing ASH which allows interconversions and AS4 titrating at low pH).

Sounds reasonable if it's easy enough to implement and test, I suppose. Like I said, I haven't given much thought about relative protomer energies in carboxylate residues since I decided it would simply be easier to correct the reference energy (in a single step) using a titration of the model compound (which converges very quickly).

Good point. These different calculation schemes are not quite equivalent, and one may be much more efficient. The free energy of interest, however, can be obtained from either kind of calculation.

Let's turn the question around, then: What would you think of setting the relative free energy difference to something very crude (say zero) and then running a constant pH REMD calculation across a wide range of pH to estimate the correct protomer free energy differences? The problem could be that the pH range one would need to simulate may be quite large to ensure all protomers are sampled sufficiently, and the spacing of the pH might be difficult to determine a priori.

This will not work. The reference energies span too large of a range. The amino acids, for instance:

AS4: 26.8894581 kcal/mol GL4: 8.4057785 kcal/mol TYR: 65.113428 kcal/mol HIP: 2.84183 kcal/mol, 6.483630 kcal/mol (2 tautomers) LYS: 15.1417977 kcal/mol CYS: 77.4666763 kcal/mol

The pH range necessary to cover this ca. 80 kcal/mol range is unreasonably large. The nucleic acids have (some) even larger reference energies (up to ~150 kcal/mol).

So maybe your two-stage procedure is the simplest after all!

It's the simplest one I've found (and it has the advantage of validating the method and optimizing the reference energy in one step).

I'll work on implementation of the weekend. Thanks again for all of the excellent advice!

Cheers!

Jason M. Swails BioMaPS, Rutgers University Postdoctoral Researcher

swails commented 10 years ago

Let's turn the question around, then: What would you think of setting the relative free energy difference to something very crude (say zero) and then running a constant pH REMD calculation across a wide range of pH to estimate the correct protomer free energy differences? The problem could be that the pH range one would need to simulate may be quite large to ensure all protomers are sampled sufficiently, and the spacing of the pH might be difficult to determine a priori.

I thought about this a little more, and I think I have a good solution. As long as the crude guess was good enough, this will work. Zero is a bad starting guess as I've already explained previously. A starting guess that I would expect is 'good enough' is just the energy difference between the two charge states in the initial conformation. Think of it as a free energy averaged over 1 snapshot between frames that are too large. It would offend the sensibilities of even the least sensible people if you tried to claim that's accurate for a free energy. But it should get you in the right ballpark with virtually no cost and can be easily implemented in a reference energy derivation workflow.