Protonation and tautomeric states

jchodera commented 8 years ago

As noted in https://github.com/choderalab/perses/pull/64#issuecomment-172665228, we have to think about how to handle protonation and tautomeric states within perses.

We have a few options for this. The most straightforward is to simply let the various protonation/tautomeric states of both polymers (e.g. protein residues) and small molecules be selected as targets for transformation proposals. This would involve changing our proposals to be two-stage:

select molecule ~ P(molecule | current state)
select protomer ~ P(protomer | molecule)

For example, for a small molecule, we might draw the molecule identity and then select from among the protonation states accessible by the molecule. For a protein residue mutation, we might select the new residue identity and then select among the protonation states accessible to that residue. Probabilities could be uniform, proportional to the population of the protomeric species in solution, or possibly even tuned to be proportional to the population in the complex (though this seems much harder).

We will need a few things to make this work:

A way to recognize that the various protonation states all represent equivalent chemical species that should have equivalent target probabilities in the SAMS recursion. This is most simply handled by having the "key" we use to keep track of which species has which target probabilities and log weights be insensitive to the choice of protonation state.
A way to introduce an additional log bias weight to account for the aqueous population of each titration state. This should be such that, for the small molecule or isolated (blocked) residue in solution, the target populations \hat{\pi}_k of each titration state match the aqueous populations given by Epik or the population computed from the pKa at the given pH. This log weight would be introduced into the acceptance probability.
These log weights would need to be calibrated by SAMS to give the aqueous populations of each titration state from Epik or the population computed from the pKa. We could pre-compute them for the given forcefield and nonbonded treatment. Alternatively, we could figure out a more clever way to also include these calibration simulations in the recursion scheme so that this is calibrated on the fly, but it might be complicated to figure out a nice, general framework to make this simple.

pgrinaway commented 8 years ago

I think at least initially (before the first paper), if we implement this, we should use a uniform proposal and not do anything fancy--I think getting that paper out is the priority, and it sounds like this could get pretty complicated*. Perhaps we could point out in the paper in a bit of foreshadowing that something like this is possible?

*note that I'm not shying away from additional complexity, nor am I shying away from treating this very important issue. I just don't want to have an ever-expanding list of features when we're trying to get out a paper about the main idea of design with SAMS and RJMC.

pgrinaway commented 8 years ago

That said, after the first paper, I think it would be really interesting to explore the last option, where we examine including the calibration simulations in the recursion.

steven-albanese commented 8 years ago

+1 on @pgrinaway's point about the first paper

jchodera commented 8 years ago

We certainly don't need to worry about protonation states for a first paper that targets T4 lysozyme cavity mutants or safe host-guest systems---we would just exclude compounds with protonation state issues.

We should definitely make sure that the architecture allows this feature to be added when appropriate, however, as it will be critical for any nontrivial system. So it's worth doing a bit of brainstorming now to make sure our design allows us to slot in the innards when we're ready to.

jchodera commented 8 years ago

As noted in #64, if we add in the appropriate architecture, we can get one feature for free without needing any of the calibration stuff or pKa prediction machinery: sampling the appropriate HIS tautomers (HID/HIE). So if we build the framework now and just slot in what we need for selecting the appropriate HIS tautomers, we can come back later and explore the options above for small molecules and protonation states.

pgrinaway commented 8 years ago

We should definitely make sure that the architecture allows this feature to be added when appropriate, however, as it will be critical for any nontrivial system.

Good point, +1.

Brainstorming a bit then--where would the API for this go? I could imagine a couple options:

Allowing an optional argument to the ProposalEngine constructor that takes an object that selects protomers/tautomers. Maybe this is a bit janky, though.
A separate proposal stage that takes the output of the first proposal and then chooses a tautomer (perhaps more obvious as to what is conceptually happening, and maybe easier to combine).
Including the relative weights in a Sampler.

Thoughts/comments? I am just brainstorming a bit here so maybe those are not so good...

jchodera commented 8 years ago

I wonder if the machinery can primarily be hidden inside the ProposalEngine subclasses.

If we didn't need to calibrate the bias weights and were able to compute them from the aqueous protomer populations directly, we could have the SmallMoleculeProposalEngine can call out to Epik as needed and the PolymerProposalEngine could use reference pKas. We would just need to make sure that the ProposalEngine returns both the proposed topology and the key that identifies how this chemical species is to be counted by the SAMS machinery: eg "HIS" instead of HID or HIE mutants. We may also want to record the protomer variant information for later analysis. This would let us use a simple model---either precalibration for amino acid protomers, or estimates from subtracting off the GBSA solvation energy for small molecules. I don't believe we currently have this in our API (maybe "metadata" for the protonation state details, but where does the "key" used for SAMS free energy bias tracking get returned?).

If we want to also calibrate the log bias for each protomer in the same solvent model and simulation settings that the binding simulations are running with, things get tricky. This is where some thought might help us make better architectural decisions. I wonder if we can simply add another SAMS simulation for the small molecule in solvent where the target probabilities of protonation states are fixed to the Epik aqueous populations, though we would track these on a per-protomer level rather than a per-compound level.

There is probably a very simple implementation hiding here. If ProposalEngine returned (compound, protomer) pairs, SAMS uses the more specific version (protomer?) for estimating free energies, and the OptimizationEngine could allow us to specify target probabilities in terms of either compound or protomer, would that handle everything?

jchodera commented 8 years ago

(BTW, I've updated the papers horizon google doc to map out a way we can take small steps to the ultimate solution here over the course of several papers.)

choderalab / perses

Protonation and tautomeric states #66