Move Yank class away from parsing and forwarding options to other classes

andrrizzi commented 8 years ago

Since all the relevant objects are initialized inside Yank (i.e. restraint, alchemical factory, sampling class), we need to pass all the parameters for all classes in a big kwargs dictionary. Then Yank has to use introspection to separate them by class to initialize all the objects correctly. I propose to make an effort to move the initialization of these objects outside Yank.

From a user perspective, the final interface would become something like

sampler = ModifiedHamiltonianExchange(**sampler_options)
factory = AlchemicalFactory(**alchemical_options)
restraint = Harmonic(**harmonic_options)
yank = Yank(store_directory, sampler, factory, **yank_options)
yank.create_phases(thermodynamic_state, restraint, *alchemical_phases)

This is of course only to give an idea, not an actual proposal, but I would implement the restraint part this way now, if we agree that the API should move towards this direction before 1.0 release. I realize that we may need to change a lot of things for this to happen, but I really believe this will turn out to be much easier to maintain and extend in the future.

Question: do we want to attach restraints to phases and then simply perform checks to determine if the restraint is appropriate for that type of phase (i.e. raise an exception if it's not a complex), or pass only a single restraint argument for all phases and adding the restraint automatically (as we currently do)?

jchodera commented 8 years ago

I like it!

However, to have yank.create_phases() be able to create all the phases, all of these things have to be turned into factories that can create, for example, a new set of restraints or a new sampler for each phase. So it would probably need to look more like

sampler_factory = ModifiedHamiltonianExchangeFactory(**sampler_options)
alchemical_factory = AlchemicalFactory(**alchemical_options)
restraint_factory = HarmonicFactory(**harmonic_options)
yank = Yank(store_directory, sampler_factory, alchemical_factory, **yank_options)
yank.create_phases(thermodynamic_state, restraint_ractory, *alchemical_phases)

Maybe there's a way to avoid factory overload?

Question: do we want to attach restraints to phases and then simply perform checks to determine if the restraint is appropriate for that type of phase (i.e. raise an exception if it's not a complex), or pass only a single restraint argument for all phases and adding the restraint automatically (as we currently do)?

Does Yank know that certain kinds of phases should be different from others? If so, it can probably determine whether or not a restraint needs to be created and added.

andrrizzi commented 8 years ago

all of these things have to be turned into factories

I might not see all the implementative hurdles at this point, but actually I think we won't need it. For example, phases are simulated sequentially, which means that once the first phase is done we can just sampler.resume('phase2.nc') and sampler.run() without the need to create two separate objects. If we need multiple restraints for each phase, we'll just have the user pass 2 Restraint objects and attach them to the AlchemicalPhases.

Does Yank know that certain kinds of phases should be different from others?

Yes, currently we pass the restraint type as a single option, and Yank adds the restraints only if it's a complex (i.e. if it receptor_atoms is not empty). I think the question we should focus is: are there cases in which we want two phases to employ two different restraints (relative free energies maybe)?

andrrizzi commented 8 years ago

all of these things have to be turned into factories

Worst case scenario, I guess we can deepcopy the objects that we need before using them.

andrrizzi commented 8 years ago

If there are no other objections on this design, I'll start implementing the restraints part today.

Lnaden commented 8 years ago

+1

jchodera commented 8 years ago

Could we finalize the proposed API as it would be used first? I'm just unclear on what the final proposal was.

andrrizzi commented 8 years ago

The API for the restraints would become (I would discuss changes in how we pass sampler and alchemical parameters later)

yank = Yank(store_directory, **yank_alchemy_repex_options)

# Configure complex and solvent phases
complex_phase = AlchemicalPhase(..., restraint=Harmonic(..., **restraint_parameters))
solvent_phase = AlchemicalPhase(..., restraint=None)

yank.create_phases(thermodynamic_state, complex_phase, solvent_phase)

in yank.create_phases() we raise an exception if we apply a restraint to a system where there is no receptor.

jchodera commented 8 years ago

I'm not sure I follow. Would you use yank.restraints.create_restraints() to create the restraint?

Also, what happens if we have more than two phases? Shouldn't yank.createPhases() take a list of phases? Or will two phases be hardcoded?

I think it's worth trying to nail down a workable version of what you originally proposed

sampler = ModifiedHamiltonianExchange(**sampler_options)
factory = AlchemicalFactory(**alchemical_options)
restraint = Harmonic(**harmonic_options)
yank = Yank(store_directory, sampler, factory, **yank_options)
yank.create_phases(thermodynamic_state, restraint, *alchemical_phases)

since this could considerably simplify everything.

Did you get a chance to look at how the perses sampler stacks are defined?

andrrizzi commented 8 years ago

Would you use yank.restraints.create_restraints() to create the restraint?

The user/driver class can use that method to create the restraint (and YamlBuilder will definitely use it), but there's no need to call that function if someone wants to run Yank from the API using a simple script.

Also, what happens if we have more than two phases? Shouldn't yank.createPhases() take a list of phases? Or will two phases be hardcoded?

The signature is def Yank.create_phases(self, thermodynamic_state, *alchemical_phases):, so Python allows to pass the phases both as a list or as comma separated arguments. It won't be hardcoded.

I think it's worth trying to nail down a workable version of what you originally proposed. Did you get a chance to look at how the perses sampler stacks are defined?

Will take a look now, thanks! I'll be happy to make an entire API proposal before implementing the restraint part. That will take me some more thinking though to be sure everything is feasible.

jchodera commented 8 years ago

The signature is def Yank.create_phases(self, thermodynamic_state, *alchemical_phases):, so Python allows to pass the phases both as a list or as comma separated arguments. It won't be hardcoded.

What if we need to add optional arguments to Yank.create_phases. Will that be a problem?

Will take a look now, thanks! I'll be happy to make an entire API proposal before implementing the restraint part. That will take me some more thinking though to be sure everything is feasible.

I think we could probably save a whole lot of pain if we can figure out a way to restructure the setup of YANK calculations to be simpler. It's worth spending a day or so to evaluate this.

Another example of a simpler setup API is the SAMS sampler stack.

I bet you could think of ways to drastically improve on these idioms, however!

andrrizzi commented 8 years ago

What if we need to add optional arguments to Yank.create_phases. Will that be a problem?

Nope! Previous code will still be compatible.

Another example of a simpler setup API is the SAMS sampler stack.

Thanks! I'll take a look at it, and work out something clearer than what I proposed.

jchodera commented 8 years ago

That won't work. Samplers are bound to a specific storage file, and take a specific system as input. Restraints are added to specific atoms. Even if we don't call them factories, we will have to turn them into factories that take very general configuration arguments in their constructors and then generate specific instances adapted to the system and storage file.

It isn't a bad way to do things, though. The perses and sams projects use lots of factories, but not everything is a factory. I actually like the designs for perses and sams much better, but not sure if it scales to a full fledged system like yank. Maybe you can poke around perses/tests/testsystems.py and sams/tests/testsystems.py to get an idea of that model and see what you think?

andrrizzi commented 8 years ago

In the design above, we pass to Yank a separate restraint object per phase, so we wouldn't need to create new specific instances with factories, the user/driver class is supposed to do that.

take very general configuration arguments in their constructors

That's exactly the idea! We should use samplers/alchemical factory constructors only for general configuration, then Yank is responsible to use them for a specific system. Note that we don't necessarily need to use a factory design, although we might anyway. It's just a matter to add the support for a Sampler object to switch storage file and system (and in general anything that is not the general configuration passed through the constructor). I think the two approaches are in practice equivalent, but I personally think the second one is more elegant.

I still have to check the details of the feasibility though, so maybe there's still some obstacle I don't see. I think this will be clearer when I'll post the full proposal.

andrrizzi commented 8 years ago

Here is the proposal, sorry for the wall of text. I've made bold important points. Let me know what you think and if you find cases that are not covered by these abstractions.

General ideas

Similarly to OpenMM, I think we should make a clear distinction between a library layer, and an application layer. The library layer should be general and very extensible if we want people to use Yank to experiment with different methods (us included), in the application layer we can hide as many details as we want to encapsulate current best practices. Currently the Yank class attempts to do both, but I don't think this is sustainable. So I think we should have something like

Library layer: classes to run and analyze a single alchemical phase
- AlchemicalPhase
- samplers, restraints, system preparation, alchemy etc.
Application layer: classes designed to run and analyze a specific type of free energy calculation. They make use of AlchemicalPhase class.
- BindingFreeEnergy
- HydrationFreeEnergy
- SolvationFreeEnergy
- TransferFreeEnergy

Main differences with what we have:

The single responsibility of Samplers implementations should be how to sample, and it should be independent on what to sample (handled by States), and how to store samples (handled by Reporters). This should decouple the sampler implementation from the particular application. The Sampler abstraction needs to be general enough to accommodate replica exchange, SAMS and any user-defined sampler!
I tried to separate configuration of the objects from their usage by not making constructors part of the interfaces, which means our library objects should not depend on them (application objects will have to). The constructors become the main way to the user to configure different implementation of abstractions (e.g. samplers, alchemical factories, restraints) that require different parameters to be defined. This should increase extensibility.
The expanded state(s) are not passed to Samplers since they are not integrated, but to AlchemicalPhaseReporter which is responsible to compute the energies for states that are not sampled.
I've introduced a new Topography abstractions that can be used to represent interesting areas of our system (binding site, receptor atoms, alchemical atoms etc.)

Examples

An example of how using the library layer would look like


# Configure phase
storage = Storage('path/to/file')
analyzer = AlchemicalPhaseAnalyzer(store_volumes=False, sanity_check_every=500, storage=storage)
sampler = MySampler(mc_moves=True, displacement_sigma=1.0*unit.nanometers,
                    time_step=1.0*unit.femtoseconds, n_steps_per_iteration=500)
factory = AlchemicalFactory(softcore_beta=0.0, annihilate_electrostatics=False)
phase = AlchemicalPhase(factory, sampler, analyzer)

# Prepare initialization
prmtop_file = openmm.app.AmberPrmtopFile('topology.prmtop')
system = prmtop_file.createSystem()
topology = prmtop_file.topology
positions = openmm.app.AmberInpcrdFile('positions.inpcrd').positions

state = ThermodynamicState(system=system, temperature=300*unit.kelvin, pressure=1.0*unit.atmosphere)
sampler_state = SamplerState(positions=positions)
topography = ComplexTopography(topology=topology, ligand_dsl='resname MOL')
restraint = Harmonic(k=1.0)
protocol = AlchemicalState.from_lambda_values(lambda_electrostatics=[1.0, 0.5, 0.0, 0.0, 0.0],
                                              lambda_sterics=[1.0, 1.0, 1.0, 0.5, 0.0])

# Initialize phase and run
phase.initialize(state, restraint, protocol, topography, storage)
phase.run(1000)
analyzer.analyze()

An example of a simple implementation for for a generic MultiStateSampler exploiting the State abstraction (see next section)

class MySampler(MultiStateSampler):

    def run_iterations(self, n_iterations):
        for n_iteration in range(n_iterations):
            for state_id in range(len(self._thermodynamic_states)):
                thermodynamic_state = self._thermodynamic_states[state_id]
                sampler_state = self._sampler_states[state_id]

                if thermodynamic_state.is_context_compatible(self._context):
                    thermodynamic_state.apply_to_context(self._context)
                else:
                    del self._context
                    self._context = thermodynamic_state.create_context(self._integrator, self._platform)

                sampler_state.apply_to_context(self._context)
                self._integrator.step(self._n_steps_per_iterations)
                sampler_state.update(self._context)

            # Do sampler stuff (e.g. exchange thermodynamic states, propose new thermodynamic states
            # using a proposal engine, report mixing statistics to user)

            self._reporter.report(self._thermodynamic_states, self._sampler_states)

Abstractions

I wrote a set of abstractions that I think we should use as dependency for our objects. I didn't have time to think of a general API for analysis and storage but

an Analyzer can be just an extension of a Reporter since it knows how the info is stored
For Storage there are nice examples in perses and openpathsampling.

# =============================================================================
# Storage
# =============================================================================

class Storage:
    """TODO: High-level storage interface that decouple representation from format.

    See netcdfplus package in openpathsampling, or storage module in perses for
    inspirational examples.
    """

class Storable:
    """Represent any object that can be saved into a storage."""

    def storage(self):
        """The storage associated to this object."""

    def is_stored(self):
        """True if the associated storage contains a complete representation of
        the object."""

    def from_storage(self, storage):
        """Constructor. Restore a previously saved object into memory."""

    def save(self):
        """Save the object into its associated storage."""

# =============================================================================
# States
# =============================================================================

class State:
    """Generally represent the state, or a portion of the state, of an OpenMM
    Context.

    Logically, a State is any object that can be applied to a Context.

    """
    def apply_to_context(self, context):
        """Set the parameters of the given context to be consistent with this
        state.

        Raises
        ------
        StateException
            If is_context_compatible(context) is False.

        """

    def is_context_compatible(self, context):
        """Return True if the given context can be set to this state.

        False, generally means that the thermodynamic state of the context is
        incompatible (e.g. different ensemble, different number of particles).

        """

class SamplerState(State):
    """Represent the portion of the state of an OpenMM Context that changes
    with integration.

    SamplerStates provide an update_from_context method that allow the
    application to keep this state up-to-date after integration. It guarantees
    to consistently set box vectors, and particles positions and velocities
    of the context.

    """
    def positions(self):

    def velocities(self):

    def box_vectors(self):

    def update_from_context(self, context):
        """Update this state from the given context."""

class ThermodynamicState(State):
    """Completely define the thermodynamic state of an OpenMM Context.

    ThermodynamicStates is responsible to store quantities that do not change
    with integration. They are also associated to a particular System object,
    and they are responsible to create new OpenMM Context objects when provided
    with an integrator.

    TODO: will we need/able to extend this formalism to micro/grand canonical
    ensembles?

    """
    def system(self):

    def temperature(self):

    def volume(self):
        """The volume of the system. None, if the volume fluctuates."""

    def pressure(self):
        """The pressure of the system. None, if the pressure fluctuates."""

    def create_context(self, integrator, platform):
        """Return a new context in this thermodynamic state."""

    def is_state_compatible(self, thermodynamic_state):
        """Return True if contexts created by the given ThermodynamicState
        can be converted to this state.

        False, implies that a new Context will have to be created to sample
        this particular state. This should be functionally equivalent to the
        method is_context_compatible with the exception that you don't have
        to create a Context object to test compatibility.

        """

    def reduced_potential(self, context):
        """Return the reduced potential for the context."""

    def reduced_potential_from_state(self, sampler_state):
        """Return the reduced potential for the sampler state."""

# =============================================================================
# Reporters
# =============================================================================

class MultiStateReporter:
    """A reporter for multiple thermodynamic state simulations."""

    def report(self, thermodynamic_states, sampler_states):
        """Save the interesting information extracted from the given states.

        This method is called at every iteration by a MultiStateSampler, and it
        is responsible to sparsify the reports. It could also show sanity check
        reports for the user at regular intervals or at particular events of the
        simulation.

        """

class AlchemicalPhaseReporter(MultiStateReporter):
    """Report also information about unsampled thermodynamic states."""

    def unsampled_states(self):
        """A list of ThermodynamicStates that are not sampled."""

class AlchemicalPhaseAnalyzer(AlchemicalPhaseReporter):
    """TODO: A reporter capable of performing analysis on the stored data."""

# =============================================================================
# Sampler
# =============================================================================

class MultiStateSampler:
    """Represent an object capable of sampling multiple thermodynamic states."""

    def is_initialized(self):
        pass

    def initialize(self, thermodynamic_states, sampler_states, reporter):
        """Set the initial ThermodynamicStates and SamplerStates to sample.

        Parameters
        ----------
        thermodynamic_states : iterable of ThermodynamicStates
            Thermodynamic states of the systems to sample (or initial
            thermodynamic states if they are sampled too).
        sampler_states : iterable of SamplerStates
            Initial sampler states for the given thermodynamic states. Must
            have the same dimension of thermodynamic_states. These states
            are updated with the new Context state at each iteration.
        reporter : MultiStateReporter
            The reporter used to store the samples. Its method report is
            called at every iteration. It is the reporter's responsibility
            to sparsify the reports.

        """

    def run_iterations(self, n_iterations):
        """Run a fixed number of iterations."""

    def run(self, equilibration_detector):
        """IN THE FUTURE: Run until the equilibration detector determines
        that enough samples have been collected."""

# =============================================================================
# System regions
# =============================================================================

class SystemRegion:
    """Represent a subset of atoms, bonds, angles and dihedral of a system."""

    def topology(self):
        """The topology for this SystemRegion."""

    def atoms(self):
        """The indices of the system particles included in this region."""

    def bonds(self):
        """The indices of the system bonds included in this region."""

    def angles(self):
        """The indices of the system angles included in this region."""

    def torsions(self):
        """The indices of the system angles included in this region."""

    def add_group(self, dsl_expression):
        """Add atoms, bonds, angles and torsions associated to the atoms
        selected by the DSL expression."""

    def is_topology_compatible(self, topology):
        """True if this region refers to the same topology."""

class Topography(SystemRegion):
    """An annotated SystemRegion."""

    def solvent_region(self):
        """Return the portion of the region that belongs to the solvent."""

    def alchemical_region(self):
        """Return the portion of the region that needs to be alchemically modified."""

class ComplexTopography(Topography):
    """Topography of a ligand-receptor complex."""

    def receptor_region(self):

    def ligand_region(self):

    def bindind_site(self):

class SoluteTopography(Topography):
    """Topography of a solute."""

    def solute_region(self):

class RelativeTopography(Topography):
    """Topography for systems for relative free energy calculations."""

    def molecule1_region(self):
        """Return the portion of the region that belongs to the first molecule."""

    def molecule2_region(self):
        """Return the portion of the region that belongs to the second molecule."""

# =============================================================================
# Restraint
# =============================================================================

class Restraint:

    def standard_state_correction(self, thermodynamic_state):
        """Get the standard state correction for the given thermodynamic state."""

    def restrain_state(self, thermodynamic_state, complex_topography=None, sampler_state=None):
        """Add the restraint Force (or Forces) to the thermodynamic state's system.

        Notes
        -----
        Parameters that are internally undefined can be determined automatically
        if sampler_state is not None. Same for restrained atoms if they are
        internally undefined and complex_topography is not None.
        """

# =============================================================================
# Alchemical factory
# =============================================================================

class AlchemicalFactory:

    def create_alchemical_system(self, reference_state, system_region):
        """Return an alchemically perturbed OpenMM System."""

# =============================================================================
# Alchemical phase
# =============================================================================

class AlchemicalPhase:
    """General alchemical free energy calculation of a single thermodynamic leg."""

    def __init__(self, alchemical_factory, sampler, analyzer):

    def initialize(self, reference_state, restraint, alchemical_states, topography, storage):
        """Create the alchemical system and initialize the sampler"""

    def run_iterations(self, n_iterations):

jchodera commented 8 years ago

@andrrizzi, this looks great! Can we find some time to discuss Wed or Thu?

jchodera commented 8 years ago

Just elaborating on the high-level API with a few minor changes.

Should the analysis engine analyze a single phase, or should that analyze a set of two or more phases so it can report on the free energy of binding, etc?
How do we specify both complex and ligand phases? The example just features one phase.
I love the idea of having the sampler be more independent, so that it will be easier to compose more elaborate sampling schemes. Do we go the whole way and adopt something like perses, with an MCMCSampler at its base? If so, we could separate our current sampling scheme into a GHMC sampler block followed by ligand Monte Carlo rotation and translation blocks. We could clean up the already written mcmc module to do this. The downside is that this approach incurs overhead of creating/destroying a Context every iteration, and I'm not sure what the current overhead on this is. On the other hand, it makes a lot of things very simple, including the use of scaled charges to experiment with alternative forms of alchemical intermediates (that don't use the current approach to PME).

andrrizzi commented 8 years ago

I think we should have both. We need an Analyzer(Reporter) for a single AlchemicalPhase for the sake of flexibility. Then, for example, our application classes like BindingFreeEnergy can expose a BindingFreeEnergy.analyze() method that knows how to mix correctly the results from the two Analyzers.
We can have application classes that take care of that similar to how Yank does it right now, but with less freedom. If the user wants more freedom, I thought he could setup the simulation with two AlchemicalPhase. Something similar to (details to be defined)

# Configure complex phase
storage = Storage('path/to/complex_file.nc')
analyzer = AlchemicalPhaseAnalyzer(store_volumes=False, sanity_check_every=500, storage=storage)
sampler = MySampler(integrator=Langevin(temperature, ...), mc_moves=True, displacement_sigma=1.0*unit.nanometers,
                    time_step=1.0*unit.femtoseconds, n_steps_per_iteration=500)
factory = AlchemicalFactory(softcore_beta=0.0, annihilate_electrostatics=False)
alchemical_phase = AlchemicalPhase(factory, sampler, analyzer)

# Initialize phase and run complex phase
state = ThermodynamicState(system=system_complex, temperature=300*unit.kelvin, pressure=1.0*unit.atmosphere)
sampler_state = SamplerState(positions=complex_inpcrd.positions)
topography = Topography(topology=complex_prmtop.topology, ligand_dsl='resname MOL')
restraint = Harmonic(k=1.0)
protocol = AlchemicalState.from_lambda_values(lambda_electrostatics=[1.0, 0.5, 0.0, 0.0, 0.0],
                                              lambda_sterics=[1.0, 1.0, 1.0, 0.5, 0.0])

phase.initialize(state, restraint, protocol, topography, storage)
phase.run(1000)
free_energy_complex = analyzer.analyze()

# Update configuration for phase 2, we can recycle old configuration since we are going to reinitialize the AlchemicalPhase, or we can instantiate a whole new AlchemicalPhase object.
analyzer.storage = Storage('path/to/solvent_file.nc')

# Initialize phase and run complex phase
state.system = ThermodynamicState(system=system_solvent, temperature=300*unit.kelvin, pressure=1.0*unit.atmosphere)
sampler_state = SamplerState(positions=solvent_inpcrd.positions)
topography = Topography(topology=complex_prmtop.topology, ligand_dsl='resname MOL')
restraint = None

phase.initialize(state, restraint, protocol, topography, storage)
phase.run(1000)
free_energy_solvent = analyzer.analyze()

print(free_energy_solvent - free_energy_complex)

Technically, AlchemicalPhase doesn't care about the internal implementation, so as long as the Sampler object that we are passing has a method that takes two lists of ThermodynamicStates and SamplerStates, we are good. The sampler can take a list of different MCMCMoves or simply a list of different Integrators in the constructor. I wouldn't constrain this if we can avoid it.

Lnaden commented 8 years ago

Massive API overhaul that will be part of YANK 1.0 release, but not the preview

pgrinaway commented 8 years ago

The expanded state(s) are not passed to Samplers since they are not integrated,

But in the case of SAMS, aren't we integrating both? The MCMC sampler targets p(x, lambda).

andrrizzi commented 8 years ago

But in the case of SAMS, aren't we integrating both?

Sorry for the confusion, by "expanded state" here I mean the system with an enlarged cutoff that we use to reweight our samples to correct for anisotropic long-range dispersion interactions.

pgrinaway commented 8 years ago

Oh, sorry, totally confused that with the expanded ensemble concept. Nevermind then...

andrrizzi commented 7 years ago

This was implemented in several different PRs here and in openmmtools that were released with YANK 0.16.

choderalab / yank