restructure - Githubissues

I think it's time, for the sake of scaling this up to more systematic experiments, to restructure this project.

The primary challenge I realized when designing experiments and structuring the current project is that, there are too many parts in this streamline that we want to model. Specifically:

molecule object (rdkit.Molecule, openeye.GraphMol, or openforcefield.Molecule)

↓

graph object (dgl.Graph or dgl.HeteroGraph) the object that contains the information regarding the identities / attributes of the atoms and how they are connected.

↓

parametrized graph object (still dgl.Graph or dgl.HeteroGraph) the object that contains all the parameters necessary for MM-like energy evaluation, namely ks and eqs and so forth

↓

energy (scalar tensor or a list of scalar tensor) the end object in both fitting and inference

There are problems with each of these objects, and we can argue that there are still downstream objects that we can have, namely those needed for openmm simulations.

But first, here's how I picture to structure the repo:

espaloma/ core code for graph nets-powered force field fitting.
- graph.py graph abstraction of molecules.
- parametrized_graph.py molecular graph with all the parameters needed for energy evaluation
- net.py interface for neural networks that operates on espaloma.graph
- mm/ helper functions to calculate MM energy
scripts/ experiment scripts
- typing/ experiment scripts to reproduce atom-typing
- mm_fitting/ experiment scripts to fit espaloma potentials to Molecular Mechanics
- qm_fitting/ experiment scripts to fit espaloma potentials to Quantum Mechanics -class_i/
  - class_ii/ with the inclusion of coupling and higher-order terms
- deployment/ experiment to provide interface to openmm.system objects and run actual MD.
devtools development tools

The following functions are needed to allow the aforementioned streamline in a cleaner manner:

espaloma.graph.from_oemol and friends: read molecules into graphs
espaloma.parameterized_graph.from_forcefield(forcefield='gaff2') parametrize a mol graph using a legacy forcefield. we will likely need to port from other repos for these implementations, or import them.
`espaloma.parametrized_graph.from_nn()' parametrize a mol from neural network models that could be trained.
espalmoa.mm.energy(g, x)evaluate energy from the parametrized molecule (however it's parametrized) and its geometry ( lots of helper functions are needed, of course, and test coverage would ideally ensure it's consistency with OpenMM although that might be tricky for especially nonbondned terms.

The following are the trickiest choices that I would like to have a discussion here:

ways to structure graph currently this is done by having both graph, heterograph and hierarchical_graph objects as input. I find this a bit ugly. I suggest allowing only one kind of graph as the input of either NNs or legacy force fields, and put whatever needed to express the relationships between graph entities as part of the model. Note, however, that we would need tricks to make sure that this part of the modeling is only executed exactly once during training.
ways to output energies without any sum functions, one molecule would have separated bond, angle, and nonbonded energies all with different dimensions. this becomes even more complicated when you have a batch with multiple molecules. I think it's critical that we find a simple way to output energies
ways to have a general enough espaloma.net object to enable future expansion now on the representation level we already have a bunch of ideas: atom-level message-passing only, hierarchical message-passing, or somewhat fancier version of it that I proposed to use an RNN to encode the walks with different length (corresponding to different levels in the hierarchy). If we were to limit the input graph to be universal, how are we going to develop the net module so that it can be not too much of a headache to express these ideas.

Nice, thanks for these thoughtful observations and proposals, @yuanqing-wang ! I mulled this over yesterday afternoon and this morning, and largely agree with this. We want to separate concerns in a way that lets us do the comparisons we plan as painlessly as possible, and it's tricky because we want to make comparisons at a few different levels.

I would draw the lines between the tasks slightly differently, and in particular I would emphasize the distinctions between the molecule graph, the graph you perform message-passing on, and the factor graph representing the potential energy function.

I would start by stating the overall task in a way that doesn't commit to a specific way to solve it.

f: molecule --> potential_energy_fxn

where f is purely a function of the molecular graph, and may incorporate any of the choices we want to compare in this study ("reference MM force field", "fingerprints and random forests", "graph-nets", ...), and potential_energy_fxn : coordinates --> energy is a function purely of the geometry (but which may have internal attributes we want to inspect or export).

To implement f using a graph-net, you may further split this up: f(molecule) = read_out_net(graph_net(derive_graph(molecule))), where:

derive_graph : molecule --> message-passing-friendly graph -- given a molecular graph (nodes = atoms, edges = bonds), construct another graph where message-passing is a good idea. For example, you may return exactly the same structure as the input molecular graph, just with some features written on the atoms and bonds. (We're also interested in more elaborate options at this step.)
graph_net : message-passing-friendly graph --> node-, edge-, and global- attributes -- implements the "learned chemical perception" part of the model, by writing vector attributes on each element of the graph, reflecting whatever "chemical context" is needed to assign parameters. (I think this is the only part that really needs to be in dgl or similar.)
readout_net : node-, edge-, and global- attributes --> potential_energy_fxn -- constructs a potential energy function from these vectors. For example (as is already done here), we pre-specify the "skeleton" of a potential energy function as a factor graph (by defining which terms will be present, which terms will be "frozen" vs. "learnable", and what parameters the learnable terms will accept), and then use a parameterized function to map node attributes to (sigma, epsilon, charge) parameters on each atom, appropriate pooling functions map node attributes for pairs, triples, and quadruples of atoms into fixed-length bond-, angle-, and torsion-parameters. (Later, we may also want to make parameterized decisions at this step of which terms to include in potential_energy_fxn, rather than pre-specifying the "skeleton". Initially, I think it makes sense to try sticking to a pre-specified skeleton, and hope the model outputs force-constants near 0 for unneeded terms.)

To implement f using the other approaches that we need for control experiments (e.g., to assess how much mileage we're getting from more or less elaborate choices of derive_graph or graph_net, you may split it up differently, e.g. f(molecule) = readout_net(fingerprint(molecule)), where

fingerprint: molecule --> atom-centered fingerprints
readout_net : atom-centered fingerprints --> potential_energy_fxn

For different tasks, we may be interested in doing different things with the output potential_energy_fxn.

For the energy-regression task, we may check that when we evaluate the function on snapshots it agrees with some other way to compute the energy of those snapshots.
For the interaction-typing task, we may introduce an additional step in our implementation of f (to predict a discrete class label for each interaction term, or for each atom), and check that we match a reference force-field's assignments of those class labels.
For validation, we may run some steps of local energy minimization with potential_energy_fxn and check that we land near where we expect.
For applications, we may export a representation of potential_energy_fxn for execution in an MD engine.

Thanks for fleshing this out! @maxentile I think it's a lot clearer to further distinct the graphs into these stages.

I would suggest that we look more closely at potential_energy_fn. We might benefit from yet another intermediate layer named something like parametrized_graph. I think it's not super straightforward how to jump directly from readout to a function object that takes in coordinates as input and outputs energy.

Moreover, we could compare parametrized_graph objects between methods. Also it would make it a lot easier to batch etc.

I would suggest that we look more closely at potential_energy_fn. We might benefit from yet another intermediate layer named something like parametrized_graph. I think it's not super straightforward how to jump directly from readout to a function object that takes in coordinates as input and outputs energy.

Agreed -- there's a lot going on at that step!

Perhaps we can narrow down by listing the types of comparisons involving this step that we need to support, and the types of comparisons we may eventually want to support.

Required comparisons (based on current version of paper outline):

"Class I" vs. "Class II" -- Generate a factor graph that contains "all class I interaction terms" for a molecule, vs. one that contains "all class II interaction terms" for a molecule. Fixing a functional form for each term appearing in this factor graph, the "readout" task is to "write a vector of required parameters on each factor node."

To support this, maybe we have two subclasses of PotentialEnergy -- PotentialEnergyClass1, PotentialEnergyClass2 -- which contain differently structured factor graphs as instance attributes. Perhaps these factor graphs are initialized from a molecule to have all possible terms present but with default values for each term's parameters (e.g. force constants of 0 for all harmonic_*_force factors). A learned parameter-assignment function overwrites some of these parameters. These classes should support both autodiff-friendly computation of energies / forces on batches of geometries, as well as export to an OpenMM-friendly representation.

Speculative comparisons:

Learned atom representations vs. learned factor representations. A question I'm personally interested in is: should message-passing-friendly graph include nodes corresponding to interaction-terms (whose learned representations are formed by message-passing) or only include nodes corresponding to atoms (whose learned representations are pooled in an appropriate way at the last step)? This is analogous to the "atom-typing" vs. "direct chemical perception" distinction emphasized in development of smirnoff.
Standard functional forms vs. polynomial or neuralized functional forms for each interaction term. How much are we limited by the logical structure of a MM model (regardless of how flexible the terms are) vs. how much are we limited by the choice of functional form (regardless of how flexible the parameter-assignment process is) vs. how much are we limited by parameter-assignment.

Rather than hard-code Class I and Class II, how about having a ParametrizedGraph object that can host both. And supply term in I and II as methods of the object. We can of course have presets as flags saying something like,

if self.terms == 'class-i':
    self.terms = [getattr(self, x) for x in ['bond', 'angle', 'non_bonded']

Learned atom representations vs. learned factor representations.

I agree that we should definitely compare this. But just in terms of how to engineer it: the ParametrizedGraph object resulted from message-passing step should be parameters on the factor graph anyway. The difference you're mentioning, if I understood correctly, is just the architecture choice of parametrizing factor graphs, be it pooling after MP, or hierarchical MP or something else.

Rather than hard-code Class I and Class II, how about having a ParametrizedGraph object that can host both. And supply term in I and II as methods of the object. We can of course have presets as flags saying something like,

if self.terms == 'class-i': self.terms = [getattr(self, x) for x in ['bond', 'angle', 'non_bonded']

Both options are basically the same for a single binary comparison.

In one case, there's a single class ParameterizedGraph, whose method definitions etc. each contain two branches, selected based on whether the object has a flag or not. In the other case there are two sub-classes of ParameterizedGraph, whose method definitions etc. contain one or the other branch.

I prefer to have two sub-classes, rather than having if-else branches that check a string flag to find out what kind of object the ParameterizedGraph really is. I think it doesn't make too much difference if there are only two variants of ParameterizedGraph we'll need to consider in the lifespan of the code.

However, if we're entertaining the possibility of more than two variants of ParameterizedGraph (e.g. allowing different parameterization of each factor), I think it will become increasingly unwieldy to have if/else branches to tell which parts of which method definitions apply based on which flags are present.

In general, when we find ourselves branching on some string that says what class an object is, that's often a hint that we'd be better off with sub-classes. (This page has a helpful comparison of these two options: https://refactoring.guru/replace-conditional-with-polymorphism )

choderalab / espaloma

restructure #9