choderalab / espaloma

Extensible Surrogate Potential of Ab initio Learned and Optimized by Message-passing Algorithm 🍹https://arxiv.org/abs/2010.01196
https://docs.espaloma.org/en/latest/
MIT License

Choice of regression target? #2

Open maxentile opened 4 years ago

maxentile commented 4 years ago

Currently the regression target is the total energy of each (molecule, configuration) pair, including the prediction of a geometry-independent per-molecule offset and prediction of geometry-dependent "strain" energy. However, for the QCArchive subset @yuanqing-wang is looking at, the variation of the per-molecule offsets initially appears much larger in magnitude than the conformation-dependent variation within a molecule's collection of snapshots.

Should we do something to decompose the variance into these two components, i.e. (1) predict the constant offset for each molecule, and (2) assuming away the constant offset, predict geometry-dependent strain energies for a given molecule? (To target (1), we can assume away any geometry dependence and try to predict just the energy of a molecule's global-minimum snapshot from its topology. To target (2), we can assume away any constant offset and try to minimize the standard deviation of the residuals.)
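Concretely, something like this is what I have in mind for reporting the two components separately (a rough sketch only, not code from this repo; `energies_by_molecule` is a placeholder for however we end up grouping snapshot energies per molecule):

```python
import numpy as np

def decompose_energy_variance(energies_by_molecule):
    """energies_by_molecule: one 1-D array of snapshot energies per molecule."""
    # (1) per-molecule constant offsets (here, just each molecule's mean energy)
    offsets = np.array([e.mean() for e in energies_by_molecule])
    # (2) conformation-dependent residuals once the per-molecule offset is removed
    residuals = np.concatenate([e - e.mean() for e in energies_by_molecule])
    return {
        "offset_variance": offsets.var(),    # between-molecule component
        "strain_variance": residuals.var(),  # within-molecule component
    }
```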

Also, the energy prediction currently does not include an electrostatic contribution. Should the regression target be something other than total energy? (Initially, it seems reasonable to target the valence contributions, for example by targeting the QM total energy minus an MM-predicted nonbonded contribution, where the MM prediction uses Parsley's partial charges, sigmas, epsilons, combining rules, and exceptions.)

jchodera commented 4 years ago

Our target shouldn't care about the molecule-dependent offset, should it?

We also can't decompose QM into valence and electrostatics easily (without SAPT-like methods, which can also be problematic).

maxentile commented 4 years ago

Our target shouldn't care about the molecule-dependent offset, should it?

I guess it depends what the goal is. For modeling the conformational distribution of a given molecule, any constant offset of the energy is of course irrelevant. For estimating a logZ, a constant offset is relevant. (For estimating logZ differences some constant offsets become irrelevant again.)
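To make the offset bookkeeping explicit (just restating the standard identity, with $c$ a per-molecule constant added to the potential):

$$
U'(x) = U(x) + c
\;\Rightarrow\;
Z' = \int e^{-\beta\,(U(x)+c)}\,dx = e^{-\beta c}\,Z
\;\Rightarrow\;
\log Z' = \log Z - \beta c ,
$$

so the constant shifts $\log Z$ itself, but cancels from any difference $\log Z_A - \log Z_B$ in which both terms carry the same constant.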

I think in this project the priority should be the conformation-dependent part, as the offset is not always needed, is not really modeled in MM, and can be obtained by other means if needed.

At least, I would like to separate our reported regression errors into those two tasks, rather than treating both as a single task.

We also can't decompose QM into valence and electrostatics easily (without SAPT-like methods, which can also be problematic).

Sorry, I didn't mean to suggest that QM_total minus MM_nonbonded was a quantity we should try to get by decomposing the results of a QM calculation.

Instead I was suggesting that we "freeze" all the parameters of the MM nonbonded model that we plan to use, and fit the MM valence terms to the residual. (The QM energy doesn't decompose into valence + electrostatic + vdW contributions, but the MM model does.)
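As a sketch of what I mean (illustrative only, not code from this repo; assumes a recent OpenMM, with `system` already built from the frozen nonbonded model, and `qm_energies` / `snapshots` standing in for whatever dataset we use):

```python
import numpy as np
import openmm
from openmm import unit

def mm_nonbonded_energy(system, xyz_nm):
    """Potential energy of only the NonbondedForce terms for one snapshot (kcal/mol)."""
    for force in system.getForces():
        # put the (frozen) nonbonded terms in force group 1, everything else in group 0
        force.setForceGroup(1 if isinstance(force, openmm.NonbondedForce) else 0)
    context = openmm.Context(system, openmm.VerletIntegrator(1.0 * unit.femtosecond))
    context.setPositions(xyz_nm * unit.nanometer)
    state = context.getState(getEnergy=True, groups={1})
    return state.getPotentialEnergy().value_in_unit(unit.kilocalorie_per_mole)

# valence-only targets: QM total energy minus the frozen MM nonbonded contribution
# residual_targets = qm_energies - np.array(
#     [mm_nonbonded_energy(system, xyz) for xyz in snapshots])
```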

A modeling reason to consider doing this is that the LJ parameters may have important information about condensed-phase or intermolecular behavior "baked in" that we (1) don't expect to be able to infer reliably from QM energies of isolated small molecules in vacuum, or (2) risk messing up by fitting to those same energies.

A numerical reason to consider doing this -- at least initially -- is that the nonbonded terms involve more aggressive exponents than the valence terms, and I think it is good to start with variants of an approach that are more likely to be numerically stable before proceeding to more complete but more challenging variants. (Looking at reports @yuanqing-wang has generated from initial experiments that included LJ but not electrostatics in a model for total energy, numerical stability does seem to be a relevant concern here.)

jchodera commented 4 years ago

A numerical reason to consider doing this -- at least initially -- is that the nonbonded terms involve more aggressive exponents than the valence terms, and I think it is good to start with variants of an approach that are more likely to be numerically stable before proceeding to more complete but more challenging variants.

Another widely-supported possibility that is less "aggressive" is to use exponential-6 (Buckingham) instead of LJ 12-6: https://en.wikipedia.org/wiki/Buckingham_potential
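For reference, the two functional forms side by side (the parameters here are purely illustrative, not taken from any force field):

```python
import numpy as np

def lennard_jones_12_6(r, sigma, epsilon):
    # steep r**-12 repulsion: the "aggressive" exponent at issue
    return 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)

def buckingham_exp_6(r, A, B, C):
    # exponential repulsion grows more gently than r**-12 at short range
    return A * np.exp(-B * r) - C / r ** 6
```

(One caveat to keep in mind: exp-6 turns over and diverges to negative infinity as r approaches 0, so in practice it usually needs some short-range guard.)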

maxentile commented 4 years ago

In addition to solidifying the choice of which quantities we want to regress on (relative potential energy vs. relative potential energy minus certain nonbonded terms), I think we need to narrow down the collection of molecules, the way the snapshots are generated, and the way the target energies are computed.

I think so far @yuanqing-wang has mostly looked at molecules in the ANI dataset (very off-equilibrium, but with snapshots further filtered by an energy threshold), the QM9 dataset (minimized), and samples from some QCArchive datasets (usually nearly minimized, sometimes generated by torsion scans).

For the positive control experiments where we seek to recover a molecular mechanics energy model, I think we can initially use one of the OpenFF coverage sets as the molecule collection, and generate (snapshot, energy) pairs by vacuum MD at a reasonable temperature (300 K? 500 K?) using the force field we wish to recover. I wouldn't expect to be able to "generalize across molecules" particularly well from a minimal coverage set, since the set may exercise each FF parameter only a few times, but once we're satisfied with training-set performance there we can move on to something bigger and nicer like the Roche or Bayer set.

To be a bit more explicit about a control experiment where I expect to be able to make the training error go nearly to 0, to check that the overall regression setup is workable:
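Something along these lines (a sketch only, not code from this repo; I'm assuming recent OpenMM / OpenFF toolkit APIs, and the molecule, force field version, temperature, and snapshot counts below are placeholders):

```python
import numpy as np
import openmm
from openmm import unit
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

molecule = Molecule.from_smiles("c1ccccc1O")     # placeholder molecule from a coverage set
molecule.generate_conformers(n_conformers=1)
forcefield = ForceField("openff-1.0.0.offxml")   # the force field we wish to recover
system = forcefield.create_openmm_system(molecule.to_topology())

integrator = openmm.LangevinMiddleIntegrator(
    500 * unit.kelvin, 1.0 / unit.picosecond, 2.0 * unit.femtosecond)
context = openmm.Context(system, integrator)
context.setPositions(molecule.conformers[0].to_openmm())

snapshots, energies = [], []
for _ in range(1000):                            # placeholder number of snapshots per molecule
    integrator.step(500)                         # decorrelate a bit between saved frames
    state = context.getState(getPositions=True, getEnergy=True)
    snapshots.append(state.getPositions(asNumpy=True).value_in_unit(unit.nanometer))
    energies.append(state.getPotentialEnergy().value_in_unit(unit.kilocalorie_per_mole))

snapshots, energies = np.array(snapshots), np.array(energies)
# ...repeat over the coverage set, fit the model to (snapshots, energies),
# and check that the training error can be driven close to zero.
```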

jchodera commented 4 years ago

I doubt small coverage sets are going to be valuable because they sample chemical space very sparsely. There's really no way to "learn" from that kind of information.

I think the only reasonable approaches here are:

The process of generating data and training sounds good, though!

maxentile commented 4 years ago

These make sense, thanks!