choderalab / espaloma

Extensible Surrogate Potential of Ab initio Learned and Optimized by Message-passing Algorithm 🍹https://arxiv.org/abs/2010.01196
https://docs.espaloma.org/en/latest/
MIT License

Choice of regression target? #2

Open maxentile opened 4 years ago

maxentile commented 4 years ago

Currently the regression target is the total energy of each (molecule, configuration) pair, including the prediction of a geometry-independent per-molecule offset and prediction of geometry-dependent "strain" energy. However, for the QCArchive subset @yuanqing-wang is looking at, the variation of the per-molecule offsets initially appears much larger in magnitude than the conformation-dependent variation within a molecule's collection of snapshots.

Should we do something to decompose the variance into these two components, i.e. (1) predict the constant offset for each molecule, and (2) assuming away the constant offset, predict geometry-dependent strain energies for a given molecule? (To target (1), we can assume away any geometry dependence and try to predict just the energy of a molecule's global-minimum snapshot from its topology. To target (2), we can assume away any constant offset and try to minimize the standard deviation of the residuals.)
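Concretely, something like this is what I have in mind for reporting the two components separately (a rough sketch only, not code from this repo; `energies_by_molecule` is a placeholder for however we end up grouping snapshot energies per molecule):

```python
import numpy as np

def decompose_energy_variance(energies_by_molecule):
    """energies_by_molecule: one 1-D array of snapshot energies per molecule."""
    # (1) per-molecule constant offsets (here, just each molecule's mean energy)
    offsets = np.array([e.mean() for e in energies_by_molecule])
    # (2) conformation-dependent residuals once the per-molecule offset is removed
    residuals = np.concatenate([e - e.mean() for e in energies_by_molecule])
    return {
        "offset_variance": offsets.var(),    # between-molecule component
        "strain_variance": residuals.var(),  # within-molecule component
    }
```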

Also, the energy prediction currently does not include an electrostatic contribution. Should the regression target be something other than total energy? (Initially, it seems reasonable to target the valence contributions, for example by targeting the QM total energy minus an MM-predicted nonbonded contribution, where the MM prediction uses Parsley's partial charges, sigmas, epsilons, combining rules, and exceptions.)

jchodera commented 4 years ago

Our target shouldn't care about the molecule-dependent offset, should it?

We also can't decompose QM into valence and electrostatics easily (without SAPT-like methods, which can also be problematic).

maxentile commented 4 years ago

Our target shouldn't care about the molecule-dependent offset, should it?

I guess it depends what the goal is. For modeling the conformational distribution of a given molecule, any constant offset of the energy is of course irrelevant. For estimating a logZ, a constant offset is relevant. (For estimating logZ differences some constant offsets become irrelevant again.)
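To make the offset bookkeeping explicit (just restating the standard identity, with $c$ a per-molecule constant added to the potential):

$$
U'(x) = U(x) + c
\;\Rightarrow\;
Z' = \int e^{-\beta\,(U(x)+c)}\,dx = e^{-\beta c}\,Z
\;\Rightarrow\;
\log Z' = \log Z - \beta c ,
$$

so the constant shifts $\log Z$ itself, but cancels from any difference $\log Z_A - \log Z_B$ in which both terms carry the same constant.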

I think in this project the priority should be the conformation-dependent part, as the offset is not always needed, is not really modeled in MM, and can be obtained by other means if needed.

At least, I would like to separate our reported regression errors into those two tasks, rather than treating both as a single task.

We also can't decompose QM into valence and electrostatics easily (without SAPT-like methods, which can also be problematic).

Sorry, I didn't mean to suggest that QM_total minus MM_nonbonded was a quantity we should try to get by decomposing the results of a QM calculation.

Instead I was suggesting that we "freeze" all the parameters of the MM nonbonded model that we plan to use, and fit the MM valence terms to the residual. (The QM energy doesn't decompose into valence + electrostatic + vdW contributions, but the MM model does.)
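As a sketch of what I mean (illustrative only, not code from this repo; assumes a recent OpenMM, with `system` already built from the frozen nonbonded model, and `qm_energies` / `snapshots` standing in for whatever dataset we use):

```python
import numpy as np
import openmm
from openmm import unit

def mm_nonbonded_energy(system, xyz_nm):
    """Potential energy of only the NonbondedForce terms for one snapshot (kcal/mol)."""
    for force in system.getForces():
        # put the (frozen) nonbonded terms in force group 1, everything else in group 0
        force.setForceGroup(1 if isinstance(force, openmm.NonbondedForce) else 0)
    context = openmm.Context(system, openmm.VerletIntegrator(1.0 * unit.femtosecond))
    context.setPositions(xyz_nm * unit.nanometer)
    state = context.getState(getEnergy=True, groups={1})
    return state.getPotentialEnergy().value_in_unit(unit.kilocalorie_per_mole)

# valence-only targets: QM total energy minus the frozen MM nonbonded contribution
# residual_targets = qm_energies - np.array(
#     [mm_nonbonded_energy(system, xyz) for xyz in snapshots])
```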

A modeling reason to consider doing this is that the LJ parameters may have important information about condensed-phase or intermolecular behavior "baked in" that we (1) don't expect to be able to infer reliably from QM energies of isolated small molecules in vacuum, or (2) risk messing up by fitting to those same energies.

A numerical reason to consider doing this -- at least initially -- is that the nonbonded terms involve more aggressive exponents than the valence terms, and I think it is good to start with variants of an approach that are more likely to be numerically stable before proceeding to more complete but more challenging variants. (Looking at reports @yuanqing-wang has generated from initial experiments that included LJ but not electrostatics in a model for total energy, numerical stability does seem to be a relevant concern here.)

jchodera commented 4 years ago

A numerical reason to consider doing this -- at least initially -- is that the nonbonded terms involve more aggressive exponents than the valence terms, and I think it is good to start with variants of an approach that are more likely to be numerically stable before proceeding to more complete but more challenging variants.

Another widely-supported possibility that is less "aggressive" is to use exponential-6 (Buckingham) instead of LJ 12-6: https://en.wikipedia.org/wiki/Buckingham_potential
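For reference, the two functional forms side by side (the parameters here are purely illustrative, not taken from any force field):

```python
import numpy as np

def lennard_jones_12_6(r, sigma, epsilon):
    # steep r**-12 repulsion: the "aggressive" exponent at issue
    return 4.0 * epsilon * ((sigma / r) ** 12 - (sigma / r) ** 6)

def buckingham_exp_6(r, A, B, C):
    # exponential repulsion grows more gently than r**-12 at short range
    return A * np.exp(-B * r) - C / r ** 6
```

(One caveat to keep in mind: exp-6 turns over and diverges to negative infinity as r approaches 0, so in practice it usually needs some short-range guard.)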

maxentile commented 4 years ago

In addition to solidifying the choice of which quantities we want to regress on (relative potential energy vs. relative potential energy minus certain nonbonded terms), I think we need to narrow down the collection of molecules, the way the snapshots are generated, and the way the target energies are computed.

I think so far @yuanqing-wang has mostly looked at molecules in the ANI dataset (very off-equilibrium, but with snapshots further filtered by an energy threshold), the QM9 dataset (minimized), and samples from some QCArchive datasets (usually nearly minimized, sometimes generated by torsion scans).

For the positive control experiments where we seek to recover a molecular mechanics energy model, I think we can initially use one of the OpenFF coverage sets as the molecule collection, and generate (snapshot, energy) pairs by vacuum MD at a reasonable temperature (300 K? 500 K?) using the force field we wish to recover. I wouldn't expect to be able to "generalize across molecules" particularly well from a minimal coverage set, since the set may exercise each FF parameter only a few times, but once we're satisfied with training-set performance there we can move on to something bigger and nicer like the Roche or Bayer set.

To be a bit more explicit about a control experiment where I expect to be able to make the training error go nearly to 0, to check that the overall regression setup is workable:
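Something along these lines (a sketch only, not code from this repo; I'm assuming recent OpenMM / OpenFF toolkit APIs, and the molecule, force field version, temperature, and snapshot counts below are placeholders):

```python
import numpy as np
import openmm
from openmm import unit
from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

molecule = Molecule.from_smiles("c1ccccc1O")     # placeholder molecule from a coverage set
molecule.generate_conformers(n_conformers=1)
forcefield = ForceField("openff-1.0.0.offxml")   # the force field we wish to recover
system = forcefield.create_openmm_system(molecule.to_topology())

integrator = openmm.LangevinMiddleIntegrator(
    500 * unit.kelvin, 1.0 / unit.picosecond, 2.0 * unit.femtosecond)
context = openmm.Context(system, integrator)
context.setPositions(molecule.conformers[0].to_openmm())

snapshots, energies = [], []
for _ in range(1000):                            # placeholder number of snapshots per molecule
    integrator.step(500)                         # decorrelate a bit between saved frames
    state = context.getState(getPositions=True, getEnergy=True)
    snapshots.append(state.getPositions(asNumpy=True).value_in_unit(unit.nanometer))
    energies.append(state.getPotentialEnergy().value_in_unit(unit.kilocalorie_per_mole))

snapshots, energies = np.array(snapshots), np.array(energies)
# ...repeat over the coverage set, fit the model to (snapshots, energies),
# and check that the training error can be driven close to zero.
```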

jchodera commented 4 years ago

I doubt small coverage sets are going to be valuable because they sample chemical space very sparsely. There's really no way to "learn" from that kind of information.

I think the only reasonable approaches here are:

The process of generating data and training sounds good, though!

maxentile commented 4 years ago

These make sense, thanks!