NWChemEx / SimDE

Software development kit for the NWChemEx Community.
https://nwchemex.github.io/SimDE/
Apache License 2.0

Property types for optimization #135

Open keceli opened 10 months ago

keceli commented 10 months ago

PR Type

Brief Description

This PR introduces new property types:

  1. MoleculeFromMolecule: As the name suggests, it takes a molecule and returns a molecule. This could be used for different purposes. Here, I was mainly thinking of using it together with the AOEnergy property type, so that a module can return both optimized coordinates and an energy.
  2. Optimize<PT2Optimize, WithRespectTo>: A templated property type that mimics the [Derivative property type](https://github.com/NWChemEx/SimDE/blob/d3c6743cb145370d8ff3174bc3b1a66b33124b01/include/simde/derivative/derivative_pt.hpp#L50). I am not quite happy with the typedef name I chose (OptimizeCoordinates), so please let me know if you have a suggestion.

Not In Scope

PR Checklist

ryanmrichard commented 10 months ago

Every time a module is called it can only be run as a single property type. So if a module were to satisfy both AOEnergy and MoleculeToMolecule you'd end up doing something like:

auto e = mod.run_as<AOEnergy>(aos, mol);
// Would actually have to copy mod, since it'll be locked after the previous call
mod.change_input("basis set", aos); // Need to bind the AOs
auto opt_geom = mod.run_as<MoleculeToMolecule>(mol);

Memoization should take care of avoiding the need to redo the optimization. The need to treat the AOs as an implicit input in the second call is messy. That said, I think the biggest surprise with this design is that the first call would not return the energy for mol, but for opt_geom. I guess you could call the MoleculeToMolecule one first, then pass in opt_geom, but requiring a specific invocation order is weird too.

Alternatively perhaps you had envisioned that the module only satisfy MoleculeToMolecule and not expose the fact that it calls an Energy module. That reduces it to:

mm.change_input("energy module", "basis set", aos);
mod.change_submodule("energy", "energy module");
auto opt_geom = mod.run_as<MoleculeToMolecule>(mol);
auto e = mm.run_as<Energy>("energy module", geom);

This still has the same problems with mapping mol vs. opt_geom to e.

With respect to the Optimize<AOEnergy, Molecule> PT (should it actually be Nuclei?), everything you need is in a single call:

auto [e, opt_geom] = mod.run_as<Optimize<AOEnergy, Molecule>>(aos, mol, mol);

plus, since this is an Optimize property type (as opposed to AOEnergy) it's not surprising that the resulting energy is for opt_geom. FWIW we should be able to default the third argument to the second's value to simplify the call a bit. This also has the advantage of being able to express constrained optimization natively, i.e., the third argument can be a subset of the second. I'm not sure how one would do that with either of the other scenarios.
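To make the single-call idea concrete, here is a minimal, self-contained sketch. It does not use SimDE or PluginPlay: `Geometry`, `Indices`, `toy_energy`, and both `optimize` overloads are hypothetical stand-ins, the coordinates are one-dimensional, and the optimizer is a plain gradient descent with a finite-difference gradient. The only point is the shape of the call: energy and geometry come back together, the "with respect to" argument defaults to everything, and a subset expresses a constrained optimization.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are SimDE or PluginPlay types.
using Geometry  = std::vector<double>; // toy 1-D coordinates
using Indices   = std::vector<std::size_t>;
using EnergyFxn = std::function<double(const Geometry&)>;

// Toy energy: every coordinate sits in a harmonic well centered at 1.0.
inline double toy_energy(const Geometry& g) {
    double e = 0.0;
    for(double x : g) e += (x - 1.0) * (x - 1.0);
    return e;
}

// Minimizes `energy` by moving only the coordinates listed in `active`.
// Returns the final energy together with the geometry it belongs to.
inline std::pair<double, Geometry> optimize(const EnergyFxn& energy,
                                            Geometry geom,
                                            const Indices& active) {
    const double h = 1.0e-6, step = 0.1;
    for(int iter = 0; iter < 500; ++iter) {
        Geometry grad(geom.size(), 0.0);
        double max_grad = 0.0;
        for(std::size_t i : active) {
            assert(i < geom.size());
            Geometry plus = geom, minus = geom;
            plus[i] += h;
            minus[i] -= h;
            // Central finite difference for the i-th gradient component
            grad[i]  = (energy(plus) - energy(minus)) / (2.0 * h);
            max_grad = std::max(max_grad, std::abs(grad[i]));
        }
        if(max_grad < 1.0e-8) break; // converged
        for(std::size_t i : active) geom[i] -= step * grad[i];
    }
    return {energy(geom), geom};
}

// Plays the role of the defaulted third argument: optimize everything.
inline std::pair<double, Geometry> optimize(const EnergyFxn& energy,
                                            const Geometry& geom) {
    Indices all(geom.size());
    std::iota(all.begin(), all.end(), std::size_t{0});
    return optimize(energy, geom, all);
}
```

Calling `optimize(toy_energy, Geometry{0.0, 3.0})` relaxes both coordinates, whereas passing `Indices{0}` as the third argument leaves the second coordinate frozen at 3.0 and returns the energy of that constrained minimum.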

keceli commented 10 months ago

> Alternatively perhaps you had envisioned that the module only satisfy MoleculeToMolecule and not expose the fact that it calls an Energy module.

Yes, this is the approach I prefer. After getting the optimized geometry, one can get the initial or optimized energy from the cache.

> This still has the same problems with mapping mol vs. opt_geom to e.

I am not sure what the problem is here. If the user passes the initial molecule, they get the initial energy; if they pass the optimized molecule, they get the final energy. Maybe I am missing something.

> Optimize<AOEnergy, Molecule> PT (should it actually be Nuclei?),

I followed the Derivative property type that is already checked in. The same question applies there, and I thought Molecule might be preferable because it is more user friendly.

ryanmrichard commented 10 months ago

> > Alternatively perhaps you had envisioned that the module only satisfy MoleculeToMolecule and not expose the fact that it calls an Energy module.
>
> Yes, this is the approach I prefer. After getting the optimized geometry, one can get the initial or optimized energy from the cache.

Generally speaking it is usually better to be explicit. The property type is a contract. MoleculeFromMolecule (I realize I had the name wrong before) only guarantees that the module takes an input molecule and returns an output molecule. How it goes from the input to the output is up to the module. In particular, this frees it up for the module to delete intermediate data. Of course, PluginPlay allows you to go in and earmark data for archival, but that's an extra step, one that people will loathe when their 2 month CCSD(T) geometry optimization throws away the intermediate energies because they forgot to save them. If the energy is explicitly part of the PT, the module has no choice but to return it.

I'll also point out that at the moment the cache is really designed as a checkpoint/restart mechanism, not an on-demand database. Those two scenarios are different use cases and require different considerations.

> > This still has the same problems with mapping mol vs. opt_geom to e.
>
> I am not sure what the problem is here. If the user passes the initial molecule, they get the initial energy; if they pass the optimized molecule, they get the final energy. Maybe I am missing something.

So you're after an API like:

auto e = mod.run_as<AOEnergy>(aos, mol);
auto opt_mol = mod.run_as<MoleculeFromMolecule>(mol);
auto e_opt = mod.run_as<AOEnergy>(aos, opt_mol);

So the first call would presumably optimize mol and return its original energy. The next two calls would then be memoized. Without knowing the details of the module (something we're trying to avoid the user needing to know), most people are going to wonder why the first line takes orders of magnitude more time than a single-point evaluation. Again, explicit is better than implicit, so having the first call somehow signal to the user that it's doing an optimization is IMHO better.
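The hidden cost can be illustrated with a small, self-contained sketch. `ImplicitOptimizer` below is a hypothetical toy class, not PluginPlay's module or cache machinery; its "energy" is a harmonic well, its optimizer is 50 fixed gradient-descent steps, and its memoization is just two `std::map`s. It counts single-point evaluations so the surprise is visible: the call that looks like a single point pays for the whole optimization, and only the later calls are cheap.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using Geometry = std::vector<double>; // toy 1-D coordinates (hypothetical)

// Hypothetical module following the implicit design: asking for an energy
// silently triggers a full optimization and memoizes the results.
class ImplicitOptimizer {
public:
    // Looks like a single-point energy request.
    double energy(const Geometry& g) {
        if(!m_energy.count(g)) optimize_from(g);
        return m_energy.at(g);
    }

    // Returns the optimized geometry reached from `g`.
    Geometry optimized(const Geometry& g) {
        if(!m_optimized.count(g)) optimize_from(g);
        return m_optimized.at(g);
    }

    // Number of single-point evaluations actually performed so far.
    std::size_t n_single_points() const { return m_count; }

private:
    // Toy energy: harmonic wells centered at 1.0.
    double single_point(const Geometry& g) {
        assert(!g.empty());
        ++m_count;
        double e = 0.0;
        for(double x : g) e += (x - 1.0) * (x - 1.0);
        return e;
    }

    void optimize_from(const Geometry& start) {
        double e    = single_point(start);
        m_energy[start] = e;
        Geometry g  = start;
        for(int iter = 0; iter < 50; ++iter) {
            // Steepest descent with the analytic gradient 2 * (x - 1)
            for(double& x : g) x -= 0.2 * (x - 1.0);
            e = single_point(g);
        }
        m_optimized[start] = g;
        m_energy[g]        = e;
    }

    std::map<Geometry, double> m_energy;      // memoized energies
    std::map<Geometry, Geometry> m_optimized; // memoized optimizations
    std::size_t m_count = 0;
};
```

In this toy, the first `energy(mol)` call performs 51 single-point evaluations, while the subsequent `optimized(mol)` and `energy(opt_mol)` calls perform none. Nothing in the name `energy` warns the caller about that.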

> > Optimize<AOEnergy, Molecule> PT (should it actually be Nuclei?),
>
> I followed the Derivative property type that is already checked in. The same question applies there, and I thought Molecule might be preferable because it is more user friendly.

It's also potentially misleading. Nuclei are just the nuclei. Molecule also includes electrons. If you tell a user that you are optimizing with respect to "nuclei plus electrons" they could conceivably wonder if that means you're also optimizing somehow with respect to the electrons (for example by sampling electronic states). Nuclei is more explicit (though I guess people could conceivably wonder if you're optimizing the masses/atomic numbers; so maybe it should just be with respect to the PointSet?).
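A rough sketch of the distinction, using deliberately simplified and hypothetical structs rather than the classes SimDE actually uses: a `PointSet` carries positions only, `Nuclei` add atomic numbers and masses, and a `Molecule` additionally carries electronic information (reduced here to charge and multiplicity). Optimizing with respect to a narrower type makes it clearer which degrees of freedom may change.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-ins for illustration only.
struct Point {
    std::array<double, 3> xyz;
};
using PointSet = std::vector<Point>; // positions and nothing else

struct Nucleus {
    Point position;
    std::size_t atomic_number;
    double mass;
};
using Nuclei = std::vector<Nucleus>; // positions plus Z and mass

struct Molecule {
    Nuclei nuclei;
    int charge;            // together with multiplicity, a stand-in for
    unsigned multiplicity; // the electronic part of the molecule
};

// Extracts only the positions, i.e., what a geometry optimization varies.
inline PointSet positions(const Nuclei& nuclei) {
    PointSet pts;
    for(const auto& n : nuclei) pts.push_back(n.position);
    return pts;
}

inline PointSet positions(const Molecule& mol) {
    return positions(mol.nuclei);
}

// Applies new positions while leaving atomic numbers and masses untouched.
inline Nuclei with_positions(Nuclei nuclei, const PointSet& new_positions) {
    assert(nuclei.size() == new_positions.size());
    for(std::size_t i = 0; i < nuclei.size(); ++i)
        nuclei[i].position = new_positions[i];
    return nuclei;
}
```

Under these assumptions, an optimizer whose "with respect to" type is `PointSet` can only ever hand back new positions, so there is no room to wonder whether masses, atomic numbers, or the electronic state were varied.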

I realize this isn't how other electronic structure codes function, so there may be some surprise for users who transition; however, I'd argue (as I did above) that our approach is more explicit and better maps to nature. FWIW, many other codes force you to explicitly disambiguate their inputs too; it usually just happens via an option instead of by manipulating an input object. The goal of an object-oriented design is to express intent with objects rather than rely on non-local state, which is what options tend to be.