choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/

Add checks for postprocessing operations #177

Open wiederm opened 6 days ago

wiederm commented 6 days ago
> Can we add a check operation here that will validate that the expected keys are present?

_Originally posted by @chrisiacovella in https://github.com/choderalab/modelforge/pull/169#discussion_r1661574225_
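
For context, the requested check could be as small as a guard along these lines (a minimal sketch; the function name and arguments are illustrative, not existing modelforge API):

```python
def check_expected_keys(outputs: dict, expected_keys: set) -> None:
    """Raise early if an operation did not produce the keys later steps rely on."""
    missing = expected_keys - set(outputs)
    if missing:
        raise KeyError(f"missing expected keys after postprocessing: {sorted(missing)}")
```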

chrisiacovella commented 5 days ago

I'm going to use this issue as a means to discuss ways we might modify the new approach to readout and postprocessing operations. We obviously want to maintain flexibility while not obfuscating the internals, losing documentation, or introducing new sources of error.

This is going to be long, but I'm trying to get lots of thoughts on paper for us to discuss.

Let us start by considering the processing steps defined in the .toml file:

```toml
processing_operation = [
  {
    in = ["E_i"],
    out = "E_i",
    step = "normalization",
    mean = "E_i_mean",
    stddev = "E_i_mean"  # note: this reproduces the committed error discussed below (stddev points at the mean's key)
  },
  {
    in = ["atomic_numbers", "atomic_subsystem_indices"],
    out = "ase",
    step = "calculate_ase"
  },
]
```

Right now, all of the .toml files look the same; the only difference is that some have the calculate_ase step and some do not. I think having the processing steps defined in the .toml file is very useful, as it makes it very clear what is being done and easier to toggle steps on and off. However, I'm not sure we need to explicitly define inputs and such.

I think it might be clearer to just provide bool triggers for functions, e.g.:

`processing_operation = { normalize_energy = true, calculate_ase = true }`

Right now, I can't think of any case where we could use another key for the mean and stddev, or any case where we would apply this to anything but the atomic energy E_i. In reality, do the stats store anything else that could be used in place of the mean and stddev of the atomic energies? Since this explicitly references the stats information, giving it another key would cause a failure, because the code looks up that dictionary directly. This mostly just introduces more chances for error (note the committed files actually have an error where stddev is set to the key for the mean, E_i_mean). The same applies to ASE: there isn't much need to allow something different here. If someone did modify the in field it would be wrong (and also wrong if the order were swapped).

And while we can argue a user might never want to modify this, presenting it in terms of internal dictionary keys adds an extra level of possible confusion (especially since something like E_i_mean is not part of the outputs dictionary but rather dataset_stats, and is defined elsewhere, not in models.py where this processing occurs). Hiding all of the internals seems like a better abstraction.

Especially for the processing part, I feel like if we want to do something different, we would still need to implement code anyway. The loops that process this only support normalization and calculate_ase; if we gave them any other operation, nothing would happen at this point.
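
To make the concern concrete, the loop described above is roughly this shape (an illustrative sketch, not the actual modelforge code; normalize and calculate_ase are passed in as stand-ins for the real implementations):

```python
def apply_processing(processing_operation, outputs, dataset_stats, normalize, calculate_ase):
    """Apply the configured processing steps in order; a sketch of the current pattern."""
    for op in processing_operation:
        if op["step"] == "normalization":
            outputs[op["out"]] = normalize(
                outputs[op["in"][0]], dataset_stats[op["mean"]], dataset_stats[op["stddev"]]
            )
        elif op["step"] == "calculate_ase":
            outputs[op["out"]] = calculate_ase(*(outputs[key] for key in op["in"]))
        # any other value of op["step"] silently does nothing -- exactly the
        # failure mode a check on op["step"] (and the other keys) would catch
    return outputs
```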

I do like the idea of clearly defining the names of the outputs, so we know how to access these parameters, e.g.:

`[{ step = "normalize_energy", output = "E_i" }, { step = "calculate_ase", output = "E_ase" }]`

So basically, I'm not sure what we specifically gain from this extra level of flexibility that couldn't be accomplished with a few simple bools or toggles.

Looking at the readout, right now there is:

```toml
readout_operation = [
  { step = "from_atom_to_molecule", mode = "sum", in = "E_i", index_key = "atomic_subsystem_indices", out = "E" },
  { step = "from_atom_to_molecule", mode = "sum", in = "ase", index_key = "atomic_subsystem_indices", out = "mse" },
]
```

The readout seems like it might actually benefit more from flexible operations, but I think it would be a lot clearer to have the standard operations already coded in and simply be able to toggle them on and off and set their output names.

Again, my concern here is that it could be confusing because one needs to know a lot about internal data representations and key names. So following the approach above:

```toml
readout_operation = [
  { step = "energy_of_molecule", out = "E_molecule" },
  { step = "self_energy_of_molecule", out = "E_mse" },
]
```

although E_mse might not be the best choice, since we don't want anyone thinking "mean squared error". We could probably be more descriptive here as well; say, E_molecule_self_energy is clearer.

There is probably a reasonably small subset of operations we’d think to calculate for readout, so it shouldn’t be too hard to enumerate those.

I understand there could be some cases where we would want some “new” readouts, but I also think any custom readouts we implement would be of general interest, and thus worth simply adding into the code.

I think we could easily allow custom operations here in a similar way to what you've done (toggled by adding some key like "custom_readout" to signal that it should be parsed), but I feel like trying to fully document how to write these custom functions could be more problematic than just adding a new function to the code.

I like the idea of making the functions general (e.g., from_atom_to_molecule) and just having some specific wrappers for, say, energy or charge.
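
As a sketch of that pattern (assuming PyTorch tensors; the function names and the E_i / atomic_subsystem_indices keys are illustrative, not the actual modelforge internals):

```python
import torch

def from_atom_to_molecule(per_atom: torch.Tensor, index: torch.Tensor, mode: str = "sum") -> torch.Tensor:
    """General reduction of a per-atom property to a per-molecule property."""
    if mode != "sum":
        raise NotImplementedError(f"unsupported mode: {mode}")
    n_molecules = int(index.max()) + 1
    per_molecule = torch.zeros(n_molecules, dtype=per_atom.dtype, device=per_atom.device)
    per_molecule.index_add_(0, index, per_atom)
    return per_molecule

def energy_of_molecule(outputs: dict) -> torch.Tensor:
    """Specific wrapper that fixes the internal keys, so users never touch them."""
    return from_atom_to_molecule(outputs["E_i"], outputs["atomic_subsystem_indices"].long())
```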

I suppose we could put these in a separate Python file like "readout.py" that contains all of the possible readout functions we have defined, so it is obvious what is available, how they are structured, and where to add more, with clear docstrings for each of them.

That last part, about docstrings, I think is a good thing: making this all flexible means we do not have nice human-readable docstrings describing what was done.

Defining readout operations that depend upon other operations (e.g., something that takes E_molecule) might be trickier with pre-defined functions, as we would need to know the dictionary key defined for the output, but that could be handled with something like an optional argument to the definition.

```toml
readout_operation = [
  { step = "energy_of_molecule", out = "E_molecule" },
  { step = "self_energy_of_molecule", out = "E_mse" },
  { step = "sum_energy_and_self_energy", args = ["E_molecule", "E_mse"], out = "E_summed" },
]
```

We could just have an internal dictionary that associates a user-defined "out" with an internal name, e.g. `{'E_mse': 'E_molecule_self_energy'}`, so one does not need to define arguments; or we do not allow user-specified output names at all and instead use predefined names.

I actually think I prefer predefined names as it makes it easier for us to document.

wiederm commented 5 days ago

I think it is a great idea to refactor the postprocessing module. The PR that introduced this (PR 169) was mainly concerned with partitioning the postprocessing from the actual models and moving it outside the core model. I think it is clear that there is much room to improve this, make it more general, and safeguard the operations.

The current structure of a given model is as outlined here: https://github.com/choderalab/modelforge/blob/ref-docs/docs/image/overview_network_v2.png

The postprocessing model gets a dictionary of per-atom properties as input and should perform operations on these to obtain model-specific (and user-specific) output properties. The obvious operations are reductions (obtaining per-molecule properties from per-atom properties). Which reductions are needed depends on the properties a neural network returns and the properties the user requests.
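
Since the valid reductions depend on what the network produced, a cheap consistency check could compare the two up front (a sketch; the step names and required-input sets are hypothetical, not the actual modelforge API):

```python
# Hypothetical map from readout step name to the per-atom keys it consumes.
REQUIRED_INPUTS = {
    "energy_of_molecule": {"E_i"},
    "self_energy_of_molecule": {"ase"},
}

def validate_readout_request(requested_steps, per_atom_outputs):
    """Fail fast if a requested readout needs per-atom properties the model did not return."""
    for step in requested_steps:
        if step not in REQUIRED_INPUTS:
            raise ValueError(f"unknown readout step: {step!r}")
        missing = REQUIRED_INPUTS[step] - set(per_atom_outputs)
        if missing:
            raise ValueError(f"readout {step!r} requires {sorted(missing)}, which the model did not produce")
```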

I like that version of the toml file (we might want to improve the variable names):


```toml
processing_operation = { normalize_energy = true, calculate_ase = true }

readout_operation = [
  { step = "energy_of_molecule", out = "E_molecule" },
  { step = "self_energy_of_molecule", out = "E_mse" },
]
```

It will hide much of the complexity. I think we can avoid the complexity of args and dependent operations for now. Even for more complex operations (like calculating the electronic energies) we can predefine the keys for now.

chrisiacovella commented 2 days ago

I would say that if we are going to have dependent operations, we should just create new functions in the code that wrap the existing functions.

The variable out could be an array (with order specified by the function itself).

So if we had a dependent function total_molecule_energy that adds self energy to the energy of the molecule, it could look something like this:

```toml
readout_operation = [
  { step = "total_molecule_energy", out = ["E_molecule", "E_mse", "E_total"] },
]
```
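
In code, that wrapper might look like the following sketch (building on the illustrative functions above; self_energy_of_molecule is assumed to exist analogously to energy_of_molecule):

```python
def total_molecule_energy(outputs: dict) -> dict:
    """Dependent readout implemented by wrapping the existing functions."""
    e_molecule = energy_of_molecule(outputs)
    e_mse = self_energy_of_molecule(outputs)
    # the order of the returned keys matches the 'out' array in the toml entry
    return {"E_molecule": e_molecule, "E_mse": e_mse, "E_total": e_molecule + e_mse}
```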

I would say ultimately, if we want more flexibility in our readout definitions, they should just be encoded as a user-defined Python module (rather than trying to effectively implement our own scripting processor).