choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Atomic self energies from dataset #73

Closed wiederm closed 4 months ago

wiederm commented 4 months ago

We currently either accept atomic self-energies as a dictionary (atomic number: reference energy) or calculate them. However, for many of the datasets the atomic energies are provided in the h5 files. Ideally, the self-energies from the dataset would be passed to the model without user intervention and used to offset the energies. Additionally, we need to expose the atomic self-energies from the dataset so that users can retrieve them as a dictionary.
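To make "offset the energies" concrete, here is a minimal sketch of the operation, assuming the self-energies are available as a dictionary keyed by atomic number. The function name `offset_energy` and the numeric values are hypothetical placeholders, not modelforge code or real reference energies.

```python
from typing import Dict, Sequence


def offset_energy(
    total_energy: float,
    atomic_numbers: Sequence[int],
    self_energies: Dict[int, float],
) -> float:
    """Subtract the summed per-atom self-energies from a total energy."""
    # Each atom contributes the self-energy of its element; an element that is
    # missing from the dictionary raises a KeyError rather than being ignored.
    return total_energy - sum(self_energies[z] for z in atomic_numbers)


# Placeholder values for illustration only (not real reference energies).
self_energies = {1: -0.5, 8: -75.0}
water = [8, 1, 1]  # atomic numbers for O, H, H

print(offset_energy(-76.4, water, self_energies))
```

Failing loudly on a missing element seems preferable here, since a silently skipped atom would shift the training target without any warning.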

chrisiacovella commented 4 months ago

As we just discussed, we might want to create a data structure (e.g., called "Elements" or "PeriodicTable" or something) that allows us to look up self-energy by atomic symbol/number and methodology (which could be the level of theory or linear regression).

E.g.:

energies, energies_metadata = PeriodicTable.get_energy(species=['H', 'C', 'N', 'O', 'F'], method='B3LYP/6-31G')

where energies would be a dict and energies_metadata could return the source of the individual parameter (as a dict) to be logged.
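A rough sketch of how such a lookup could behave, assuming a simple in-memory table keyed by method and element symbol. Everything here is hypothetical: the internal `_data` layout, the source strings, and the numeric values are placeholders chosen for illustration, and only the `get_energy` call shape comes from the example above.

```python
from typing import Dict, List, Tuple


class PeriodicTable:
    # Hypothetical store: method -> symbol -> (energy, source of the value).
    # The numbers are placeholders, not real reference energies.
    _data: Dict[str, Dict[str, Tuple[float, str]]] = {
        "B3LYP/6-31G": {
            "H": (-0.5, "placeholder: provided with dataset"),
            "C": (-37.8, "placeholder: linear regression"),
        }
    }

    @classmethod
    def get_energy(
        cls, species: List[str], method: str
    ) -> Tuple[Dict[str, float], Dict[str, str]]:
        """Return (energies, metadata) dicts keyed by element symbol."""
        if method not in cls._data:
            raise KeyError(f"No self-energies stored for method {method!r}")
        table = cls._data[method]
        missing = [s for s in species if s not in table]
        if missing:
            raise KeyError(f"No self-energies for {missing} with method {method!r}")
        energies = {s: table[s][0] for s in species}
        metadata = {s: table[s][1] for s in species}
        return energies, metadata


energies, energies_metadata = PeriodicTable.get_energy(
    species=["H", "C"], method="B3LYP/6-31G"
)
print(energies)
print(energies_metadata)
```

Keeping the source next to each value means the metadata dict can be logged directly, which is the use case described above.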

Alternatively, this data structure could be accessed by giving a dataset name, to get the values published with the paper (or say regressed from the data if not published).

This could also be a really simple (and potentially useful) dataset to create with QCArchive: for a range of elements, calculate the energies with the various levels of theory we are interested in. What is stored in the class would then just be a dump of that data.

This may also be over-engineering the problem.

wiederm commented 4 months ago

For now, I have added an AtomicSelfEnergies dataclass that accepts a dictionary of {"element name": atomic self-energy} and can then be indexed by either atomic number or element name. It is a base class that needs to be initialized at runtime with the dictionary passed in, so that it can be used in downstream applications.

I believe that this is a good compromise for now.
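For readers following along, a dataclass with this behavior could look roughly like the sketch below. This is an illustration under stated assumptions, not the code that was actually added to modelforge: the field name `energies`, the helper maps `_SYMBOL_TO_Z` and `_Z_TO_SYMBOL`, and the numeric values are all hypothetical, and the element map is deliberately truncated to a few elements.

```python
from dataclasses import dataclass, field
from typing import Dict, Union

# Truncated symbol <-> atomic number map, for illustration only.
_SYMBOL_TO_Z = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}
_Z_TO_SYMBOL = {z: symbol for symbol, z in _SYMBOL_TO_Z.items()}


@dataclass
class AtomicSelfEnergies:
    # Maps element name (symbol) to atomic self-energy; passed in at runtime.
    energies: Dict[str, float] = field(default_factory=dict)

    def __getitem__(self, key: Union[int, str]) -> float:
        """Look up a self-energy by atomic number (int) or element name (str)."""
        if isinstance(key, int):
            key = _Z_TO_SYMBOL[key]  # translate atomic number to element name
        return self.energies[key]


# Placeholder values, not real reference energies.
ase = AtomicSelfEnergies({"H": -0.5, "O": -75.0})
print(ase["H"], ase[1])  # both lookups refer to the same entry
```

Storing the values under one key type and translating the other on lookup avoids keeping two dictionaries in sync.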