choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

HDF5 file structure; encoding information about array axis #29

Closed chrisiacovella closed 8 months ago

chrisiacovella commented 9 months ago

The general design of the curated HDF5 files groups the conformers of a molecule into a single entry. For example, coordinates are stored as an [m, n, 3] array for each molecule (where m is the number of conformers, n is the number of atoms, and 3 corresponds to x, y, z).

For a quantity such as atomic_numbers, which is fixed for all conformers, we do not need to make m copies of the same data. To differentiate between these two types of data, I added an attribute called "series", where True indicates a series of per-conformer entries and False indicates something like atomic_numbers, which applies to all the conformers.
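As a rough sketch of the "series" attribute described above, the following writes one molecule to an in-memory HDF5 file with h5py. The group name `molecule_0`, the dataset names, and the sizes are hypothetical placeholders, not the layout modelforge actually uses.

```python
import h5py
import numpy as np

m, n = 5, 3  # hypothetical sizes: 5 conformers of a 3-atom molecule

# In-memory HDF5 file (core driver); nothing is written to disk.
with h5py.File("example.hdf5", "w", driver="core", backing_store=False) as f:
    mol = f.create_group("molecule_0")  # hypothetical group name

    # Per-conformer data: one [n, 3] block per conformer, so series=True.
    coords = mol.create_dataset("coordinates", data=np.zeros((m, n, 3)))
    coords.attrs["series"] = True

    # Shared data: stored once and valid for every conformer, so series=False.
    z = mol.create_dataset("atomic_numbers", data=np.array([8, 1, 1]))
    z.attrs["series"] = False

    # A reader can check the attribute before deciding how to index.
    series_flags = {name: bool(ds.attrs["series"]) for name, ds in mol.items()}
    coords_shape = mol["coordinates"].shape
```

Here a reader would index along the first axis only for datasets whose "series" attribute is True.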

{conformer_info}_{property_info}

conformer_info could be: series or single
property_info could be: atom or mol
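To illustrate how such a two-part label might be consumed, here is a minimal sketch that validates and splits it. It assumes the two parts are joined by a single underscore; the function name `parse_format` is hypothetical and not part of modelforge.

```python
# Allowed values, taken from the lists above.
CONFORMER_INFO = {"series", "single"}
PROPERTY_INFO = {"atom", "mol"}


def parse_format(label):
    """Split a label such as 'series_atom' into its two parts.

    Assumes an underscore separator. Raises ValueError for unknown labels.
    """
    conformer_info, sep, property_info = label.partition("_")
    if not sep or conformer_info not in CONFORMER_INFO or property_info not in PROPERTY_INFO:
        raise ValueError(f"unrecognized format label: {label!r}")
    return conformer_info, property_info


coords_format = parse_format("series_atom")    # e.g. coordinates
numbers_format = parse_format("single_atom")   # e.g. atomic_numbers
energy_format = parse_format("series_mol")     # e.g. energy per molecule
```

Under this reading, a single string replaces the boolean "series" flag and also records whether the property is per-atom or per-molecule.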

More generally, I think we need to ensure that, e.g., when defining energy per molecule, we have an array of shape [m, 1] rather than [m], so that how we interrogate the array (i.e., look at .shape) is the same as for a quantity that may have a spatial dependence (e.g., shape [m, 3]).
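The shape convention could be enforced with a small NumPy helper along these lines. This is only a sketch: the helper name `as_per_conformer_2d` and the sample values are made up for illustration.

```python
import numpy as np


def as_per_conformer_2d(values):
    """Return an array whose first axis is the conformer axis and which has
    at least two dimensions, so a scalar-per-conformer quantity becomes [m, 1]."""
    arr = np.asarray(values)
    if arr.ndim == 1:
        arr = arr.reshape(-1, 1)  # [m] -> [m, 1]
    return arr


energies = as_per_conformer_2d([-1.0, -1.1, -1.2, -1.3])  # scalar per conformer
dipoles = as_per_conformer_2d(np.zeros((4, 3)))           # vector per conformer

# Both can now be interrogated the same way via .shape.
n_conformers = energies.shape[0]
components = (energies.shape[1], dipoles.shape[1])
```

With this convention, `.shape[0]` is always the number of conformers and `.shape[1]` the number of components, without a special case for scalar quantities.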