choderalab / modelforge

Infrastructure to implement and train NNPs
https://modelforge.readthedocs.io/en/latest/
MIT License

Improved the efficiency of the `_subtract_self_energies` function. #121

chrisiacovella closed 1 month ago

chrisiacovella commented 1 month ago

This PR improves the efficiency of the `_subtract_self_energies` function in the `DataModule` (in `dataset.py`). As mentioned in issue #120, this step was prohibitively expensive for large datasets (2.5 hours on my laptop for ANI2X, over 3 hours on the lilac cluster). The fix is simple: store each molecule's self energy in a dictionary so that it does not need to be recomputed for every conformer of that molecule. The dictionary key is the molecule's atomic numbers concatenated into a string (separated by spaces). This key does not distinguish between different molecules that have the same atoms in the same order but different structures, which is fine, because our self energy is just the sum of the atomic self energies and does not depend on any other aspect of the molecule.
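The caching idea can be sketched roughly as follows. This is a minimal illustration, not the code in `dataset.py`: the function name `subtract_self_energies`, the `ATOMIC_SELF_ENERGIES` mapping, the record layout, and all energy values are hypothetical placeholders.

```python
import numpy as np

# Hypothetical per-element self energies keyed by atomic number (placeholder values).
ATOMIC_SELF_ENERGIES = {1: -0.5, 6: -37.8, 8: -75.0}


def subtract_self_energies(records, atomic_self_energies):
    """Subtract the summed atomic self energies from each conformer's energy.

    `records` is an iterable of (atomic_numbers, energy) pairs, one per conformer.
    Returns the corrected energies and the cache that was built along the way.
    """
    cache = {}  # key: space-separated atomic numbers -> molecular self energy
    corrected = []
    for atomic_numbers, energy in records:
        key = " ".join(str(int(z)) for z in atomic_numbers)
        if key not in cache:
            # Computed once per unique key, then reused for every other conformer.
            cache[key] = float(
                np.sum([atomic_self_energies[int(z)] for z in atomic_numbers])
            )
        corrected.append(energy - cache[key])
    return corrected, cache


# Two conformers of water share one cache entry; CO gets its own.
water = [8, 1, 1]
records = [(water, -76.4), (water, -76.3), ([6, 8], -113.2)]
corrected, cache = subtract_self_energies(records, ATOMIC_SELF_ENERGIES)
```

With this layout, the per-molecule sum is evaluated once per unique key rather than once per conformer, which is where the savings come from on datasets with many conformers per molecule.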

I also switched to summing a numpy array of the energies rather than summing in a for loop, which seems to provide roughly a 10% speedup.
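The two ways of summing can be compared in a small sketch. The `self_energies` mapping and its values are hypothetical, and the snippet only shows that both approaches give the same total; it makes no claim about timing.

```python
import numpy as np

# Hypothetical per-element self energies keyed by atomic number (placeholder values).
self_energies = {1: -0.5, 6: -37.8, 8: -75.0}
atomic_numbers = [6, 1, 1, 1, 1]  # methane

# Accumulate in an explicit Python loop.
total_loop = 0.0
for z in atomic_numbers:
    total_loop += self_energies[z]

# Gather the energies into a numpy array and sum it in one call.
energies = np.array([self_energies[z] for z in atomic_numbers])
total_numpy = float(np.sum(energies))
```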

This reduces the time to remove self energies for ANI2X from 2.5 hours to ~20 minutes on my local machine. For SPICE2 it takes about 11 minutes, so I'm not sure it's worth optimizing further (e.g., multithreading); improving the caching to avoid rerunning this step altogether is likely a better focus.
