CompPhysVienna / n2p2

n2p2 - A Neural Network Potential Package
https://compphysvienna.github.io/n2p2/
GNU General Public License v3.0

Impact of data set normalization on energy RMSE #147

Open moabe84 opened 2 years ago

moabe84 commented 2 years ago

Hi. I am seeing an unexpected discrepancy in the energy RMSE when data set normalization is not used:

  • Energy RMSE with data set normalization: 1.5 meV/atom
  • Energy RMSE without data set normalization: 4.5 meV/atom

The force RMSEs are the same in both cases. In principle, data set normalization should not affect the RMSEs. I suspect that this might be due to the data set, which was obtained from AIMD simulations at 4 different temperatures. I need to know whether this discrepancy is acceptable. Any comments and suggestions on this issue are greatly appreciated.

Many thanks. Mostafa

philippmisof commented 2 years ago

In general, data set normalization can affect the RMSE, especially if you are using the Kalman filter for training with the same set of parameters in both cases:

Data Set Normalization. Although in principle not a requirement for successful HDNNP training, it is beneficial to normalize data from reference calculations in such way that the fitting procedure becomes independent of a physical unit system. This is in particular relevant for Kalman filter training because a number of free parameters influencing the fit quality are dependent on the magnitude of numeric values in the data set. Recommendations found in the literature for optimal parameter settings are valid only for normalized data sets [63]. Since we aim at training both energies and forces we must ensure that a procedure normalizing both quantities is chosen.

  • from Singraber, A.; Morawietz, T.; Behler, J.; Dellago, C. Parallel Multistream Training of High-Dimensional Neural Network Potentials. J. Chem. Theory Comput. 2019, 15 (5), 3075–3092. https://doi.org/10.1021/acs.jctc.8b01092
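As a rough illustration of what normalizing both energies and forces can look like, here is a minimal sketch. It assumes one common choice (shift energies per atom by their mean, then scale energies and forces to unit standard deviation), which is not necessarily the exact scheme used by n2p2 or in the paper. All function names and numbers are hypothetical. It also shows that, for a fixed set of predictions, the RMSE is only rescaled by the normalization, so any real difference has to come from the training itself.

```python
import math
import statistics


def normalization_constants(energies_per_atom, force_components):
    """Return (mean_energy, energy_scale, force_scale).

    Assumed scheme: after normalization, energies per atom and force
    components both have unit standard deviation.
    """
    mean_energy = statistics.fmean(energies_per_atom)
    energy_scale = 1.0 / statistics.pstdev(energies_per_atom)
    force_scale = 1.0 / statistics.pstdev(force_components)
    return mean_energy, energy_scale, force_scale


def rmse(reference, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(statistics.fmean(squared))


# Hypothetical reference data and predictions (eV/atom and eV/Angstrom).
e_ref = [-3.512, -3.498, -3.505, -3.490]
e_pred = [-3.511, -3.500, -3.503, -3.492]
f_ref = [0.12, -0.30, 0.05, 0.21, -0.08, 0.00]
f_pred = [0.10, -0.28, 0.06, 0.20, -0.10, 0.01]

mean_e, e_scale, f_scale = normalization_constants(e_ref, f_ref)

# Forces have units of energy / length, so fixing the energy and force
# scales implies a length scale as well.
length_scale = e_scale / f_scale

# Energies are shifted and scaled; forces are only scaled.
e_ref_n = [(e - mean_e) * e_scale for e in e_ref]
e_pred_n = [(e - mean_e) * e_scale for e in e_pred]
f_ref_n = [f * f_scale for f in f_ref]
f_pred_n = [f * f_scale for f in f_pred]

# RMSEs in normalized units convert back to physical units by dividing
# by the corresponding scale factor.
e_rmse_phys = rmse(e_ref, e_pred)
e_rmse_back = rmse(e_ref_n, e_pred_n) / e_scale
f_rmse_phys = rmse(f_ref, f_pred)
f_rmse_back = rmse(f_ref_n, f_pred_n) / f_scale
```

When comparing the two trainings, it is worth checking that both RMSEs are reported in the same physical units before concluding that the fits differ.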

Whether the discrepancy is reasonable is hard to tell (at least for me) without knowing the actual data set. In general, I would recommend sticking to a normalized data set. If you need further information, I think @singraber would be the better contact.
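To make the point from the quoted passage more concrete, here is a toy sketch of why Kalman filter parameters are tied to the magnitude of the numbers in the data set. It fits a single weight in the model y = w * x with a scalar Kalman filter. This is only an illustration under simplifying assumptions (noise-free data, one parameter, hypothetical values for `p0` and `r`) and is not the multistream Kalman filter implemented in n2p2.

```python
def kalman_fit(xs, ys, p0=1.0, r=1.0, w0=0.0):
    """Scalar Kalman filter for the one-parameter model y = w * x.

    p0 (initial error covariance) and r (measurement noise) are the
    fixed "free parameters"; both values here are arbitrary choices.
    """
    w, p = w0, p0
    for x, y in zip(xs, ys):
        h = x                          # derivative of the prediction w.r.t. w
        k = p * h / (h * p * h + r)    # Kalman gain
        w += k * (y - w * x)           # weight update from the prediction error
        p = (1.0 - k * h) * p          # covariance update
    return w


W_TRUE = 2.0

# Data with values of order one.
xs = [0.5 + 0.05 * i for i in range(20)]
ys = [W_TRUE * x for x in xs]

# The same data expressed in a unit that is 1000 times larger,
# so every number is 1000 times smaller. W_TRUE is unchanged.
UNIT = 1000.0
xs_small = [x / UNIT for x in xs]
ys_small = [y / UNIT for y in ys]

# Identical filter parameters for both data sets.
w_a = kalman_fit(xs, ys)
w_b = kalman_fit(xs_small, ys_small)

rel_err_a = abs(w_a - W_TRUE) / W_TRUE
rel_err_b = abs(w_b - W_TRUE) / W_TRUE
print(f"order-one data:  relative error {rel_err_a:.3f}")
print(f"tiny-valued data: relative error {rel_err_b:.3f}")
```

With the same `p0` and `r`, the filter gets close to the true weight on the order-one data but barely moves on the tiny-valued data, because the gain depends on how large `h` is compared to `r`. Parameter settings tuned for normalized data can therefore behave quite differently on unnormalized data.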