brucefan1983 / GPUMD

Graphics Processing Units Molecular Dynamics
https://gpumd.org/dev
GNU General Public License v3.0

Inconsistency of training RMSE in loss.out and in energy_train.out and force_train.out #215

Closed by ZKC19940412 2 years ago

ZKC19940412 commented 2 years ago

Dear developers of the NEP potential:

Hi, when I tried to develop a machine-learned potential using NEP, I noticed that the training RMSE in loss.out does not agree with the values back-calculated from the corresponding train.out files. For example, the last line of loss.out shows training RMSEs for energy and force of 1.03 meV/atom and 70.39 meV/A, respectively. However, if I load the data points from the train.out files and calculate the RMSEs myself, I get 1.10 meV/atom and 79.68 meV/A. Does that mean the train.out and loss.out files are updated at different frequencies during optimization? Or are nep.txt, train/test.out, and loss.out updated at the same frequency? If not, it will be difficult to standardize what the true error of a NEP potential is at a given optimization step. I have attached the loss.out and train.out files in zip format in case you want to play around with them. (I used a single structure in the testing set, so test.out is probably meaningless.)

error_analysis.zip

Thank you so much for reading my message.

brucefan1983 commented 2 years ago

The energy_train.out, force_train.out, and virial_train.out files are updated every 1000 steps, and all the other files are updated every 100 steps. Here is the relevant documentation: https://gpumd.zheyongfan.org/index.php/The_output_files_for_the_nep_executable

The reason for the different output frequencies above is that when the number of structures in train.in is large, one may want to use mini-batch instead of full-batch training. In this case, calculating the full predicted data (all the mini-batches) and writing them to xxx_train.out too frequently becomes expensive. In contrast, test.in is usually smaller, so I decided to update the xxx_test.out files every 100 steps.

I guess you have used mini-batch instead of full-batch. In this case, the RMSEs reported in loss.out are expected to differ from those calculated from xxx_train.out. This is because the data in loss.out are for particular mini-batches, while those in xxx_train.out are for the whole train.in.
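A minimal sketch of why the two numbers can differ, using made-up error values rather than data from this issue. The error lists and batch sizes below are hypothetical. It shows that each mini-batch has its own RMSE, and that the RMSE of the whole set follows from the size-weighted average of the per-batch mean-square errors, not from any single batch.

```python
import math

def rmse(errors):
    """Root-mean-square of a list of (prediction - reference) values."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Hypothetical force-component errors (eV/A), split into two mini-batches
# of unequal size. These numbers are illustrative only.
batch_1 = [0.05, -0.07, 0.06, -0.04]
batch_2 = [0.10, -0.12]
whole_set = batch_1 + batch_2

rmse_batch_1 = rmse(batch_1)   # what a batch-wise report would show for batch 1
rmse_batch_2 = rmse(batch_2)   # ... and for batch 2
rmse_whole = rmse(whole_set)   # what a whole-set back-calculation gives

# The whole-set mean-square error is the size-weighted mean of the batch MSEs.
weighted_mse = (
    len(batch_1) * rmse_batch_1 ** 2 + len(batch_2) * rmse_batch_2 ** 2
) / len(whole_set)

print(f"batch 1: {rmse_batch_1:.4f}  batch 2: {rmse_batch_2:.4f}  whole: {rmse_whole:.4f}")
```

Depending on which batch happens to be reported, the batch value can sit above or below the whole-set value.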

To monitor the training process, I suggest you simply visualize the data in loss.out, like this:

[image: plot of the data in loss.out]
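As a rough sketch of how the data in loss.out could be pulled out for plotting: the helper below reads one column against the generation number. The column layout (generation first, training force RMSE at zero-based index 5) is an assumption that should be checked against the output-file documentation for the GPUMD version in use, and the sample rows are invented.

```python
def read_loss_column(text, column):
    """Return (generations, values) from loss.out-style text.

    `column` is a zero-based index; column 0 is assumed to hold the generation.
    """
    generations, values = [], []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) <= column:
            continue  # skip blank or truncated lines
        generations.append(int(fields[0]))
        values.append(float(fields[column]))
    return generations, values

# ASSUMED layout (verify for your version):
# gen, total loss, L1, L2, E-train, F-train, V-train, E-test, F-test, V-test
FORCE_TRAIN_COLUMN = 5

# Invented sample rows standing in for the contents of a real loss.out.
sample = """\
100 0.50 0.01 0.02 0.0020 0.09 0.03 0.0021 0.095 0.031
200 0.40 0.01 0.02 0.0015 0.07 0.03 0.0016 0.075 0.030
300 0.45 0.01 0.02 0.0012 0.08 0.03 0.0013 0.085 0.029
"""

gens, f_rmse = read_loss_column(sample, FORCE_TRAIN_COLUMN)
print(gens, f_rmse)
```

With a real file, the text would come from `open("loss.out").read()`, and the two lists can be passed to any plotting tool; a logarithmic axis makes mini-batch fluctuations easier to see.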

Based on this figure, I can infer that you must have used mini-batch, as the RMSE for force fluctuates a lot. The number 70.39 meV/A you mentioned is the force RMSE for a particular batch, and the number 79.68 meV/A you mentioned is the force RMSE for the whole training data set.

If you use full-batch, you will find that the RMSEs in loss.out at an integer multiple of 1000 generations are the same as those calculated from xxx_train.out at the same multiple of 1000 generations.
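The back-calculation from the xxx_train.out files can be sketched as follows. The column order is an assumption here (energy file: prediction then reference per structure; force file: three predicted components followed by three reference components per atom), as is averaging the force error over all components; both should be confirmed against the documentation linked above. The sample rows are invented.

```python
import math

def parse_rows(text):
    """Split whitespace-separated text into rows of floats, skipping blank lines."""
    return [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]

def energy_rmse(rows):
    """rows: [predicted, reference] per structure (assumed column order)."""
    return math.sqrt(sum((p - r) ** 2 for p, r in rows) / len(rows))

def force_rmse(rows):
    """rows: [fx, fy, fz predicted, fx, fy, fz reference] per atom (assumed order).

    The error is averaged over all 3 * N force components.
    """
    squared = sum((row[i] - row[i + 3]) ** 2 for row in rows for i in range(3))
    return math.sqrt(squared / (3 * len(rows)))

# Invented rows standing in for energy_train.out and force_train.out.
energy_text = "-3.501 -3.500\n-3.499 -3.500\n"
force_text = "0.1 0.0 0.0 0.0 0.0 0.0\n0.0 0.2 0.0 0.0 0.0 0.0\n"

e_rmse = energy_rmse(parse_rows(energy_text))
f_rmse_value = force_rmse(parse_rows(force_text))
print(f"energy RMSE: {e_rmse:.6f}  force RMSE: {f_rmse_value:.6f}")
```

For a full-batch run, values computed this way are the ones to compare with the loss.out line at the same multiple of 1000 generations.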

ZKC19940412 commented 2 years ago

Thank you so much for the explanations, and yes, I used mini-batch for the training. Another related question: say I only have the nep.txt file saved but want to regenerate the energy and force _train.out and _test.out files. Is there a way to do single-point predictions, for example using the nep executable? I have tried an approach similar to the "check_force" folder, dumping the force and energy predictions from a single-step MD run with a time step of 0. However, the corresponding RMSEs are very different from what I saw in the output of the GPUMD program, and I am not entirely sure what went wrong.

Thank you so much for answering my question.

brucefan1983 commented 2 years ago

In principle, you should get consistent results between the nep executable and the gpumd executable, and I routinely used the example in PbTe/check_force to confirm this.

That said, a better way to do single-point calculations with a trained NEP (as saved to a nep.txt file) is to use the PyNEP or the Calorine packages developed by others. Here are the links:

If you are sure there is inconsistency between the nep executable and the gpumd executable within the GPUMD package, I can have a closer look.

ZKC19940412 commented 2 years ago

Thanks for pointing out these useful tools. I managed to get consistent RMSEs from the different sources using data from the PbTe example. One more question, related to actually running GPUMD: say I want to output the density of the system along the trajectory. Would you recommend adding that output to the CUDA source code, or postprocessing the current thermo.out (i.e., the volume change) to compute the density?

brucefan1983 commented 2 years ago

I would recommend postprocessing thermo.out.

Changing the thermo.out format would cause a lot of changes to the examples, tutorials, and other related packages. If there is no compelling reason, I would prefer not to change it. As density can be calculated from the box data, I feel it is not mandatory to output it.
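A minimal sketch of that postprocessing, assuming the box data sit in the trailing columns of each thermo.out row (three box lengths for an orthogonal box, or nine cell-vector components for a triclinic one, in Angstrom). That layout, the function names, and the sample row are assumptions for illustration; the total mass in amu has to be supplied separately, since it is not taken from thermo.out here.

```python
# 1 amu / Angstrom^3 expressed in g/cm^3
AMU_PER_A3_TO_G_PER_CM3 = 1.66053906660

def cell_volume(box):
    """Volume in Angstrom^3 from 3 box lengths or 9 cell-vector components."""
    if len(box) == 3:
        return box[0] * box[1] * box[2]
    if len(box) == 9:
        a, b, c = box[0:3], box[3:6], box[6:9]
        cross = (
            b[1] * c[2] - b[2] * c[1],
            b[2] * c[0] - b[0] * c[2],
            b[0] * c[1] - b[1] * c[0],
        )
        return abs(sum(ai * xi for ai, xi in zip(a, cross)))  # |a . (b x c)|
    raise ValueError("expected 3 box lengths or 9 cell-vector components")

def density_from_row(row, total_mass_amu, triclinic=False):
    """Density in g/cm^3 for one row, ASSUMING the box is in the trailing columns."""
    box = row[-9:] if triclinic else row[-3:]
    return total_mass_amu * AMU_PER_A3_TO_G_PER_CM3 / cell_volume(box)

# Invented row: six thermodynamic values followed by a 10 x 10 x 10 Angstrom box.
row = [300.0, 1.0, -5.0, 0.1, 0.1, 0.1, 10.0, 10.0, 10.0]
rho = density_from_row(row, total_mass_amu=602.214076)
print(f"density: {rho:.4f} g/cm^3")
```

Applying `density_from_row` to every row of thermo.out gives the density along the trajectory without touching the output format.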