Closed ZKC19940412 closed 2 years ago
The energy_train.out, force_train.out, and virial_train.out files are updated every 1000 steps, and all the other files are updated every 100 steps. Here is the relevant documentation: https://gpumd.zheyongfan.org/index.php/The_output_files_for_the_nep_executable
The reason for the different output frequencies is that when the number of structures in train.in is large, one may want to use mini-batch instead of full-batch training. In that case, calculating the full predicted data (all the mini-batches) and writing them to xxx_train.out too frequently becomes expensive. In contrast, test.in is usually smaller, so I decided to update the xxx_test.out files every 100 steps.
I guess you have used mini-batch instead of full-batch. In this case, the RMSEs reported in loss.out are expected to differ from those calculated from xxx_train.out. This is because the data in loss.out are for particular mini-batches, while those in xxx_train.out are for the whole train.in.
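The back-calculation from the xxx_train.out files can be sketched as follows. This is a minimal illustration, not the code used by the nep executable. It assumes each row stores the NEP predictions in the first half of its columns and the reference values in the second half (for example, three predicted and three reference force components per atom), and that the RMSE is taken over all components; check the documentation linked above for the layout in your version. The helper names rmse_from_rows and read_rows are hypothetical.

```python
import math


def read_rows(path):
    """Read a whitespace-separated numeric text file into lists of floats."""
    with open(path) as f:
        return [[float(x) for x in line.split()] for line in f if line.strip()]


def rmse_from_rows(rows):
    """RMSE between predicted and reference values.

    Assumed layout: the first half of each row holds predictions and the
    second half holds the matching reference values.
    """
    total, count = 0.0, 0
    for row in rows:
        half = len(row) // 2
        for pred, ref in zip(row[:half], row[half:]):
            total += (pred - ref) ** 2
            count += 1
    return math.sqrt(total / count)


# Tiny synthetic example in the assumed force layout:
# fx fy fz (predicted), then fx fy fz (reference).
rows = [
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.0, 0.0, 0.0, 0.0],
]
print(rmse_from_rows(rows))
```

With real data one would call rmse_from_rows(read_rows("force_train.out")). With mini-batch training, this whole-set value is not expected to match the per-batch value on the last line of loss.out.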
To monitor the training process, I suggest you simply visualize the data in loss.out, like this:
Based on this figure, I can infer that you must have used mini-batch, as the RMSE for force fluctuates a lot. The number 70.39 meV/A you mentioned is the force RMSE for a particular batch, and the number 79.68 meV/A you mentioned is the force RMSE for the whole training data set.
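Because the per-batch RMSE fluctuates, it can help to smooth it before judging the trend. The sketch below parses loss.out-style text and applies a trailing moving average to one column. The column index FORCE_TRAIN_COL = 5 (0-based) is an assumption about the loss.out layout and may differ between versions, the sample numbers are made up, and the function names are hypothetical.

```python
FORCE_TRAIN_COL = 5  # assumed 0-based column of the training force RMSE


def parse_table(text):
    """Parse whitespace-separated numeric text into a list of float rows."""
    return [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]


def moving_average(values, window):
    """Trailing moving average; early entries average over the points available."""
    out, running = [], 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / min(i + 1, window))
    return out


# Made-up rows in a loss.out-like shape (generation first, RMSE in column 5).
sample = """
100 0.5 0.1 0.1 0.002 0.080 0.01
200 0.4 0.1 0.1 0.002 0.070 0.01
300 0.4 0.1 0.1 0.002 0.090 0.01
400 0.3 0.1 0.1 0.001 0.075 0.01
"""
rows = parse_table(sample)
force_rmse = [row[FORCE_TRAIN_COL] for row in rows]
print(moving_average(force_rmse, window=2))
```

The smoothed curve only makes the trend easier to read; the error over the whole training set still has to come from the xxx_train.out files.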
If you use full-batch, you will find that the RMSEs in loss.out at an integer multiple of 1000 generations are the same as those calculated from xxx_train.out at the same multiple of 1000 generations.
Thank you so much for the explanations, and yes, I did use mini-batch for the training. Another related question: say I only have the nep.txt file saved but want to recover the energy and force _train.out and _test.out files. Is there a way to do single-point predictions, as the nep executable does? I tried an approach similar to the "check_force" folder, dumping the force and energy predictions from a single-step MD run with a time step of 0. However, the resulting RMSEs are very different from what the GPUMD program reported, and I am not entirely sure what went wrong.
Thank you so much for answering my question.
In principle, you should get consistent results between the nep executable and the gpumd executable, and I routinely use the example in PbTe/check_force to confirm this.
That said, a better way to do single-point calculations with a trained NEP (as saved to a nep.txt file) is to use the PyNEP or the Calorine packages developed by others. Here are the links:
If you are sure there is an inconsistency between the nep executable and the gpumd executable within the GPUMD package, I can have a closer look.
Thanks for pointing out these useful tools. I managed to get consistent RMSEs from the different sources using the data from the PbTe example. One more question, related to actually running GPUMD: say I want to output the density of the system along the trajectory. Would you recommend adding that output in the CUDA source code, or post-processing the current thermo.out (i.e., the volume change) to compute the density?
I would prefer that you post-process thermo.out.
Changing the thermo.out format would require many changes to the examples, tutorials, and other related packages. Unless there is a compelling reason, I prefer not to change it. As density can be calculated from the box data, I feel it is not necessary to output it.
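A post-processing step along these lines could look like the sketch below. It assumes the box data are the last columns of each thermo.out row (3 lengths for an orthogonal box, or 9 components a, b, c for a triclinic one, in Å) and that the total mass in amu is supplied by the user; both are assumptions to check against the thermo.out documentation for your version. The function names are hypothetical.

```python
AMU_TO_GRAM = 1.66053906660e-24  # grams per atomic mass unit
A3_TO_CM3 = 1.0e-24              # cubic centimetres per cubic angstrom


def box_volume(box):
    """Volume in A^3 from 3 box lengths or 9 cell-vector components."""
    if len(box) == 3:
        return box[0] * box[1] * box[2]
    if len(box) == 9:
        ax, ay, az, bx, by, bz, cx, cy, cz = box
        # |a . (b x c)| for a triclinic cell
        return abs(ax * (by * cz - bz * cy)
                   + ay * (bz * cx - bx * cz)
                   + az * (bx * cy - by * cx))
    raise ValueError("expected 3 or 9 box values")


def density_g_per_cm3(total_mass_amu, box):
    """Mass density in g/cm^3 for a given total mass (amu) and box (A)."""
    return total_mass_amu * AMU_TO_GRAM / (box_volume(box) * A3_TO_CM3)


def density_from_line(line, total_mass_amu, n_box=3):
    """Density from one text row, taking the last n_box columns as the box (assumed layout)."""
    values = [float(x) for x in line.split()]
    return density_g_per_cm3(total_mass_amu, values[-n_box:])


# Made-up row: the leading columns are ignored, the last three are box lengths in A.
print(density_from_line("300.0 1.0 -5.0 0.0 0.0 0.0 10.0 10.0 10.0", 1000.0))
```

Applying density_from_line to every row of thermo.out gives the density along the trajectory without touching the output format.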
Dear developers for NEP potential:
Hi, when I tried to develop a machine-learned potential using NEP, I noticed that the training RMSE in loss.out does not agree with the values back-calculated from the corresponding train.out files. For example, the last line of loss.out shows training RMSEs for energy and force of 1.03 meV/atom and 70.39 meV/A, respectively. However, if I load the data points from the train.out files and compute the RMSEs myself, I get 1.10 meV/atom and 79.68 meV/A. Does that mean the train.out and loss.out files are updated at different frequencies during optimization? Or are nep.txt, train/test.out, and loss.out updated at the same frequency? If not, it will be difficult to standardize what the true error of a NEP potential is at a given optimization step. I have attached the loss.out and train.out files in zip format in case you want to play around with them. (I used a single structure in the testing set, so test.out is probably meaningless.)
error_analysis.zip
Thank you so much for reading my message.