brucefan1983 / GPUMD

Graphics Processing Units Molecular Dynamics
https://gpumd.org/dev
GNU General Public License v3.0

RMSE unreasonably low: Constant offset in NEP predicted energies #205

Closed NicklasOsterbacka closed 2 years ago

NicklasOsterbacka commented 2 years ago

I have trained a NEP model on ~130 structures; the model itself seems fine for the most part, but the predicted energies are rather odd. A parity plot based on the contents of energy_train.out has been attached, and there is an offset of ~2.7 meV for the predicted energies.

The corresponding RMSE, calculated directly from the definition, is ~2.7 meV/atom, but the energy RMSE reported in loss.out is ~0.00011 meV/atom. The latter is precisely the value I get if I subtract the average offset from the NEP predictions before calculating the RMSE.

This behaviour seems very strange to me! Is there a reason it's done this way? Is the RMSE calculated the same way when computing the energy loss during training?
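For reference, this is roughly how I computed the two numbers (a minimal sketch; it assumes the first two columns of energy_train.out hold the NEP-predicted and reference energies in eV/atom, respectively):

import numpy as np

pred, ref = np.loadtxt("energy_train.out", usecols=(0, 1), unpack=True)

rmse_raw = np.sqrt(np.mean((pred - ref) ** 2))               # plain definition
offset = np.mean(pred - ref)                                 # constant shift
rmse_shifted = np.sqrt(np.mean((pred - offset - ref) ** 2))  # offset removed

print(f"RMSE (raw):            {rmse_raw * 1e3:.5f} meV/atom")
print(f"mean offset:           {offset * 1e3:.5f} meV/atom")
print(f"RMSE (offset removed): {rmse_shifted * 1e3:.5f} meV/atom")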

[Image: parity plot of NEP-predicted vs. reference energies from energy_train.out]

brucefan1983 commented 2 years ago

I have not seen similar results before. Could you post your nep.in and plot all the curves in loss.out? Mentioning the GPUMD version would be helpful too.

Zheyong


NicklasOsterbacka commented 2 years ago

nep.in looks as follows:

version  3
type  2 Si C
lambda_1  0.1
lambda_2  0.1
lambda_f  20
cutoff  6 6
generation  100000

I cloned and compiled the git commit with hash ccc37454f4027a205129a2b56bdef7e6a9ef42bf, which corresponds to GPUMD 3.3.1. I observe the same behaviour with a model trained using GPUMD 3.2, though.

I use dummy data for the test set, but I have attached a plot for all the training data.

[Image: loss curves from loss.out for the training data]

brucefan1983 commented 2 years ago

Perhaps it's due to a too-small energy weight (or a too-large force weight). Usually the default weights are a good starting point.


NicklasOsterbacka commented 2 years ago

Yes, the defaults overall seem to be pretty great! I increased lambda_f at the suggestion of my colleague Paul Erhart, the supervisor of @elindgren.

As mentioned, I ran into similar issues with a model trained with GPUMD 3.2; see the parity plot below. nep.in as follows:

version  3
type  2 Si C
lambda_1  0.1
lambda_2  0.1
lambda_f  1.5
generation  100000

While lambda_f is higher than the default, it's not by much. The offset is of the same order of magnitude as in the case above. I can rerun with the default settings across the board if that would be helpful.

[Image: parity plot]

brucefan1983 commented 2 years ago

Could I ask what you mean by dummy data for the test set? The same as train.in?

NicklasOsterbacka commented 2 years ago

I am doing k-fold cross-validation for validation, so I don't need a test set. If I recall correctly, GPUMD would not run without a test set present, so it contains a single structure from my dataset.

brucefan1983 commented 2 years ago

I guess it is due to a very small test.in (e.g., only one or a few structures). The code relies on the structures in test.in to correct the last bias parameter in the neural network. This assumes that test.in contains a representative set of structures with reasonable energies, similar to those in train.in. This is just my guess. If you don't have enough testing structures, you can simply use the same as in train.in.

brucefan1983 commented 2 years ago

Ah, you only have a single structure in test.in. That is the reason. Will using the same structures as in train.in for test.in ruin your k-fold cross-validation? If not, you can simply copy test.in from train.in. I am sure you will not see the shift any more.

brucefan1983 commented 2 years ago

I will make a fix for you. Very easy to do.

brucefan1983 commented 2 years ago

I just checked the PbTe example, keeping only one structure in test.in and I still have not seen a clear shift of energy. Does it mean the structure in your test.in is not very consistent with those in train.in?

NicklasOsterbacka commented 2 years ago

The code relies on the structures in test.in to correct the last bias parameter in the neural network.

Interesting! So you are essentially using the test set for bias fitting, then? I didn't expect the test set to affect the training set loss!

It also seems like the RMSE in loss.out takes this correction into account somehow, while NEP predictions don't. As I mentioned earlier, manual calculation of the RMSE based on energy_train.out yields vastly different results unless I manually correct for the apparent shift.

If you don't have enough testing structures, you can simply use the same as in train.in.

I will give that a shot and get back to you as soon as that finishes up!

I just checked the PbTe example, keeping only one structure in test.in and I still have not seen a clear shift of energy. Does it mean the structure in your test.in is not very consistent with those in train.in?

It's possible! I am training two models on two very similar systems simultaneously, and I might have copied a structure from the other dataset into test.in here, since I didn't expect it to matter whatsoever. I will try training another model, this time making sure that test.in comes from train.in, and get back to you!

elindgren commented 2 years ago

Hijacking the thread in the meantime for a tangent.

I guess it is due to a very small test.in (e.g., only one or a few structures). The code relies on the structures in test.in to correct the last bias parameter in the neural network. This assumes that test.in contains a representative set of structures with reasonable energies, similar to those in train.in. This is just my guess. If you don't have enough testing structures, you can simply use the same as in train.in.

Is this correction of the bias parameter done while training the NEP model? In other words, is the energy part of the loss used to update the parameters calculated with this correction? If so, doesn't this constitute a form of test-set contamination, since information from the test set is effectively used to train the model? I would have expected such a correction to be based on the training set, not the test set.

brucefan1983 commented 2 years ago

You are right. It might be better to use the training data to fix this bias. However, I believe that if the energies in the training and testing sets are consistent, using the energy of any structure is OK. This bias is just a global constant (reference energy) which will not affect the other parameters in the NN.

brucefan1983 commented 2 years ago

I manually determine the bias parameter because this leads to much faster convergence of the loss during training. The reason is that this bias is just a global reference value and should not interfere with the training of the other parameters in the NN.

brucefan1983 commented 2 years ago

Here is the relevant code (src/main_nep/fitness.cu):

[Image: code snippet from src/main_nep/fitness.cu around line 154]

In line 154, the bias parameter is corrected based on the energy_shift_per_structure calculated from the testing set. I can change it to use the energy_shift_per_structure calculated from (a batch of) the training set, if you think that is more reasonable.

That said, I still think it does not matter which structures are used to fix this single bias parameter, as long as the energy reference values are consistent across all the structures.
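In case the screenshot does not render, here is a rough Python sketch of that correction (illustration only; the real implementation is the CUDA code in fitness.cu, and the exact sign convention and variable names there may differ):

import numpy as np

def shift_output_bias(bias, predicted, reference):
    # energy_shift_per_structure: mean difference between the reference and
    # NEP-predicted per-atom energies over the structures used for the correction
    energy_shift_per_structure = np.mean(np.asarray(reference) - np.asarray(predicted))
    # adding this constant to the last bias removes the global energy offset
    return bias + energy_shift_per_structure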

elindgren commented 2 years ago

I manually determine the bias parameter because this leads to much faster convergence of the loss during training. The reason is that this bias is just a global reference value and should not interfere with the training of the other parameters in the NN.

I agree, it makes sense from a training perspective. The idea is not far from standardization of the input energies: what you effectively do is shift the mean energy to zero from the perspective of the network, which generally improves convergence. Standard practice is to perform this standardization using the training set, though. As you say, taking the mean from the test set is probably not much of an issue in most cases, but in specific cases such as @NicklasOsterbacka's it leads to unexpected behaviour.

NicklasOsterbacka commented 2 years ago

A middle-ground approach could be to initialize the bias parameter based on the training set, similar to what you do now, and then fit it along with the rest of the parameters. The way it's done now seems very unconventional, though it apparently works well enough in most cases.

It does not seem to explain why the RMSE differs so much between loss.out and a manual calculation based on energy_train.out, though. The former is rather misleading in this particular case.

brucefan1983 commented 2 years ago

@elindgren Good point. Let's see what @NicklasOsterbacka shows later. If needed, I can definitely change the code; it just involves reordering a few lines.

elindgren commented 2 years ago

Just for clarification, is the bias included among the fitted parameters, or is it only set manually? Usually, one standardizes the data (mean = 0) and then trains the model, including the bias. I would expect that setting bias = mean is not optimal in some cases, for instance where the data is not Gaussian-distributed (e.g., a multi-modal distribution), but it is a good starting point for training.

brucefan1983 commented 2 years ago

@elindgren For conventional training algorithms, the method you mention might be good enough. However, for evolutionary algorithms, I found that manually determining the bias parameter can significantly speed up convergence.

Let me explain with an example. Suppose the function to be fitted is y = x + 1. In a population with 50 individuals, suppose you have two candidate solutions: one is y = x + 100, the other is y = 2 * x + 1. Which one is better? Based on the raw results, y = 2 * x + 1 has a smaller loss. However, if the constant 1 in the target function y = x + 1 is unimportant (which is true for interatomic potentials), I would say y = x + 100 is better (actually it is a perfect solution, with loss = 0). Therefore, it is better to calculate the loss modulo the unimportant constant. In conventional gradient-based algorithms, there is no population and the bias can be gradually learned, so it does not adversely affect the training process as it would in population-based evolutionary algorithms.
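To make this concrete, here is a small numerical version of that example (illustration only, not GPUMD code):

import numpy as np

x = np.linspace(-5.0, 5.0, 101)
target = x + 1

def rmse(pred, ref):
    return np.sqrt(np.mean((pred - ref) ** 2))

def rmse_mod_constant(pred, ref):
    # remove the best constant shift before comparing
    return rmse(pred - np.mean(pred - ref), ref)

cand_a = x + 100      # correct slope, wrong constant
cand_b = 2 * x + 1    # wrong slope, correct constant

print(rmse(cand_a, target), rmse(cand_b, target))                            # ~99 vs ~2.9: raw loss favours cand_b
print(rmse_mod_constant(cand_a, target), rmse_mod_constant(cand_b, target))  # 0 vs ~2.9: modulo-constant loss favours cand_a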

brucefan1983 commented 2 years ago

A middle-of-the-ground approach could be to initialize the bias parameter based on the training set, in a fashion similar to what you do now, and then fit it along with the rest of the parameters. The way it's done now seems very unconventional, though it apparently works well enough in most cases.

It does not seem to explain why the RMSE differs so much between loss.out and a manual calculation based on energy_train.out, though. The former is rather misleading in this particular case

Have you tried to use a structure within train.in for test.in?

elindgren commented 2 years ago

Let me explain with an example. Suppose the function to be fitted is y = x + 1. In a population with 50 individuals, suppose you have two candidate solutions: one is y = x + 100, the other is y = 2 * x + 1. Which one is better? Based on the raw results, y = 2 * x + 1 has a smaller loss. However, if the constant 1 in the target function y = x + 1 is unimportant (which is true for interatomic potentials), I would say y = x + 100 is better (actually it is a perfect solution, with loss = 0). Therefore, it is better to calculate the loss modulo the unimportant constant. In conventional gradient-based algorithms, there is no population and the bias can be gradually learned, so it does not adversely affect the training process as it would in population-based evolutionary algorithms.

Ah I see, good point. But I don't agree that y = x + 100 is a perfect solution with loss 0; in the case of interatomic potentials the force loss would be zero, but not the energy loss, and thus not the predicted structure energies. For most applications (such as MD) this is fine, but in certain cases the energies are also important. From this perspective, wouldn't starting with the bias as the dataset mean and then improving it during training be preferred? If all the terms in the loss function are standardized and equally weighted, I would expect the model to learn the optimal bias (which might be better than the mean).

brucefan1983 commented 2 years ago

The global energy constant is unimportant in any application. What matters is always the energy difference. I still believe the shift observed by @NicklasOsterbacka is due to an inconsistency between the energy reference of the structure in test.in and those in train.in.

elindgren commented 2 years ago

The global energy constant is unimportant in any application. What matters is always the energy difference.

Ah right, fair point :+1:

brucefan1983 commented 2 years ago

If not, I will debug :-)

NicklasOsterbacka commented 2 years ago

They just finished training! I trained them for 150 000 generations, which was probably overkill for this purpose. Oh well.

Training set as test set: [Image: parity plot, all_parity]

Single training data point as test set: [Image: parity plot, single_parity]

There is no apparent shift anymore. Thanks for helping me figure out the root cause! It seems I used the wrong test data.

The way the output bias is set is definitely unconventional; using the dataset to initialize its value and then fitting it would be more in line with standard practice, but I think that would only lead to very minor differences compared to the way it's done now. I definitely agree with @elindgren that setting the bias based on the test data feels like data contamination, though. It should probably be done using the training set instead.

For the project I am working on, we are interested in the energy difference between two NEP models. This "offset effect" thus affects our results to a relatively high degree. As you mentioned, it really doesn't matter too much for the energy landscape of a single model.

brucefan1983 commented 2 years ago

Thanks for the results. How about I change the code to use the training data to fix the bias, and you give it a test?

NicklasOsterbacka commented 2 years ago

Sounds good!

brucefan1983 commented 2 years ago

@NicklasOsterbacka it is in master now.

NicklasOsterbacka commented 2 years ago

Excellent! Thank you for the help.

brucefan1983 commented 2 years ago

One comment: instead of increasing lambda_f to 20, I would decrease lambda_e and lambda_v to small values. The reason is that when you increase lambda_f to 20, the regularization terms become relatively small (so the regularization might be insufficient), even if you set lambda_1 = lambda_2 = 0.1.
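For example, a variant of the nep.in above along those lines could look like this (leaving lambda_f at its default; the 0.05 values are only illustrative, not tested or recommended numbers):

version  3
type  2 Si C
lambda_1  0.1
lambda_2  0.1
lambda_e  0.05
lambda_v  0.05
cutoff  6 6
generation  100000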