atomistic-machine-learning / schnetpack

SchNetPack - Deep Neural Networks for Atomistic Systems

Failing to learn forces for periodic system even when energies are ignored in loss #407

Closed: JordD04 closed this issue 2 years ago

JordD04 commented 2 years ago

I've been using SchNetPack for a while now and I've continued to struggle to get force predictions to perform well. The forces are consistently underestimated. I've recently tried training only on forces; although this improves predictions, it's still pretty bad. My understanding is that on initialisation my NNs always predict forces of ≈0 eV/Å, and this improves a little during training, but the validation loss increases after only a few training iterations. Here's an example (plots: loss_binned, forces_binned).

Changing the size of the training set has a near-zero impact, and simplifying the NNs doesn't seem to mitigate the over-fitting. I've experimented with the learning-rate parameters and it's unclear whether that helps. Most of my forces are very low (90% ≈ 0.01 eV/Å), so I thought that may be the issue, but removing most of the low-force structures doesn't make as much of an improvement as I would like (this data is shown above). Here's my args.json; I've been able to get reasonably good energies with the same hyperparameters:

{
    "aggregation_mode": "avg",
    "batch_size": 300,
    "checkpoint_interval": 1,
    "contributions": null,
    "cuda": true,
    "cutoff": 5.0,
    "cutoff_function": "cosine",
    "datapath": "../../database.db",
    "dataset": "custom",
    "derivative": "forces",
    "environment_provider": "torch",
    "features": 64,
    "force": null,
    "interactions": 4,
    "keep_n_checkpoints": 3,
    "log_every_n_epochs": 1,
    "logger": "csv",
    "lr": 0.001,
    "lr_decay": 0.8,
    "lr_min": 1e-06,
    "lr_patience": 25,
    "max_epochs": 5000,
    "max_steps": null,
    "mode": "train",
    "model": "schnet",
    "modelpath": "model/",
    "n_epochs": 10000,
    "negative_dr": true,
    "normalize_filter": true,
    "num_gaussians": 30,
    "output_module": "atomwise",
    "overwrite": false,
    "parallel": true,
    "property": "formation_energy_per_atom",
    "rho": {
        "derivative": 1.0,
        "property": 0.0
    },
    "seed": null,
    "split": [
        null,
        null
    ],
    "split_path": "model/split.npz",
    "stress": null
}
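
For context, the "rho" block weights the force ("derivative") and energy ("property") contributions to the loss; with the values above, only the force term contributes. Schematically, the combined objective looks something like the sketch below (a simplified weighted MSE, not SchNetPack's exact implementation; the property key matches the args above):

import torch

def weighted_mse(pred, batch, rho_property=0.0, rho_derivative=1.0):
    # energy ("property") term: zero weight here, so it is ignored
    energy_term = torch.mean(
        (pred["formation_energy_per_atom"] - batch["formation_energy_per_atom"]) ** 2
    )
    # force ("derivative") term: the only contribution with these rho values
    force_term = torch.mean((pred["forces"] - batch["forces"]) ** 2)
    return rho_property * energy_term + rho_derivative * force_term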

I'm training on CASTEP geometry-optimisation data for a periodic ternary system. A colleague is training on VASP MD data (again periodic), for which the forces are higher (~2.5 eV/Å), and is getting better force performance, but still not ideal.

Does anyone have an idea of what I'm doing wrong, or am I reaching a limitation of SchNetPack?

stefaanhessmann commented 2 years ago

Hey @JordD04 , normally the force predictions should also work well for periodic boundary conditions. Did you check that the datapoints in the db have the pbc=True flag? It also seems that large batches can cause problems when training on forces. Could you rerun the training with a batch size of 10?
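
If it helps, a quick way to check the flag directly in the db is sketched below (this uses ASE's db interface; the path is the one from your args.json):

from ase.db import connect

# list structures that are not periodic in all three directions
with connect("../../database.db") as db:
    missing_pbc = [row.id for row in db.select() if not all(row.pbc)]

print(f"{len(missing_pbc)} structures without pbc=(True, True, True)")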

JordD04 commented 2 years ago

Hello @Stefaanhess , Thanks for getting back to me. I've checked my database and the atoms objects all include pbc=True. I've done some testing on batch size, but I've been unable to see any real difference. (These tests were performed on a smaller training set than in my original post: 6,000 vs 12,000 structures.)

Loss curves for batch sizes 150, 50, 10 and 1 (plots: 150_loss, 50_loss, 10_loss, 1_loss).

For most of these NNs, the MAE on the validation set only increases from initialisation (plot: 10_error).

smausenberger commented 2 years ago

When you only train on energies, does it overfit too? If the energies train well and the forces overfit, there might be a problem with your database. Check whether the data is in the right order.

JordD04 commented 2 years ago

Hey @smausenberger . I don't have an energy-only NN to hand at the moment, but I trained a potential a little while back with a rho of 0.7 for forces and 0.1 for energies (plots: R0_7_error, R0_7_loss). The bottom picture compares the training and validation energies and forces, but this is a query-by-committee implementation, so the errors shown are smaller than the errors of the individual NNs (plot: ml_plotter). It looks like the energies are being learned just fine here. In my experiments with rho, I've found that favouring energies makes quite a considerable improvement to the energy error, but favouring forces makes only a very small improvement to the forces.

What do you mean when you say "check the data is in the right order"? If you're just talking about matching the right energies/forces to their respective atoms objects, I'm fairly confident that that is all OK. My cells have different numbers of atoms, and if things weren't matching up properly, I imagine SchNetPack would be returning lots of 'list index out of range' errors. I think I'm saving my forces in the correct way. If I do:

from schnetpack import AtomsData

dataset = AtomsData(database_file)            # database_file = path to the .db
atoms, properties = dataset.get_properties(0)
print(properties['forces'])                   # one (x, y, z) row per atom, in eV/Ang

I receive something like this, which I believe is a numpy array of floats:

[[-0.00251  0.00327  0.00196]
 [-0.00251 -0.00327  0.00196]
 [ 0.00886 -0.00202  0.00848]
 [ 0.00886  0.00202  0.00848]
 [ 0.00403 -0.00076 -0.00332]
 [ 0.00403  0.00076 -0.00332]
 [-0.00387 -0.00578  0.00726]
 [-0.00387  0.00578  0.00726]
 [-0.00392 -0.00012 -0.00257]
 [-0.00392  0.00012 -0.00257]
 [ 0.00367  0.00787  0.0089 ]
 [ 0.00367 -0.00787  0.0089 ]
 [-0.00446 -0.03232 -0.02316]
 [-0.00446  0.03232 -0.02316]
 [ 0.00933  0.00384  0.00322]
 [ 0.00933 -0.00384  0.00322]
 [-0.01113 -0.01258 -0.00078]
 [-0.01113  0.01258 -0.00078]]

smausenberger commented 2 years ago

Your energies don't look that bad. I assume you train the forces as derivatives of the energy? If so, the forces should improve with better energies, unless your reference data is wrong.

By "right order" I meant checking whether the coordinate axes are switched, or whether the order of atoms in the force data differs from the order of atoms in the input data. That would explain the overfitting. I recently had a similar issue, and the problem was that I had messed up the reshaping of my force data.
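
One way to check this is to compare what ends up in the db against the raw parser output for the same structure. A rough sketch (raw_forces stands for whatever your CASTEP/VASP parser returned for structure 0, one row per atom in eV/Å):

import numpy as np
from schnetpack import AtomsData  # SchNetPack 1.x API, as in the snippet above

dataset = AtomsData("database.db")
atoms, properties = dataset.get_properties(0)
stored = np.asarray(properties["forces"])

assert stored.shape == (len(atoms), 3)    # one (x, y, z) row per atom
# assert np.allclose(stored, raw_forces)  # same atom order and same axis order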

JordD04 commented 2 years ago

I've just made a database from the output of a single CASTEP singlepoint calculation. The energy is fine, the forces look totally fine (assuming SchNetPack wants them in the format I listed in my last comment), and the cell parameters look fine. I have the positions in my atoms objects in absolute coordinates. Should I be using fractional/scaled coordinates?

stefaanhessmann commented 2 years ago

The positions should be in absolute coordinates, so that is not the problem. I just started training a model on some Si structures with energy+forces and will give you an update when I have the results!
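
If you want to double-check, a crude test is sketched below (an assumption about your pipeline, not SchNetPack code): positions accidentally stored as fractional coordinates would all lie in [0, 1), which is easy to spot for a real cell.

import numpy as np
from ase.db import connect

with connect("database.db") as db:        # illustrative path
    atoms = db.get(id=1).toatoms()

pos = atoms.get_positions()
if np.all((pos >= 0.0) & (pos < 1.0)):
    print("Positions look fractional; convert to Cartesian (Angstrom) before writing the db")
else:
    print("Positions look Cartesian")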

stefaanhessmann commented 2 years ago

The first epochs of my training are done and the results already look very different from yours (see the attached plot).

The units are eV/Å. I trained my model on Si structures with PBC and energy+forces+stress, and I use the current dev branch for training. From your args file I assume that you use the current master branch. Although I do not think it makes any difference, you could try the current dev branch.

I have some ideas that you could still try:

- set "aggregation_mode": "sum"
- switch the environment provider from "torch" to "ase"
- check the "normalize_filter" setting
- make sure your training/validation split is sensible (e.g. randomly generated)

If these things do not work, I would suggest trying the dev branch. This should reproduce my results if your data is OK. I am currently working on a tutorial for the usage of forces+PBC with our new dev branch; I hope to finish it by the end of the week. If the dev branch still produces bad results, the issue could be related to the data. Since you generate the data yourself, you could try to use parameters with higher accuracy.

JordD04 commented 2 years ago

Thanks for getting back so soon, @Stefaanhess . My GPU cluster is being a bit unresponsive today, but I will try everything you suggest and report back ASAP. The one point I can report on right now is my split file: I generate my split files randomly, but in such a way that a single geometry optimisation is never divided between the training set and the validation set (i.e. one geom walk can only end up in either the training set or the validation set).

JordD04 commented 2 years ago

Sorry for the long wait. Apparently many of my cluster's GPUs were being reserved for an event. I reran the training with: "aggregation_mode": "sum", "environment_provider": "ase", and "normalize_filter": true, both individually and together. Unfortunately I get the same results.

My group is using a slightly older version of SchNetPack because we've had trouble getting the newer version to work. I think

    ep_device = torch.device(args.environment_provider_device)
AttributeError: 'Namespace' object has no attribute 'environment_provider_device'

was the error people were reporting to me.
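
A possible stop-gap on our side might be to fall back to a default device when the argument is missing, something like the sketch below, though I haven't verified it:

import torch

# hypothetical patch around the failing line in the training script
ep_device = torch.device(getattr(args, "environment_provider_device", "cpu"))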

I've attempted to train using the Dev branch but it's causing me a little bother. I generally train with spk_run.py which seems to be absent on the Dev branch. I've been using https://github.com/atomistic-machine-learning/schnetpack/blob/master/src/examples/ethanol_script.py and the ReadTheDocs to aid in writing a script but it looks like a lot has changed in the Dev branch. The AtomsData class has been removed, but I think I can replace that with ASEAtomsData (?). However, schnetpack.atomistic seems to be completely different now. Is there any documentation for this?

In any case, these differences in the code may explain why force training is working for you but not me.

JordD04 commented 2 years ago

Hey @Stefaanhess I've been playing around with the dev branch, training both from the CLI and directly by writing scripts (I've found some relevant tutorials). I had a little trouble getting it running. With the CLI I had a problem with self.trainer missing the _device_type attribute and self.trainer.training_type_plugin missing the should_rank_save_checkpoint attribute; I implemented a hacky fix for both to get it running. I also tried running through a script (via this tutorial: https://github.com/atomistic-machine-learning/schnetpack/blob/dev/docs/tutorials/tutorial_03_force_models.ipynb) and again had issues with should_rank_save_checkpoint, which stopped the best model ever being saved. I just changed the statement to if 1 == 1: as a quick "fix" to get it to run. Logging to CSV files is not working as intended either. The data looks like this:

0.008915147,0.206631914,0.040407341,0,39,,,
,,,1,39,0.731627464,0.057642017,
,,,1,49,,,0.016790237
0.008443713,0.184204191,0.037232045,1,79,,,
,,,2,79,0.156106502,0.036610086,
,,,2,99,,,0.005108949

It's all there, but it's split over multiple files.

That aside: I trained a model and got pretty much the same results that I got when training on the master branch. My input file is here: jord-schnet-dev.txt (it's a .txt because GitHub didn't like my .py), and the force errors are shown below (plot: schnetpack-dev-branch-error).

I intend to try again with some slightly different hyperparameters and a slightly different data set but I'm not hopeful that that will be fruitful.

Is there any chance you could share your Si database and your python input file? If I can run that and reproduce your results, I can be satisfied that it's an issue with the data and not a code issue.

Alternatively, maybe I could share my data with you and you could try to train it? Either would be greatly appreciated.

stefaanhessmann commented 2 years ago

Hey @JordD04 , sorry for the late reply. Unfortunately I will not be able to share my dataset, since it is related to current research, or to train any models on your data; I hope you can understand this. I am currently looking for a good benchmark dataset for PBC structures with forces. We are also working on a collection of example models, which will include the settings for training on forces with PBC structures.

I have uploaded my config file for you in this gist. You can just create a folder configs/experiment and save it as <experiment_name>.yaml. Afterwards you should be able to run it on your own data by using spktrain --config-name=configs experiment=<experiment_name> data.datapath=<dataset_path>.db data.num_train=0.8 data.num_val=0.1. I just tried it with the current dev branch and it runs fine.

I hope this helps to solve your problem. Let me know if you run into any problems with the gist. I will also give an update here when we find a suitable dataset for PBC with forces.

ktschuett commented 2 years ago

[...] it looks like a lot has changed in the Dev branch. The AtomsData class has been removed, but I think I can replace that with ASEAtomsData (?). However, schnetpack.atomistic seems to be completely different now. Is there any documentation for this?

You can find the documentation here: https://schnetpack.readthedocs.io/en/dev/
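
For the data part, a minimal sketch of the replacement (assuming the db carries the metadata format expected by the dev branch; the path is illustrative):

from schnetpack.data import ASEAtomsData

dataset = ASEAtomsData("database.db")
props = dataset[0]          # dict of torch tensors
print(props["forces"])      # shape (n_atoms, 3)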

stefaanhessmann commented 2 years ago

Hey @JordD04 , I have been looking for an available dataset with PBC + forces and found one for Li4P2O7, which is also used here and here. Although the full tutorial is not ready yet, I will already share my files so that you can try to reproduce my results on the same dataset.

Preparing the Dataset
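
Roughly, this step converts the reference calculations into a SchNetPack db. A minimal sketch with the dev-branch API (illustrative file names; energies in eV, forces in eV/Å):

import numpy as np
from ase.io import read
from schnetpack.data import ASEAtomsData

# parse the reference frames into ASE Atoms objects (illustrative file name)
atoms_list = read("li4p2o7_frames.extxyz", index=":")
property_list = [
    {"energy": np.array([atoms.get_potential_energy()]),
     "forces": atoms.get_forces()}
    for atoms in atoms_list
]

dataset = ASEAtomsData.create(
    "li4p2o7.db",
    distance_unit="Ang",
    property_unit_dict={"energy": "eV", "forces": "eV/Ang"},
)
dataset.add_systems(property_list, atoms_list)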

Training the Model
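
Roughly, the training step with the dev branch looks like the sketch below (paraphrasing the dev-branch force tutorial; the hyperparameters here are placeholders, not necessarily the ones from my config):

import torch
import torchmetrics
import pytorch_lightning as pl
import schnetpack as spk
import schnetpack.transform as trn

cutoff = 5.0
n_atom_basis = 128

# data module with neighbor lists and energy offset removal
data = spk.data.AtomsDataModule(
    "li4p2o7.db",
    batch_size=10,
    num_train=0.8,
    num_val=0.1,
    transforms=[
        trn.ASENeighborList(cutoff=cutoff),
        trn.RemoveOffsets("energy", remove_mean=True, remove_atomrefs=False),
        trn.CastTo32(),
    ],
)

# PaiNN representation with an energy head and forces as the negative gradient
radial_basis = spk.nn.GaussianRBF(n_rbf=20, cutoff=cutoff)
painn = spk.representation.PaiNN(
    n_atom_basis=n_atom_basis,
    n_interactions=3,
    radial_basis=radial_basis,
    cutoff_fn=spk.nn.CosineCutoff(cutoff),
)
pred_energy = spk.atomistic.Atomwise(n_in=n_atom_basis, output_key="energy")
pred_forces = spk.atomistic.Forces(energy_key="energy", force_key="forces")

nnpot = spk.model.NeuralNetworkPotential(
    representation=painn,
    input_modules=[spk.atomistic.PairwiseDistances()],
    output_modules=[pred_energy, pred_forces],
    postprocessors=[trn.CastTo64(), trn.AddOffsets("energy", add_mean=True)],
)

# weight forces much more strongly than energies in the loss
output_energy = spk.task.ModelOutput(
    name="energy", loss_fn=torch.nn.MSELoss(), loss_weight=0.01,
    metrics={"MAE": torchmetrics.MeanAbsoluteError()},
)
output_forces = spk.task.ModelOutput(
    name="forces", loss_fn=torch.nn.MSELoss(), loss_weight=0.99,
    metrics={"MAE": torchmetrics.MeanAbsoluteError()},
)

task = spk.task.AtomisticTask(
    model=nnpot,
    outputs=[output_energy, output_forces],
    optimizer_cls=torch.optim.AdamW,
    optimizer_args={"lr": 1e-4},
)

trainer = pl.Trainer(max_epochs=100, default_root_dir="li4p2o7_model")
trainer.fit(task, datamodule=data)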

Evaluation

Results

I hope this small tutorial helps to reproduce my results. If the tutorial does not work at some point, please let me know. We will soon discuss benchmark datasets for PBC + forces and afterwards add this tutorial, with more details, to the project. After testing the results for Li4P2O7 you could rerun the training with my config file and your own data. If there is still a big difference in performance, I assume it must be related to your data.

JordD04 commented 2 years ago

Hey @Stefaanhess . I'm still doing some testing, but I thought you might be interested in an update. I tried your tutorial and was able to reproduce your results. I applied the same input files to my data, but it looks like it's struggling there (plots: LixFeS2-test, LixFeS2-train).

I did, however, try a different but similar system that I'm also trying to learn, and it performs much better (plots: LiNiS-train, LiNiS-test). The error distribution is already looking better than what I was getting before, and it gets better still if I move from 1,000 training structures to 10,000 (plot: LiNiS-train-10000).

I think the difference in performance may come down to the difference in NN architectures. I had previously only ever trained with SchNet (including on the dev branch), but noticed that your tutorial uses PaiNN. To test this, I trained two NNs with the same split file: the first on SchNet with the default hyperparameters (plot: LiNiS-dev-SchNet), and the second on PaiNN with the default hyperparameters (plot: LiNiS-dev-PaiNN).

The PaiNN plot looks much closer to y=x!

I'm currently working on a learning curve but I'm very optimistic at this point.

Thanks for your help.

stefaanhessmann commented 2 years ago

Hey @JordD04 , I am happy to hear that you made some progress on this. Sorry, I somehow completely forgot to mention the PaiNN representation; in most cases I would recommend using PaiNN as the default model. For now I will close the issue, since the model generally seems to work for PBC with forces. If you find any interesting results on this topic, feel free to add them to this issue, or reopen it if you still encounter problems. I am also interested in the results. Best, Stefaan

JordD04 commented 2 years ago

For those interested in this topic: here's a new learning curve I put together (plot: learning_curve-LiNiS), and here's a plot of DFT vs NN forces for the NN trained on 80,000+ structures (plot: ml_plotter).

My error is still quite large (190 meV/Å), but at least my distribution looks like y=x and not y=0. I'm still just using the default parameters, so maybe I just need to do some hyperparameter optimisation and fiddle with my training set a bit?

Thanks again for your help!