deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

Reproducibility of LAMMPS run with DP potential #3270

Open hl419 opened 9 months ago

hl419 commented 9 months ago

Summary

Hello,

I'm currently attempting to replicate an NVT simulation. I've set the seed for the initial velocity and confirmed that the initial velocities are consistent. I am also running the simulation on the same machine with the same DeePMD-kit version (single processor). However, I've noticed that the positions and velocities agree only up to a certain time-step, after which they start to deviate. I checked the previous issues #1656 and #2270, and it seems this comes from truncation error. I wonder if there is a way to improve the precision to avoid this from happening? I would like a deterministic simulation that I can reproduce with exactly the same result from the same inputs. Thanks!

DeePMD-kit Version

v2.2.1

TensorFlow Version

-

Python Version, CUDA Version, GCC Version, LAMMPS Version, etc

No response

Details

(Same as the Summary above.)

wanghan-iapcm commented 9 months ago

There is no way of reproducing long MD trajectories due to the chaotic nature of the many-body dynamic systems.

hl419 commented 9 months ago

> There is no way of reproducing long MD trajectories due to the chaotic nature of the many-body dynamic systems.

Thanks for the reply, but the deviation starts only after 10 ps. I do not think this should be expected....? The truncation error discussed in #1656 makes more sense to me. I am wondering if there is a way to improve this?

wanghan-iapcm commented 9 months ago

> > There is no way of reproducing long MD trajectories due to the chaotic nature of the many-body dynamic systems.
>
> Thanks for the reply, but the deviation starts only after 10 ps. I do not think this should be expected....? The truncation error discussed in #1656 makes more sense to me. I am wondering if there is a way to improve this?

I would not expect a consistency beyond 1000 time steps.

arianaqyp commented 9 months ago

> > > There is no way of reproducing long MD trajectories due to the chaotic nature of the many-body dynamic systems.
> >
> > Thanks for the reply, but the deviation starts only after 10 ps. I do not think this should be expected....? The truncation error discussed in #1656 makes more sense to me. I am wondering if there is a way to improve this?
>
> I would not expect a consistency beyond 1000 time steps.

Hi Wanghan,

Could you please provide further details on this? Feel free to correct me if I'm mistaken. I anticipate that a potential model, once trained, should be deterministic in its inference step, similar to a trained neural network model. Thus, would you consider this potential model (specifically, a deepmd trained potential) to be stochastic? If so, could you explain how it operates in that manner?

Thanks,

Ariana

asedova commented 9 months ago
  1. A SIMULATED MD trajectory in the digital/numerical world can totally be deterministic if it is coded that way. It's using RNGs! You can set the seeds. There is nothing inherently non-deterministic EXCEPT... floating-point non-associativity with asynchronous parallel processes (see below) and global-state seeds that may not be easily settable by the user.
  2. LAMMPS does have the option to run deterministically for debugging purposes; you can turn it on, but it is not on by default.
  3. TF models are only deterministic if you use their new determinism flags. This is because of atomic ops on GPUs and multithreaded CPUs, and runtime optimization of implementations. This is actually a big problem in deep learning. To see the extent of this effect in general DL, see some examples here: https://discuss.tensorflow.org/t/reproducible-with-and-without-tf-function/7938/3 https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism

It would actually be interesting to see how non-deterministic deepmd's use of Tensorflow actually is and what this means for an MD trajectory.
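
For concreteness, the TF2-native route looks roughly like the sketch below (assuming TF >= 2.8 for these two calls; whether and where they can hook into DeePMD-kit's tf.compat.v1 graph-mode code is exactly the open question):

```python
# Sketch only, not DeePMD code: the two TF2 calls documented for determinism.
import tensorflow as tf

tf.keras.utils.set_random_seed(1234)             # seeds the Python, NumPy, and TF RNGs
tf.config.experimental.enable_op_determinism()   # error out or fall back to
                                                 # deterministic kernels
```

Both calls have to run before the model/graph is built for a run to be reproducible end to end.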

njzjz commented 9 months ago

Our customized CUDA OP also uses non-deterministic atomicAdd.

https://github.com/deepmodeling/deepmd-kit/blob/b875ea8f6661b6e1567537ead7e2b4a8b14ea113/source/lib/src/gpu/prod_force.cu#L73

The deterministic implementation may need extra effort, which might not be worth doing.
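
The root cause is not the atomics themselves but floating-point rounding: addition is not associative, so whenever the accumulation order can vary between runs (as it does when many threads call atomicAdd), the low bits of the result can vary too. A tiny plain-Python illustration, not DeePMD code:

```python
# Summing the same numbers in a different order gives a slightly different
# result; the shuffle stands in for a run-to-run different atomicAdd order.
import random

vals = [1.0e8, 1.0, -1.0e8, 3.14, -2.7e-7] * 1000
s_ordered = sum(vals)

random.seed(0)
random.shuffle(vals)
s_shuffled = sum(vals)

print(s_ordered == s_shuffled)      # typically False
print(abs(s_ordered - s_shuffled))  # tiny, but enough to make an MD trajectory diverge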

asedova commented 9 months ago

> Our customized CUDA OP also uses non-deterministic atomicAdd.
>
> https://github.com/deepmodeling/deepmd-kit/blob/b875ea8f6661b6e1567537ead7e2b4a8b14ea113/source/lib/src/gpu/prod_force.cu#L73
>
> The deterministic implementation may need extra effort, which might not be worth doing.

Yep--already found this, and we have been working on re-coding it. What I am trying to see is:

  1. whether we can add the https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism flag anywhere (or whether we need the TF1 alternative, which is a patch plus an export variable; I am currently confused about what to expect with tensorflow.compat.v1), and where it would go in the code;
  2. whether there are any unsupported ops still in the code that have no deterministic version.

Lots of good info here: https://github.com/NVIDIA/framework-reproducibility/blob/master/doc/d9m/tensorflow.md

asedova commented 9 months ago

I would like to measure how much the custom CUDA kernel contributes, and how much any TF ops contribute. I am wondering if there is a way to use GPU for TF but disable the CUDA prod_force.cu kernel? It seems DP_VARIANT is all or nothing from my tests.

The other thing I am hung up on is trying to print model weights from the frozen model. I can't seem to get any of the tf1 compat methods to do it, I guess since it was made with TF1 on top of TF2 or something. I would love to compare the model weights over multiple "identical" runs of training.

asedova commented 9 months ago

By the way, there seems to be another source of "non-determinism" in the deepmd code that may actually be a bug. I ran the se_e2_a water example a bunch of times, and there is non-determinism in the learning-rate decay schedule. Using the default input file, sometimes the rate goes from 1.0e-03 to 3.6e-04 at step 5000, and sometimes it goes from 1.0e-03 to 9.0e-04! This doesn't seem to affect the loss values on the water test, but on a large system with lots of training data it makes a huge difference in the loss values and in the model that is trained.

Digging in, I see the learning rate is scheduled with a tf1 module. This shouldn't be a parallel op, so I wouldn't expect it to cause non-determinism the way the CUDA atomics in the other ops do. Maybe it's an uninitialized variable, or some sort of rounding instability? But it causes dramatic differences in the reproducibility of training on identical data with identical settings/hyperparameters and software stack.

njzjz commented 9 months ago

> I would like to measure how much the custom CUDA kernel contributes, and how much any TF ops contribute. I am wondering if there is a way to use GPU for TF but disable the CUDA prod_force.cu kernel? It seems DP_VARIANT is all or nothing from my tests.

DP_VARIANT=cpu will disable all customized CUDA ops. If you want to disable a single OP, you can comment out the following lines: https://github.com/deepmodeling/deepmd-kit/blob/91049df4d4cfbdf5074a4915e4409c01cae2333c/source/op/prod_force_grad_multi_device.cc#L275-L276

> The other thing I am hung up on is trying to print model weights from the frozen model. I can't seem to get any of the tf1 compat methods to do it, I guess since it was made with TF1 on top of TF2 or something. I would love to compare the model weights over multiple "identical" runs of training.

This is a bit complex, but the devel branch has implemented this feature (for se_e2_a only) as a part of the multiple-backend support. See #3323.

> By the way, there seems to be another source of "non-determinism" in the deepmd code that may actually be a bug. I ran the se_e2_a water example a bunch of times and there is non-determinism in the learning rate decay schedule. Using the default input file, sometimes I am getting it go from 1.0e-03 to 3.6e-04 on step 5000, and sometimes it goes from 1.0e-03 to 9.0e-04! This doesn't seem to affect the loss values on the water test, but on a large system with lots of training data it makes a huge difference in the loss values and on the model that is trained.

Do you change the number of training steps? The learning rate depends on it. https://github.com/deepmodeling/deepmd-kit/blob/91049df4d4cfbdf5074a4915e4409c01cae2333c/deepmd/tf/utils/learning_rate.py#L89-L91
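
In other words, the decay rate is derived from the total number of training steps, so the same step can show a different learning rate if the step count differs. A hedged sketch of that construction (illustrative parameter names and defaults; the exact formula is in the linked learning_rate.py):

```python
# Exponential decay whose rate is chosen so the learning rate reaches stop_lr
# at stop_steps; the value printed at step 5000 then depends on stop_steps.
import numpy as np

def decayed_lr(step, start_lr=1.0e-3, stop_lr=3.51e-8,
               decay_steps=5000, stop_steps=1_000_000):
    decay_rate = np.exp(np.log(stop_lr / start_lr) / (stop_steps / decay_steps))
    return start_lr * decay_rate ** (step // decay_steps)

print(decayed_lr(5000))                      # roughly 9.5e-4 with 1,000,000 total steps
print(decayed_lr(5000, stop_steps=100_000))  # roughly 6.0e-4 with 100,000 total steps
```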

asedova commented 9 months ago

Learning rate: the number of steps was the same; nothing was different except that I ran it again, I am pretty sure. I will run some more tests to verify.

What I am trying to do is enable TF to run on GPU but disable all the local deepmd CUDA kernels (non-TF). I guess I can go in and comment those all out and then build with GPU to get TF on the device.

Will check out the model checking options, thanks...

asedova commented 8 months ago

So I've done some reproducibility testing just on model training and inference. I ran the exact same training on the same data with the same hyperparameters twice, to get two "identical" models, on two different DFT datasets. I repeated this for different numbers of training steps. Then I ran dp test on 1000 frames for each model.

I have some baffling results. When I look at the maximum absolute difference in predicted force components (x, y, z) for one system (120 atoms, 110,000 training frames), the variations between "identical" training runs are pretty huge. Some atoms' predicted force components differ by as much as 1 eV/Å across the two "identical" trainings. The difference increases with the number of training steps: around 0.2 eV/Å for 100,000 training steps, 0.4 eV/Å for 200,000 training steps, and over 1 eV/Å for 1M training steps. These numbers were confirmed on a different system running the deepmd-kit container.

For the other system, 623 atoms and ~60K training frames, the maximum absolute difference is much lower: about 1.3e-11 eV/Å for 20K steps and about 1e-10 eV/Å for 100K training steps (this system takes longer to train, so I am still collecting data for longer training times). But that is a HUGE difference in non-deterministic variation between these two systems.

The other thing that is troubling is that for both systems, changing the random seed leads to a max abs difference in predicted force components of around 0.4 eV/Å.

I am sort of wondering if there is some bug in the code or the test module, because none of this makes any sense, especially the massive max differences for the one smaller system.

It would be good to run more tests on other datasets. I found a few things online.

Tests were all run with a pip install of DeePMD-kit on an x86 AMD EPYC CPU + NVIDIA A100 GPU with Ubuntu OS.
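
For reference, the comparison above is presumably of the kind sketched below: load the predicted per-atom force components of two "identically" trained models on the same frames and report the maximum absolute difference. The file names and the assumption of an (nframes*natoms, 3) array are illustrative, not the actual dp test output format.

```python
# Hypothetical files holding predicted force components from two "identical" runs.
import numpy as np

f_run1 = np.loadtxt("forces_run1.txt")   # shape (nframes * natoms, 3), assumed
f_run2 = np.loadtxt("forces_run2.txt")

diff = np.abs(f_run1 - f_run2)
print("max |dF| per component (eV/A):", diff.max(axis=0))
print("overall max |dF| (eV/A):", diff.max())
```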

njzjz commented 8 months ago

Do you get the same behavior with the CPU, or does it only happen on the GPU?

Printing to lcurve.out step by step may help find the difference.

Please note that, according to the TF documentation, tf.Session will introduce non-determinism. I am not sure where that non-determinism comes from, but it seems that deterministic results are not to be expected.

asedova commented 8 months ago

Yes, there should be some nondeterminism with TF. But I didn't expect it to affect the forces THAT much. That's a lot. And it seems strange that it would affect one system so much and not the other.

I will run some tests with CPU-only training and also with the CUDA kernels turned off.

Good idea about printing lcurve in smaller increments, will also try this.

asedova commented 8 months ago

I'm also wondering what it would take to turn on TF determinism in DeePMD. Some detailed notes on doing this can be found here: https://github.com/NVIDIA/framework-reproducibility/blob/master/doc/d9m/tensorflow.md

We are working with Duncan/NVIDIA so we can ask questions. I am just not sure what to do with the tf1 compat API on top of TF2 package. It seems to fall through the cracks. If I were to add tf_determinism.enable_determinism() to DeePMD code, where should it go? Also, tf.keras.utils.set_random_seed(SEED). I can try this if you tell me where I should put these commands.

njzjz commented 8 months ago

For random seed: we don't use any global random seed. Instead, the seed is passed from the input file, like

https://github.com/deepmodeling/deepmd-kit/blob/b875ea8f6661b6e1567537ead7e2b4a8b14ea113/deepmd/utils/network.py#L43

For determinism with tf.compat.v1: I don't know, as I have never used it. The most helpful reference should be https://github.com/tensorflow/community/pull/346

asedova commented 8 months ago

Yes, that is what I am talking about. Where in the code would be the top-most entrypoint to add this command so it propagates down to all the TF calls? Or maybe it needs to go in multiple places?

mtaillefumier commented 7 months ago

It is possible to obtain the same model parameters with deepmd, provided the environment is set up as described below.

The interested reader can try out the open PR combined with these three variables added to their scripts:

export TF_DETERMINISTIC_OPS=1
export TF_INTER_OP_PARALLELISM_THREADS=0
export TF_INTRA_OP_PARALLELISM_THREADS=0

We successfully ran the training and inference tests more than twice and got the same answer all the time.
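
If setting these from Python is more convenient, a minimal sketch (assuming the variables are read at import/startup time, so they must be set before TensorFlow or DeePMD-kit is imported):

```python
# Set the same three variables programmatically before any TF/DeePMD import.
import os

os.environ["TF_DETERMINISTIC_OPS"] = "1"
os.environ["TF_INTER_OP_PARALLELISM_THREADS"] = "0"
os.environ["TF_INTRA_OP_PARALLELISM_THREADS"] = "0"

import deepmd  # noqa: E402  -- imported only after the environment is set
```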