deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

v2.0.0-beta.0: Nan in summary histogram #689 #692

Closed · amcadmus closed this issue 3 years ago

amcadmus commented 3 years ago

Hello, I tried to run v2.0.0-beta.0 with `"loss": {"start_pref_e": 0.02, "limit_pref_e": 1, "start_pref_f": 1000, "limit_pref_f": 1, "start_pref_v": 0.02, "limit_pref_v": 1}`, and the following error appears: `tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: filter_type_all/bias_2_0_0_1/histogram`

But when I set `{"start_pref_v": 0, "limit_pref_v": 0}`, it works. I was wondering why I cannot train with the virial term; could you please advise what I should do? Thanks.

(Also, it works with v1.3.3.)

Originally posted by @AlexanderOnly in https://github.com/deepmodeling/deepmd-kit/discussions/689
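For reference, here are the loss settings quoted above written out as a complete block (reconstructed from the values in the post; the rest of the input script is omitted). The reported workaround amounts to setting the two virial prefactors to zero, which removes the virial term from the loss.

```json
"loss": {
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "start_pref_v": 0.02,
    "limit_pref_v": 1
}
```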

amcadmus commented 3 years ago

@marian-code Could you please take a look at the issue?

marian-code commented 3 years ago

This should be a consequence of numerical instability that can be encountered in the first iterations, mostly due to a wrong choice of training parameters.

See: SO-1 and SO-2

Changing the training parameters should remedy this in most cases. I suggest that this not be labeled as a bug.

We have two options here:

  1. Leave it as is and update the docs, noting that if you encounter this error, your training parameters are most likely set wrong.
  2. Catch the exception and do not log data from that iteration, but display a clear warning that something is probably set up wrong and the training will most likely not converge (a rough sketch of this follows below).

Please let me know which one you prefer.
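A rough sketch of what option 2 could look like, assuming a TF1-style session and `FileWriter` as used by the training loop; the function and argument names below are illustrative, not actual deepmd-kit internals:

```python
import logging

import tensorflow as tf

log = logging.getLogger(__name__)


def write_summary_safely(sess, writer, merged_summary, step, feed_dict=None):
    """Run the merged summary op and write it to TensorBoard, skipping the
    step with a warning instead of crashing when a histogram contains NaN."""
    try:
        summary_str = sess.run(merged_summary, feed_dict=feed_dict)
    except tf.errors.InvalidArgumentError as err:
        # "Nan in summary histogram" is raised as an InvalidArgumentError;
        # drop this iteration's summary but let training continue.
        log.warning(
            "Skipping summary at step %d (%s). NaNs in the histograms usually "
            "mean the training parameters are poorly chosen and the training "
            "will most likely not converge.",
            step,
            err.message,
        )
        return
    writer.add_summary(summary_str, step)
```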

amcadmus commented 3 years ago

Thanks for the reply! I realize that it should be attributed to the bug fixed in #685. Let's see if @AlexanderOnly still encounters the bug after upgrading to v2.0.0-beta2.