deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] Running problem with conda installed deepMD toolkit #1446

Closed halohyx closed 2 years ago

halohyx commented 2 years ago

Dear DeepMD developers,

I installed deepMD on a server using the conda easy-install method from the official documentation (https://github.com/deepmodeling/deepmd-kit/blob/master/doc/install/easy-install.md#with-conda). The command I used was:

conda create -n deepmd_tst deepmd-kit=2.0.0=gpu libdeepmd=2.0.0=gpu lammps-dp cudatoolkit=10.1 horovod -c https://conda.deepmodeling.org

I then tested the installation with the "dp -h" command, and the output suggests that deepMD was installed correctly:

2022-01-24 20:17:04.096599: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd/lib/python3.9/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
  _bootstrap._exec(spec, module)
usage: dp [-h] [--version]
          {config,transfer,train,freeze,test,compress,doc-train-input,model-devi,convert-from}
          ...

DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics

optional arguments:
  -h, --help  show this help message and exit
  --version   show program's version number and exit

Valid subcommands:
  {config,transfer,train,freeze,test,compress,doc-train-input,model-devi,convert-from}
    config           fast configuration of parameter file for smooth model
    transfer         pass parameters to another model
    train            train a model
    freeze           freeze the model
    test             test the model
    compress         compress a model
    doc-train-input  print the documentation (in rst format) of input training parameters.
    model-devi       calculate model deviation
    convert-from     convert lower model version to supported version

However, when I ran the official water example with "dp train water.json", I got the following output, ending in a KeyError:

2022-01-24 19:49:52.461464: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
WARNING:tensorflow:From /lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/tensorflow/python/compat/v2_compat.py:96: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
WARNING:root:Environment variable KMP_BLOCKTIME is empty. Use the default value 0
WARNING:root:Environment variable KMP_AFFINITY is empty. Use the default value granularity=fine,verbose,compact,1,0
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/importlib/__init__.py:169: UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged.
  _bootstrap._exec(spec, module)
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/common.py:334: UserWarning: the key n_neuron is deprecated, please use fitting_neuron instead
  warnings.warn(f"the key {ii} is deprecated, please use {key} instead")
/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/utils/compat.py:50: UserWarning: It seems that you are using a deepmd-kit input of version 0.x.x, which is deprecated. we have converted the input to >2.0.0 compatible
  warnings.warn(msg)
2022-01-24 19:50:04.562682: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-24 19:50:04.566883: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2022-01-24 19:50:04.881041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:b7:00.0 name: Tesla V100-SXM3-32GB computeCapability: 7.0
coreClock: 1.597GHz coreCount: 80 deviceMemorySize: 31.75GiB deviceMemoryBandwidth: 913.62GiB/s
2022-01-24 19:50:04.881215: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
2022-01-24 19:50:04.888855: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.10
2022-01-24 19:50:04.888972: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.10
2022-01-24 19:50:04.894870: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2022-01-24 19:50:04.896524: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2022-01-24 19:50:04.901345: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.10
2022-01-24 19:50:04.903826: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.10
2022-01-24 19:50:04.925130: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.7
2022-01-24 19:50:04.937177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2022-01-24 19:50:04.937262: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.10.1
2022-01-24 19:50:07.387275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2022-01-24 19:50:07.387406: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2022-01-24 19:50:07.387430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2022-01-24 19:50:07.416026: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9774 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM3-32GB, pci bus id: 0000:b7:00.0, compute capability: 7.0)
2022-01-24 19:50:07.416709: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-01-24 19:50:07.433709: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2700000000 Hz
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 24,25,30,72,73,78
OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #157: KMP_AFFINITY: 6 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 1 socket x 3 cores/socket x 2 threads/core (3 total cores)
OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 24 maps to socket 1 core 0 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 72 maps to socket 1 core 0 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 25 maps to socket 1 core 1 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 73 maps to socket 1 core 1 thread 1
OMP: Info #172: KMP_AFFINITY: OS proc 30 maps to socket 1 core 8 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 78 maps to socket 1 core 8 thread 1
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248784 thread 1 bound to OS proc set 25
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248787 thread 2 bound to OS proc set 30
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248788 thread 3 bound to OS proc set 72
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248789 thread 4 bound to OS proc set 73
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248790 thread 5 bound to OS proc set 78
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248791 thread 6 bound to OS proc set 24
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248785 thread 7 bound to OS proc set 25
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248792 thread 8 bound to OS proc set 30
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248793 thread 9 bound to OS proc set 72
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248794 thread 10 bound to OS proc set 73
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248795 thread 11 bound to OS proc set 78
OMP: Info #254: KMP_AFFINITY: pid 248434 tid 248796 thread 12 bound to OS proc set 24
DEEPMD INFO training data with min nbor dist: 0.8763010118574123
DEEPMD INFO training data with max nbor size: [38, 72]
Traceback (most recent call last):
  File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/bin/dp", line 10, in <module>
    sys.exit(main())
  File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/main.py", line 437, in main
    train_dp(**dict_args)
  File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 91, in train
    jdata = update_sel(jdata)
  File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 341, in update_sel
    descrpt_data = update_one_sel(jdata, descrpt_data)
  File "/lustre/home/acct-msekmr/msekmr/anaconda3/envs/deepmd_tst/lib/python3.9/site-packages/deepmd/entrypoints/train.py", line 318, in update_one_sel
    if parse_auto_sel(descriptor['sel']) :
KeyError: 'sel'

I did not change anything after the installation. I also tried installing other versions by changing the version specifications in the conda command, but the same error appeared each time. Could you suggest how to solve it? Thank you very much for your time!
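The traceback above ends in a lookup of descriptor['sel'], so one way to narrow the problem down is to check whether the descriptor block of the input defines that key before training starts. The following is a minimal sketch, not DeePMD-kit code: the function name `descriptor_missing_sel`, the assumed `model` / `descriptor` layout of the converted input, and the sample values are all assumptions made for illustration.

```python
def descriptor_missing_sel(config: dict) -> bool:
    """Return True if the descriptor block has no 'sel' key.

    Assumes the 2.x-style layout config["model"]["descriptor"];
    this is a hypothetical helper, not part of DeePMD-kit.
    """
    descriptor = config.get("model", {}).get("descriptor", {})
    return "sel" not in descriptor


# Hypothetical, trimmed-down input whose descriptor uses separate
# 'sel_a' / 'sel_r' keys instead of 'sel' (values are illustrative).
example = {
    "model": {
        "descriptor": {
            "type": "loc_frame",
            "sel_a": [16, 32],
            "sel_r": [30, 60],
        }
    }
}

if descriptor_missing_sel(example):
    print("descriptor has no 'sel' key; descriptor['sel'] would raise KeyError")
```

To check a real input, the same function could be applied to the dictionary returned by `json.load` on the training file.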

njzjz commented 2 years ago

Hi, this bug has been fixed in #1253. For this version, I suggest not using the local frame descriptor; use se_e2_a instead. See https://docs.deepmodeling.org/projects/deepmd/en/v2.0.0/model/overall.html for details.
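As a rough illustration of that switch, the sketch below replaces the descriptor block of a 2.x-style input dictionary with an se_e2_a block. The helper name `with_se_e2_a` and all numeric values are assumptions for illustration only (the `sel` values were simply chosen to be larger than the "max nbor size: [38, 72]" reported in the log); they should be adapted to the actual system and cutoff.

```python
import copy

# Illustrative se_e2_a settings (assumed values, not a recommendation).
SE_E2_A_DESCRIPTOR = {
    "type": "se_e2_a",
    "sel": [46, 92],          # max neighbors per atom type within rcut
    "rcut_smth": 0.5,         # where the smoothing of the cutoff starts
    "rcut": 6.0,              # cutoff radius
    "neuron": [25, 50, 100],  # embedding network layer sizes
    "axis_neuron": 16,
    "resnet_dt": False,
    "seed": 1,
}


def with_se_e2_a(config: dict) -> dict:
    """Return a copy of config whose descriptor is an se_e2_a block.

    Hypothetical helper; assumes the config["model"]["descriptor"] layout.
    """
    updated = copy.deepcopy(config)
    updated.setdefault("model", {})["descriptor"] = copy.deepcopy(SE_E2_A_DESCRIPTOR)
    return updated


old_config = {"model": {"descriptor": {"type": "loc_frame"}}}
new_config = with_se_e2_a(old_config)
print(new_config["model"]["descriptor"]["type"])
```

The original dictionary is left untouched, so the old and new inputs can be compared or written to separate files with `json.dump`.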

halohyx commented 2 years ago

Thanks a lot! After I changed the descriptor to se_e2_a, the problem was fixed. Thanks again for your quick reply and your valuable time~