deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

an error during compressing a model: "The name 'DescrptSeA' refers to an Operation not in the graph" #454

Closed zhangyongsdu closed 3 years ago

zhangyongsdu commented 3 years ago

I trained a model with the latest deepmd-kit api branch. More detailed information about the error can be found in the output log attached below.

DEEPMD INFO stage 1: train or refine the model with tabulation
DEEPMD INFO installed to: /tmp/pip-req-build-x5wk7g0h/_skbuild/linux-x86_64-3.7/cmake-install
DEEPMD INFO source : v1.2.2-462-gcb8f04f
DEEPMD INFO source brach: api
DEEPMD INFO source commit: cb8f04f
DEEPMD INFO source commit at: 2021-03-23 14:22:23 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build with tf inc: /projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/include;/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: m3a002
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
2021-03-29 01:37:51.500692: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-29 01:37:51.501443: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/1.10.7-mlx/lib:/usr/local/python/3.7.3-system/lib:/usr/local/gcc/8.1.0/lib64:/opt/munge-0.5.11/lib:/opt/slurm-19.05.4/lib:/opt/slurm-19.05.4/lib/slurm:/opt/munge-0.5.11/lib:/opt/slurm-19.05.4/lib:/opt/slurm-19.05.4/lib/slurm:
2021-03-29 01:37:51.501502: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-29 01:37:51.501551: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (m3a002): /proc/driver/nvidia/version does not exist
DEEPMD INFO ---Summary of DataSystem--------------------------------------------------------------
DEEPMD INFO found 85 system(s):

2021-03-29 01:38:05.561486: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-03-29 01:38:05.562771: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2494210000 Hz
2021-03-29 01:38:08.133517: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
DEEPMD INFO training data with min nbor dist: 0.9349186607997055
DEEPMD INFO training data with max nbor size: [34, 16]
2021-03-29 01:40:50.247771: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-29 01:40:50.248306: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-29 01:40:50.249084: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Traceback (most recent call last):
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/bin/dp", line 8, in <module>
    sys.exit(main())
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/main.py", line 352, in main
    compress(dict_args)
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/entrypoints/compress.py", line 102, in compress
    log_path=log_path,
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/entrypoints/train.py", line 211, in train
    _do_work(jdata, run_opt)
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/entrypoints/train.py", line 291, in _do_work
    model.build(data, stop_batch)
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/trainer.py", line 275, in build
    self.descrpt.enable_compression(self.min_nbor_dist, self.model_param['compress']['model_file'], self.model_param['compress']['table_config'][0], self.model_param['compress']['table_config'][1], self.model_param['compress']['table_config'][2], self.model_param['compress']['table_config'][3])
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/descriptor/se_a.py", line 261, in enable_compression
    self.table = DeepTabulate(self.model_file, self.type_one_side)
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/utils/tabulate.py", line 44, in __init__
    self.sel_a = self.graph.get_operation_by_name('DescrptSeA').get_attr('sel_a')
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3854, in get_operation_by_name
    return self.as_graph_element(name, allow_tensor=False, allow_operation=True)
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3726, in as_graph_element
    return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
  File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3786, in _as_graph_element_locked
    "graph." % repr(name))
KeyError: "The name 'DescrptSeA' refers to an Operation not in the graph."
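
The KeyError above is raised when the tabulation code looks up an operation by name in the frozen graph. A minimal sketch of checking for such names up front is shown below. It works on a plain list of operation names so that it does not depend on TensorFlow; the helper `missing_operations` and the sample names are hypothetical and not part of deepmd-kit.

```python
def missing_operations(node_names, required):
    """Return the required operation names that are absent from node_names.

    node_names: iterable of operation names found in a graph. With TensorFlow
    these could be collected as [op.name for op in graph.get_operations()].
    required:   operation names the caller intends to look up.
    """
    present = set(node_names)
    return [name for name in required if name not in present]


# Hypothetical names standing in for the contents of a frozen model.
names_in_graph = ["some_input", "some_other_op", "o_energy"]

missing = missing_operations(names_in_graph, ["DescrptSeA"])
if missing:
    print("operations not in the graph:", ", ".join(missing))
```

If the list of missing names is not empty, the model was most likely not produced by the version that the compression code expects.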

zhangyongsdu commented 3 years ago

The input for training:

{
  "_comment": "that's all",
  "model": {
    "descriptor": {
      "type": "se_a",
      "sel": [ 50, 50 ],
      "rcut_smth": 4.8,
      "rcut": 5.0,
      "neuron": [ 20, 40 ],
      "resnet_dt": false,
      "axis_neuron": 4,
      "seed": 145040989
    },
    "fitting_net": {
      "neuron": [ 80, 80, 80 ],
      "resnet_dt": true,
      "seed": 1932956396
    },
    "type_map": [ "Mg", "Y" ]
  },
  "loss": {
    "start_pref_e": 0.01,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "start_pref_v": 0,
    "limit_pref_v": 0
  },
  "learning_rate": {
    "start_lr": 0.001,
    "decay_steps": 2000,
    "decay_rate": 0.95,
    "stop_lr": 1e-08
  },
  "training": {
    "systems": [
      "../data.init/data-20190508/Mg-20190422/BCC/",
      "../data.init/data-20190508/Mg-20190422/FCC/",
      "../data.init/data-20190508/Mg-20190422/HCP/",
      "../data.init/data-20190508/Mg-Y-20190502/dis-a-edge-2/",
      "../data.init/data-20190508/Mg-Y-20190502/dis-a-edge-4/",
      "../data.init/data-20190508/Mg-Y-20190502/dis-a-edge-8/"
    ],
    "set_prefix": "set",
    "stop_batch": 2000000,
    "batch_size": [ 1, 1, 1, 1, 1, 1 ],
    "seed": 4232708341,
    "_comment": " frequencies counted in batch",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": 1,
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "load_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json"
  }
}

zhangyongsdu commented 3 years ago

I figured it out. The installed deepmd-kit is in fact v1.2.3, even though it is the most recent api version. A model trained by v1.3 is necessary for compression. Could I suggest that the training code in the api branch be updated to v1.3 in the future? It also seems that the neurons in the descriptor and the fitting net need to be the same. Otherwise I see this error:

ValueError: Dimensions must be equal, but are 16 and 32 for '{{node add_2}} = AddV2[T=DT_DOUBLE](Tanh_1, concat)' with input shapes: [1501,16], [1501,32].

The neuron settings for which the error occurs are: "neuron": [ 16,16,16 ], "resnet_dt": false, "axis_neuron": 4, "seed": 2332495487 }, "fitting_net": { "neuron": [ 32,64,128,256 ], "resnet_dt": true, "seed": 2644281072 },

njzjz commented 3 years ago

There is a tool in v1.2.3 to convert models from v1.2 to v1.3. You may use dp convert-to-1.3 -h to view the help.

AnguseZhang commented 3 years ago

@denghuilu may help.

amcadmus commented 3 years ago

The input file for model compression should be identical to that used to train the original model.
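
One way to act on this advice is to compare the two input files before running the compression. The sketch below is a minimal illustration using only the standard `json` module. The helper `diff_sections` is hypothetical and not part of deepmd-kit, and the two inline inputs are illustrative values built from the neuron lists quoted in this thread, not a confirmed pairing of files.

```python
import json


def diff_sections(original, candidate, path=""):
    """Return readable differences between two nested JSON values."""
    diffs = []
    if isinstance(original, dict) and isinstance(candidate, dict):
        # Walk the union of keys so additions and removals are both reported.
        for key in sorted(set(original) | set(candidate)):
            sub = f"{path}/{key}"
            if key not in candidate:
                diffs.append(f"{sub}: missing in compression input")
            elif key not in original:
                diffs.append(f"{sub}: missing in training input")
            else:
                diffs.extend(diff_sections(original[key], candidate[key], sub))
    elif original != candidate:
        diffs.append(f"{path}: {original!r} != {candidate!r}")
    return diffs


# Illustrative inputs; in practice these would be loaded from the two files.
train_input = json.loads(
    '{"model": {"descriptor": {"type": "se_a", "neuron": [20, 40]}}}'
)
compress_input = json.loads(
    '{"model": {"descriptor": {"type": "se_a", "neuron": [16, 16, 16]}}}'
)

for line in diff_sections(train_input["model"], compress_input["model"], "model"):
    print(line)
```

An empty result means the compared sections match. Any reported line points at a setting that differs from the one used to train the original model.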