zhangyongsdu closed this issue 3 years ago
The input for training:
{
    "_comment": "that's all",
    "model": {
        "descriptor": {
            "type": "se_a",
            "sel": [50, 50],
            "rcut_smth": 4.8,
            "rcut": 5.0,
            "neuron": [20, 40],
            "resnet_dt": false,
            "axis_neuron": 4,
            "seed": 145040989
        },
        "fitting_net": {
            "neuron": [80, 80, 80],
            "resnet_dt": true,
            "seed": 1932956396
        },
        "type_map": ["Mg", "Y"]
    },
    "loss": {
        "start_pref_e": 0.01,
        "limit_pref_e": 1,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0,
        "limit_pref_v": 0
    },
    "learning_rate": {
        "start_lr": 0.001,
        "decay_steps": 2000,
        "decay_rate": 0.95,
        "stop_lr": 1e-08
    },
    "training": {
        "systems": [
            "../data.init/data-20190508/Mg-20190422/BCC/",
            "../data.init/data-20190508/Mg-20190422/FCC/",
            "../data.init/data-20190508/Mg-20190422/HCP/",
            "../data.init/data-20190508/Mg-Y-20190502/dis-a-edge-2/",
            "../data.init/data-20190508/Mg-Y-20190502/dis-a-edge-4/",
            "../data.init/data-20190508/Mg-Y-20190502/dis-a-edge-8/"
        ],
        "set_prefix": "set",
        "stop_batch": 2000000,
        "batch_size": [1, 1, 1, 1, 1, 1],
        "seed": 4232708341,
        "_comment": " frequencies counted in batch",
        "disp_file": "lcurve.out",
        "disp_freq": 1000,
        "numb_test": 1,
        "save_freq": 1000,
        "save_ckpt": "model.ckpt",
        "load_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "profiling": false,
        "profiling_file": "timeline.json"
    }
}
I figured it out. The installed deepmd-kit is in fact v1.2.3, even though it is the most recent version of the api branch. A model trained by v1.3 is necessary for compression. Could I suggest that the training code in the api branch be updated to v1.3 in the future?

It seems the neurons in the descriptor and the fitting net need to be the same. Otherwise I see an error:
ValueError: Dimensions must be equal, but are 16 and 32 for '{{node add_2}} = AddV2[T=DT_DOUBLE](Tanh_1, concat)' with input shapes: [1501,16], [1501,32].
The neurons used when the error occurs are:
"neuron": [ 16,16,16 ], "resnet_dt": false, "axis_neuron": 4, "seed": 2332495487 },
"fitting_net": { "neuron": [ 32,64,128,256 ], "resnet_dt": true, "seed": 2644281072 },
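One possible reading of the shapes in that error, offered only as a sketch: the failing node adds a tanh output of width 16 to a concatenation of width 32. That is what one would see if the step building the graph assumed every descriptor layer doubles the width of the previous one and formed the shortcut by concatenating the previous output with itself. This doubling rule is an assumption made for illustration and is not confirmed anywhere in this thread. The helper names below are hypothetical, and the code only does shape arithmetic without calling deepmd-kit or TensorFlow.

```python
def shortcut_shapes(widths, rows=1501):
    """For each layer after the first, return (hidden_shape, shortcut_shape).

    ASSUMPTION (not confirmed by the thread): the shortcut is the previous
    layer's output concatenated with itself, i.e. twice the previous width.
    """
    pairs = []
    for prev, cur in zip(widths, widths[1:]):
        hidden = (rows, cur)         # shape of the tanh output of this layer
        shortcut = (rows, 2 * prev)  # shape of concat([previous, previous])
        pairs.append((hidden, shortcut))
    return pairs


def first_mismatch(widths, rows=1501):
    """Return (layer_index, hidden_shape, shortcut_shape) for the first layer
    whose two shapes cannot be added, or None if every layer is consistent."""
    for layer, (hidden, shortcut) in enumerate(shortcut_shapes(widths, rows), start=1):
        if hidden != shortcut:
            return layer, hidden, shortcut
    return None


# Descriptor widths from the failing input: 16 vs 32, as in the error message.
print(first_mismatch([16, 16, 16]))
# Descriptor widths from the training input posted in this thread.
print(first_mismatch([20, 40]))
```

Under this assumption the descriptor setting [16, 16, 16] reproduces the 16 versus 32 mismatch, while [20, 40] passes, which would point at the descriptor widths rather than at a required match between descriptor and fitting net.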
There is a tool in v1.2.3 to convert models from v1.2 to v1.3. You may use dp convert-to-1.3 -h to view the help.
@denghuilu may help.
The input file for model compression should be identical to that used to train the original model.
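A quick way to check this before running compression is to compare the "model" sections of the two input files. The sketch below is only an illustration: the function name is hypothetical, the inline JSON strings stand in for the real files, and ignoring a "compress" key is an assumption based on the model_param['compress'] lookups visible in the traceback in this thread.

```python
import json


def model_differences(train_input, compress_input, ignore=("compress",)):
    """Return the sorted keys under "model" whose values differ.

    Keys listed in `ignore` are skipped; "compress" is ignored by default
    (an assumption, since compression adds its own settings there).
    """
    a = {k: v for k, v in train_input.get("model", {}).items() if k not in ignore}
    b = {k: v for k, v in compress_input.get("model", {}).items() if k not in ignore}
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Stand-ins for the contents of the training and compression input files.
trained = json.loads(
    '{"model": {"descriptor": {"type": "se_a", "neuron": [20, 40]},'
    ' "type_map": ["Mg", "Y"]}}'
)
changed = json.loads(
    '{"model": {"descriptor": {"type": "se_a", "neuron": [16, 16, 16]},'
    ' "type_map": ["Mg", "Y"]}}'
)

print(model_differences(trained, trained))  # no differences
print(model_differences(trained, changed))  # the descriptor section differs
```

With real files, the two dictionaries would come from json.load on the input used for training and the input passed to compression.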
I trained a model with the latest deepmd-kit api branch. More detailed information about the error can be found in the output log attached below.
DEEPMD INFO stage 1: train or refine the model with tabulation
DEEPMD INFO installed to: /tmp/pip-req-build-x5wk7g0h/_skbuild/linux-x86_64-3.7/cmake-install
DEEPMD INFO source : v1.2.2-462-gcb8f04f
DEEPMD INFO source brach: api
DEEPMD INFO source commit: cb8f04f
DEEPMD INFO source commit at: 2021-03-23 14:22:23 +0800
DEEPMD INFO build float prec: double
DEEPMD INFO build with tf inc: /projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/include;/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/include
DEEPMD INFO build with tf lib:
DEEPMD INFO ---Summary of the training---------------------------------------
DEEPMD INFO running on: m3a002
DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
DEEPMD INFO num_intra_threads: 0
DEEPMD INFO num_inter_threads: 0
DEEPMD INFO -----------------------------------------------------------------
2021-03-29 01:37:51.500692: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-29 01:37:51.501443: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/openmpi/1.10.7-mlx/lib:/usr/local/python/3.7.3-system/lib:/usr/local/gcc/8.1.0/lib64:/opt/munge-0.5.11/lib:/opt/slurm-19.05.4/lib:/opt/slurm-19.05.4/lib/slurm:/opt/munge-0.5.11/lib:/opt/slurm-19.05.4/lib:/opt/slurm-19.05.4/lib/slurm:
2021-03-29 01:37:51.501502: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-03-29 01:37:51.501551: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (m3a002): /proc/driver/nvidia/version does not exist
DEEPMD INFO ---Summary of DataSystem--------------------------------------------------------------
DEEPMD INFO found 85 system(s):
2021-03-29 01:38:05.561486: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-03-29 01:38:05.562771: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2494210000 Hz
2021-03-29 01:38:08.133517: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
DEEPMD INFO training data with min nbor dist: 0.9349186607997055
DEEPMD INFO training data with max nbor size: [34, 16]
2021-03-29 01:40:50.247771: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-29 01:40:50.248306: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-29 01:40:50.249084: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Traceback (most recent call last):
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/bin/dp", line 8, in <module>
sys.exit(main())
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/main.py", line 352, in main
compress(dict_args)
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/entrypoints/compress.py", line 102, in compress
log_path=log_path,
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/entrypoints/train.py", line 211, in train
_do_work(jdata, run_opt)
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/entrypoints/train.py", line 291, in _do_work
model.build(data, stop_batch)
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/trainer.py", line 275, in build
self.descrpt.enable_compression(self.min_nbor_dist, self.model_param['compress']['model_file'], self.model_param['compress']['table_config'][0], self.model_param['compress']['table_config'][1], self.model_param['compress']['table_config'][2], self.model_param['compress']['table_config'][3])
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/descriptor/se_a.py", line 261, in enable_compression
self.table = DeepTabulate(self.model_file, self.type_one_side)
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/deepmd/utils/tabulate.py", line 44, in __init__
self.sel_a = self.graph.get_operation_by_name('DescrptSeA').get_attr('sel_a')
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3854, in get_operation_by_name
return self.as_graph_element(name, allow_tensor=False, allow_operation=True)
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3726, in as_graph_element
return self._as_graph_element_locked(obj, allow_tensor, allow_operation)
File "/projects/lj94/softwares/tensorflow2.4.0_venv_GPU_api/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3786, in _as_graph_element_locked
"graph." % repr(name))
KeyError: "The name 'DescrptSeA' refers to an Operation not in the graph."
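The log above reports the source as v1.2.2-462-gcb8f04f, which has the shape of git describe output: nearest tag, number of commits since that tag, and abbreviated commit hash. A minimal sketch for reading such a string, assuming that format, is given below. The function name is hypothetical, and the code only parses the string; it does not query deepmd-kit. Note that a build many commits ahead of an older tag can contain newer code than the tag suggests, so the base tag alone does not settle which model format a build produces.

```python
import re

# Matches strings like "v1.2.2-462-gcb8f04f" or a bare tag like "v1.3.0".
_DESCRIBE = re.compile(r"v(\d+)\.(\d+)\.(\d+)(?:-(\d+)-g([0-9a-f]+))?")


def parse_describe(text):
    """Split a git-describe style version into (base_tag, commits_ahead, commit).

    base_tag is a tuple of three ints, commits_ahead is 0 for a bare tag,
    and commit is None for a bare tag. Raises ValueError on other input.
    """
    match = _DESCRIBE.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a git-describe style version: %r" % text)
    major, minor, patch, ahead, commit = match.groups()
    base = (int(major), int(minor), int(patch))
    return base, int(ahead) if ahead else 0, commit


base, ahead, commit = parse_describe("v1.2.2-462-gcb8f04f")
print(base, ahead, commit)
print("base tag is older than 1.3:", base < (1, 3, 0))
```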