[BUG] Single precision training error

denghuilu commented 3 years ago

Summary

Version: latest devel branch. Installation: Python interface built in single precision, with cmake_args set as follows:

    cmake_args=[
        f"-DTENSORFLOW_ROOT:STRING={tf_install_dir}",
        "-DBUILD_PY_IF:BOOL=TRUE",
        "-DBUILD_CPP_IF:BOOL=FALSE",
        "-DFLOAT_PREC:STRING=low",
    ]

Installation works fine.

pip install .

Looking in indexes: http://mirrors.cloud.aliyuncs.com/pypi/simple/
Processing /root/denghui/dp-api/deepmd-kit
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied: pyyaml in /root/denghui/dp-api/tensorflow_venv/lib/python3.6/site-packages/PyYAML-5.4.1-py3.6-linux-x86_64.egg (from deepmd-kit==1.2.3.dev627+g59c6fde.d20210425) (5.4.1)
Requirement already satisfied: scipy in /root/denghui/dp-api/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==1.2.3.dev627+g59c6fde.d20210425) (1.5.4)
Requirement already satisfied: typing-extensions in /root/denghui/dp-api/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==1.2.3.dev627+g59c6fde.d20210425) (3.7.4.3)
Requirement already satisfied: dargs>=0.2.2 in /root/denghui/dp-api/tensorflow_venv/lib/python3.6/site-packages/dargs-0.2.2-py3.6.egg (from deepmd-kit==1.2.3.dev627+g59c6fde.d20210425) (0.2.2)
Requirement already satisfied: numpy in /root/denghui/dp-api/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==1.2.3.dev627+g59c6fde.d20210425) (1.19.2)
Requirement already satisfied: tqdm in /root/denghui/dp-api/tensorflow_venv/lib/python3.6/site-packages (from deepmd-kit==1.2.3.dev627+g59c6fde.d20210425) (4.59.0)
Building wheels for collected packages: deepmd-kit
  Building wheel for deepmd-kit (PEP 517) ... done
  Created wheel for deepmd-kit: filename=deepmd_kit-1.2.3.dev627+g59c6fde.d20210425-cp36-cp36m-linux_x86_64.whl size=1499796 sha256=4f3dec01afef1c8617cb4e5dafa9bbaeb5e98e2e7b139dd0198215f7dac3f35f
  Stored in directory: /root/.cache/pip/wheels/21/f4/ed/167c943f5247a0b258bf59868ff9e8028e9cf4bd783233c161
Successfully built deepmd-kit
Installing collected packages: deepmd-kit
Successfully installed deepmd-kit-1.2.3.dev627+g59c6fde.d20210425

Steps to Reproduce

cd $deepmd_source_dir/examples/water/se_e2_a
dp train input.json

The following error occurs:

DEEPMD INFO    ---Summary of DataSystem: training     -----------------------------------------------
DEEPMD INFO    found 3 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                               ../data/data_0/     192       1      80  0.250    T
DEEPMD INFO                               ../data/data_1/     192       1     160  0.500    T
DEEPMD INFO                               ../data/data_2/     192       1      80  0.250    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    ---Summary of DataSystem: validation   -----------------------------------------------
DEEPMD INFO    found 1 system(s):
DEEPMD INFO                                        system  natoms  bch_sz   n_bch   prob  pbc
DEEPMD INFO                                ../data/data_3     192       1      80  1.000    T
DEEPMD INFO    --------------------------------------------------------------------------------------
DEEPMD INFO    training without frame parameter
2021-04-25 17:31:15.020943: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-04-25 17:31:15.021616: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2500000000 Hz
2021-04-25 17:31:15.366409: F tensorflow/core/framework/tensor.cc:665] Check failed: dtype() == expected_dtype (2 vs. 1) float expected, got double
Aborted
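
For readers decoding the message: the numbers in "(2 vs. 1)" are TensorFlow DataType enum values, where 1 is DT_FLOAT and 2 is DT_DOUBLE. The failed check therefore means some code requested a float32 view of a tensor that actually holds float64 data. The sketch below pulls the two codes out of such a log line. The helper name decode_dtype_check is hypothetical, and the lookup table covers only a few enum values.

```python
import re

# Small subset of TensorFlow's DataType enum (types.proto); not exhaustive.
TF_DTYPE_CODES = {1: "DT_FLOAT", 2: "DT_DOUBLE", 3: "DT_INT32", 9: "DT_INT64"}


def decode_dtype_check(line):
    """Return (actual, expected) dtype names from a dtype 'Check failed' line.

    Returns None when the line does not contain the dtype check message.
    """
    match = re.search(r"dtype\(\) == expected_dtype \((\d+) vs\. (\d+)\)", line)
    if match is None:
        return None
    # The check prints "(lhs vs. rhs)": lhs is the tensor's real dtype,
    # rhs is the dtype the caller asked for.
    actual, expected = (int(group) for group in match.groups())
    return (
        TF_DTYPE_CODES.get(actual, "code %d" % actual),
        TF_DTYPE_CODES.get(expected, "code %d" % expected),
    )


if __name__ == "__main__":
    log_line = (
        "F tensorflow/core/framework/tensor.cc:665] Check failed: "
        "dtype() == expected_dtype (2 vs. 1) float expected, got double"
    )
    print(decode_dtype_check(log_line))
```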

Further Information, Files, and Links

WangXinyan940 commented 3 years ago

This bug still appears when using the C++ interface on the devel branch. Inspecting the graph shows that many layers take float64 inputs and outputs, even though DP is compiled in single precision. I'm not sure whether this is the cause of the bug or a deliberate design choice. I hope this information helps.

The graph was trained from the examples/water/se_e2_a case without any modification. DP was compiled without the HIGH_PREC flag. The frozen graph file and a script that prints its operators are attached.

water-se_e2_a.zip
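
The attached script is not reproduced in this thread. As a rough sketch of this kind of inspection, the code below walks a frozen GraphDef and tallies the dtype recorded on each node. It is not the attached script: the function names are hypothetical, only the "T" and "dtype" attributes are examined (ops such as Cast store their types under other names), and TensorFlow is imported lazily so the tallying logic can be used on its own.

```python
from collections import Counter

# Small subset of TensorFlow's DataType enum values; not exhaustive.
TF_DTYPE_NAMES = {1: "float32", 2: "float64", 3: "int32", 9: "int64"}


def node_dtypes(graph_def):
    """Yield (node name, op type, dtype name) for nodes with a type attribute."""
    for node in graph_def.node:
        for key in ("T", "dtype"):
            # Test membership first so the attribute map is not modified.
            if key in node.attr and node.attr[key].WhichOneof("value") == "type":
                code = node.attr[key].type
                yield node.name, node.op, TF_DTYPE_NAMES.get(code, "code %d" % code)


def tally(records):
    """Count how many (name, op, dtype) records fall under each dtype."""
    return Counter(dtype for _name, _op, dtype in records)


def load_graph_def(pb_path):
    """Read a frozen graph file into a GraphDef (requires TensorFlow)."""
    import tensorflow as tf  # imported here so the helpers above work without it

    graph_def = tf.compat.v1.GraphDef()
    with open(pb_path, "rb") as handle:
        graph_def.ParseFromString(handle.read())
    return graph_def
```

With TensorFlow installed, tally(node_dtypes(load_graph_def(path))) gives a per-dtype node count for the frozen model at path.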

amcadmus commented 3 years ago

Not really.

The FLOAT_PREC compile option only controls the floating-point precision used in the interfaces of deepmd-kit.

To set the precision inside the model itself, use the "precision" key in the descriptor and fitting net sections.
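
As an illustration of that suggestion, here is a minimal sketch that adds a "precision" entry to both sections of a training input. It assumes the input follows the layout of examples/water/se_e2_a/input.json, with "descriptor" and "fitting_net" under "model", and that "float32" is an accepted value. The helper name set_model_precision and the trimmed example dictionary are hypothetical.

```python
import json


def set_model_precision(config, precision="float32"):
    """Return a copy of a training config with "precision" set in both nets.

    Assumes config["model"] contains "descriptor" and "fitting_net" sections.
    """
    patched = json.loads(json.dumps(config))  # deep copy; input stays untouched
    patched["model"]["descriptor"]["precision"] = precision
    patched["model"]["fitting_net"]["precision"] = precision
    return patched


if __name__ == "__main__":
    # Trimmed, hypothetical stand-in for a real input.json.
    example = {"model": {"descriptor": {"type": "se_e2_a"}, "fitting_net": {}}}
    print(json.dumps(set_model_precision(example), indent=4))
```

Returning a patched copy keeps the original dictionary unchanged, which makes it easy to write both a single-precision and a double-precision input from the same base file.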

denghuilu commented 3 years ago

@amcadmus The reason for this error is that the _prepare_coord_nlist_gpu function in $deepmd_source_dir/source/op/prod_env_mat_multi_device.cc has a bug in its support for single precision.

Here is how I located it, by adding print statements around the call:

std::cout << "I'm in prod_env_mat_a 5!" << std::endl;

// prepare coord and nlist
_prepare_coord_nlist_gpu<FPTYPE>(
      context, &tensor_list[0], &coord, coord_cpy, &type, type_cpy, idx_mapping, 
      gpu_inlist, ilist, numneigh, firstneigh, jlist, nbor_list_dev,
      frame_nall, mem_cpy, mem_nnei, max_nbor_size,
      box, mesh_tensor.flat<int>().data(), mesh_tensor_size, nloc, nei_mode, rcut_r, max_cpy_trial, max_nnei_trial);

std::cout << "I'm in prod_env_mat_a 6!" << std::endl;

The output is:

2021-05-11 12:11:57.966802: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 1700000000 Hz
I'm in prod_env_mat_a 1!
I'm in prod_env_mat_a 2!
I'm in prod_env_mat_a 3!
I'm in prod_env_mat_a 4!
I'm in prod_env_mat_a 5!
2021-05-11 12:11:58.288648: F tensorflow/core/framework/tensor.cc:665] Check failed: dtype() == expected_dtype (2 vs. 1) float expected, got double
Aborted (core dumped)

Marker 5 is printed but marker 6 is not, so the program fails inside _prepare_coord_nlist_gpu.

deepmodeling / deepmd-kit

[BUG] Single precision training error #565