deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] _An error in `prod_env_mat.cu`_ #675

Closed Ericwang6 closed 3 years ago

Ericwang6 commented 3 years ago

Summary

An error in /source/lib/src/cuda/prod_env_mat.cu

DeePMD-kit v2.0.0b0

When training on data for small organic molecules with the se_e2_a descriptor, the following error occurs: cuda assert: DeePMD-kit: illegal nbor list sorting /home/yingze/deepmd-kit/source/lib/src/cuda/prod_env_mat.cu 509.

My input.json :

{
    "model": {
        "type_map": [
            "C",
            "H",
            "N",
            "O"
        ],
        "descriptor": {
            "type": "se_e2_a",
            "sel": [
                48,
                40,
                48,
                48
            ],
            "rcut_smth": 0.5,
            "rcut": 6.0,
            "neuron": [
                20,
                40,
                80
            ],
            "resnet_dt": false,
            "axis_neuron": 8,
            "type_one_side": true,
            "seed": 1,
            "activation_function": "gelu"
        },
        "fitting_net": {
            "neuron": [
                240,
                240,
                240
            ],
            "resnet_dt": true,
            "seed": 1,
            "activation_function": "gelu"
        }
    },
    "learning_rate": {
        "type": "exp",
        "start_lr": 0.0001,
        "stop_lr": 5e-8,
        "decay_steps": 500
    },
    "loss": {
        "type": "ener",
        "start_pref_e": 0.02,
        "limit_pref_e": 10,
        "start_pref_f": 1000,
        "limit_pref_f": 1,
        "start_pref_v": 0,
        "limit_pref_v": 0
    },
    "training": {
        "numb_steps": 100000,
        "disp_file": "lcurve.out",
        "disp_freq": 1000,
        "numb_test": 1,
        "save_freq": 1000,
        "save_ckpt": "model.ckpt",
        "disp_training": true,
        "time_training": true,
        "training_data": {
            "batch_size": "auto",
            "systems": [
                "./C0H0N0O2",
                "./C1H3N1O1"
            ]
        }
    }
}

The program is run on a PBS system.

Steps to Reproduce

An example of relevant data is attached here: issue.zip

njzjz commented 3 years ago

#742 reported the same bug.

iProzd commented 3 years ago

This bug occurs when:

  1. gelu is used as the activation function (GPU environment), and
  2. one or more atom types listed in type_map do not appear in the system.

In that case gelu.cu receives an empty input and fails.
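As a rough way to check whether a given setup meets the two conditions above before launching training, something like the following Python sketch could be used. It is not part of DeePMD-kit: the helper names (`missing_types`, `uses_gelu`, `systems_at_risk`) are hypothetical, and the per-atom type indices in the example are made up for illustration, assuming the directory names in the input above encode each system's composition. In practice the indices would come from each system's own type file.

```python
from typing import Dict, List, Sequence


def missing_types(type_map: Sequence[str], atom_types: Sequence[int]) -> List[str]:
    """Return the type_map entries that no atom in the system uses."""
    present = set(atom_types)
    return [name for idx, name in enumerate(type_map) if idx not in present]


def uses_gelu(config: dict) -> bool:
    """True if the descriptor or the fitting net is configured with gelu."""
    model = config.get("model", {})
    return any(
        model.get(section, {}).get("activation_function") == "gelu"
        for section in ("descriptor", "fitting_net")
    )


def systems_at_risk(
    config: dict, system_types: Dict[str, Sequence[int]]
) -> Dict[str, List[str]]:
    """Map each system that meets both conditions to its missing type names."""
    if not uses_gelu(config):
        return {}
    type_map = config["model"]["type_map"]
    report = {}
    for name, atom_types in system_types.items():
        absent = missing_types(type_map, atom_types)
        if absent:
            report[name] = absent
    return report


if __name__ == "__main__":
    # Trimmed-down version of the input.json from the report.
    config = {
        "model": {
            "type_map": ["C", "H", "N", "O"],
            "descriptor": {"activation_function": "gelu"},
            "fitting_net": {"activation_function": "gelu"},
        }
    }
    # Hypothetical per-atom type indices (assumption, not taken from issue.zip).
    system_types = {
        "./C0H0N0O2": [3, 3],
        "./C1H3N1O1": [0, 1, 1, 1, 2, 3],
    }
    for system, absent in systems_at_risk(config, system_types).items():
        print(f"{system}: types in type_map but absent from system: {absent}")
```

With the made-up indices above, only `./C0H0N0O2` is flagged, since it would contain oxygen alone while type_map also lists C, H, and N.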