RFC: Refactor DPGEN2 with a new design

link89 commented 7 months ago

Hi community,

This RFC proposes refactoring the DPGEN2 workflow with a new design based on DFlow.

A typical DPGEN2 configuration looks like this: https://github.com/deepmodeling/dpgen2/blob/master/examples/chno/input.json

IMHO, there are some issues with this configuration:

1. The context configuration (executor, container, etc.) is mixed with the algorithm configuration.
2. Such a configuration is hard to validate with a tool like pydantic, which makes it error-prone.
3. Data files cannot carry their own configuration, which makes it hard to train on different systems at the same time.

The pseudo configuration below is a suggested design that borrows some ideas from the ai2-kit project. It is meant to be more formal and easier to maintain.

# executor configuration
executor:
  bohrium: ...

# dflow configuration for each software
dflow:
  python:
    container: ai2-kit/0.12.10
    python_cmd: python3
  deepmd:
    container: deepmd/2.7.1
    dp_cmd: dp
  lammps:
    container: deepmd/2.7.1
    lammps_cmd: lmp
  cp2k:
    container: cp2k/2023.1
    cp2k_cmd: mpirun cp2k.psmp

# declare file resources as datasets before using them
# so that we can assign extra attributes to them
datasets:
  dpdata-Ni13Pd12:
    url: /path/to/data
    format: deepmd/npy

  sys-Ni13Pd12:
    url: /path/to/data
    includes: POSCAR*
    format: vasp
    attrs:
      # allow users to define system-wise configuration
      # so that we can explore multiple types of systems in an iteration
      lammps:
        plumed_config: !load_text plumed.inp # use custom yaml tags to embed data from other file
      cp2k:
        input_template: !load_text cp2k.inp

workflow:
  general:
    type_map: [C, O, H]
    mass_map: [12, 16, 1]
    max_iters: 5

  train:
    deepmd:
      init_dataset: [dpdata-Ni13Pd12]
      input_template: !load_yaml deepmd.json  # use custom yaml tags to embed data from other file

  explore:
    # instead of using `type: lammps` to specify the software,
    # use a dedicated entry for each software of the same stage
    # so that we can use pydantic to validate the configuration items
    # and get a better code structure:
    # https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L163-L293
    lammps:
      nsteps: 10
      systems: [ sys-Ni13Pd12 ]  # reference dataset via key
      # support different variable combination strategies to avoid combinatorial explosion
      # vars defined in `explore_vars` are combined with system_files via Cartesian product
      # vars defined in `broadcast_vars` are just broadcast to system_files
      # this design is useful if there are a lot of files
      explore_vars:
        TEMP: [330, 430, 530]
      broadcast_vars:
        LAMBDA_f: [0.0, 0.25, 0.5, 0.75, 1.0]
      template_vars:
        POST_INIT: |
          neighbor bin 2.0
      plumed_config: !load_text plumed.inp

  # isolate the select stage from explore so that we can implement more complex structure selection algorithms
  select:
    model_devi:
      decent_f: [0.12, 0.18]
    limit: 50

  label:
    cp2k:
      input_template: !load_text cp2k.inp

next:
  # specify the configuration for the next iteration
  # it will be merged with the current configuration into a new configuration file for the next round
  config: !load_yaml iter-001.yml
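
The two combination strategies described in the `explore.lammps` comments can be sketched in Python as follows. This is a minimal illustration, not ai2-kit or DPGEN2 code: `combine_vars` is a hypothetical helper, the file names are placeholders, and it assumes that broadcast values are handed out round-robin across the generated tasks, which the comments do not spell out.

```python
from itertools import cycle, product


def combine_vars(system_files, explore_vars, broadcast_vars):
    """Build one task per (system file, explore_vars combination).

    Assumption: broadcast values are assigned round-robin, so they
    never increase the number of tasks.
    """
    names = list(explore_vars)
    # Cartesian product: every system file meets every explore_vars combination.
    combos = product(system_files, *(explore_vars[n] for n in names))
    # One independent round-robin iterator per broadcast variable.
    cyclers = {name: cycle(values) for name, values in broadcast_vars.items()}
    tasks = []
    for system_file, *values in combos:
        task = {"system_file": system_file, **dict(zip(names, values))}
        for name, cycler in cyclers.items():
            task[name] = next(cycler)
        tasks.append(task)
    return tasks


tasks = combine_vars(
    system_files=["POSCAR-0", "POSCAR-1"],  # placeholder file names
    explore_vars={"TEMP": [330, 430, 530]},
    broadcast_vars={"LAMBDA_f": [0.0, 0.25, 0.5, 0.75, 1.0]},
)
print(len(tasks))
```

With 2 system files and 3 temperatures this yields 6 tasks. Keeping `LAMBDA_f` in `broadcast_vars` leaves the count at 6, whereas moving it to `explore_vars` would give 30.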

The above configuration is easy to validate with pydantic, for example: https://github.com/chenggroup/ai2-kit/blob/main/ai2_kit/workflow/cll_mlp.py#L32-L111
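
To illustrate why a dedicated entry per software is easier to validate than a `type:` switch, here is a minimal sketch. It uses standard-library dataclasses as a stand-in for pydantic so that it runs without extra packages; the class and function names (`LammpsExplore`, `Explore`, `parse_explore`) are hypothetical and are not taken from ai2-kit or DPGEN2.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LammpsExplore:
    nsteps: int
    systems: List[str]
    explore_vars: dict = field(default_factory=dict)
    broadcast_vars: dict = field(default_factory=dict)

    def __post_init__(self):
        if not isinstance(self.nsteps, int) or self.nsteps <= 0:
            raise ValueError("explore.lammps.nsteps must be a positive integer")


@dataclass
class Explore:
    # One optional, dedicated field per software instead of `type: lammps`.
    lammps: Optional[LammpsExplore] = None


def parse_explore(raw: dict, dataset_keys: set) -> Explore:
    """Turn the raw `workflow.explore` mapping into typed objects."""
    unknown = set(raw) - {"lammps"}
    if unknown:
        raise ValueError(f"unsupported explore software: {sorted(unknown)}")
    lammps = None
    if "lammps" in raw:
        # A misspelled key raises TypeError: the dataclass has no such field.
        lammps = LammpsExplore(**raw["lammps"])
        missing = [key for key in lammps.systems if key not in dataset_keys]
        if missing:
            raise ValueError(f"undeclared datasets: {missing}")
    return Explore(lammps=lammps)


explore = parse_explore(
    {"lammps": {"nsteps": 10, "systems": ["sys-Ni13Pd12"],
                "explore_vars": {"TEMP": [330, 430, 530]}}},
    dataset_keys={"dpdata-Ni13Pd12", "sys-Ni13Pd12"},
)
print(explore.lammps.nsteps)
```

Because each software has its own class, a typo such as `nstep` or a reference to an undeclared dataset fails when the configuration is loaded rather than in the middle of a run.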

I believe a better configuration design will lead to a better software design. I am posting my thoughts for the community to review, and any feedback would be appreciated.

zjgemi commented 6 months ago

For the first point, it is quite easy to put the machine-related configurations together or in a separate file. In my opinion, this is a minor point.

For the second point, dpgen2 uses dargs for configuration checking and validation as well as automatic documentation generation (please refer to https://docs.deepmodeling.com/projects/dargs/en/stable/). It can play a role similar to pydantic and supports some custom features. Is there anything that does not meet your requirements?

For the third point, dpgen2 supports multiple datasets using different LAMMPS input templates during the configuration exploration phase. For example:

    "configurations":   [
        {
        "type": "alloy",
        "lattice" : ["fcc", 4.57],
        "replicate" : [2, 2, 2],
        "numb_confs" : 30,
        "concentration" : [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
        },
        {
        "type" : "file",
        "prefix": "/file/prefix",
        "files" : ["relpath/to/confs/*"],
        "fmt" : "deepmd/npy"
        }
    ],
    "stages":   [
        [
        {
            "_comment" : "stage 0, task group 0",
            "type" : "lmp-md",
            "ensemble": "nvt", "nsteps":  50, "temps": [50, 100], "trj_freq": 10,
            "conf_idx": [0], "n_sample" : 3
        },
        {
            "_comment" : "stage 0, task group 1",
            "type" : "lmp-template",
            "lmp" : "template.lammps", "plm" : "template.plumed",
            "trj_freq" : 10, "revisions" : {"V_NSTEPS" : [40], "V_TEMP" : [150, 200]},
            "conf_idx": [0], "n_sample" : 3
        }
        ],
        [
        {
            "_comment" : "stage 1, task group 0",
            "type" : "lmp-md",
            "ensemble": "npt", "nsteps":  50, "press": [1e0], "temps": [50, 100, 200], "trj_freq": 10,
            "conf_idx": [1], "n_sample" : 3
        }
        ]
    ]

Here, you can use different LAMMPS template files for different conf_idx. Is there anything that does not meet your requirements?
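
As a rough illustration of how `conf_idx` ties a task group to an entry in `configurations`, the sketch below resolves the indices and enumerates the parameter combinations of each group. It is not dpgen2's actual implementation: `expand_group` is a hypothetical helper, the input is a trimmed copy of the example above, and it assumes that `temps`, `press`, and `revisions` are each expanded as a Cartesian product.

```python
from itertools import product

configurations = [
    {"type": "alloy", "lattice": ["fcc", 4.57]},
    {"type": "file", "files": ["relpath/to/confs/*"]},
]
stages = [
    [
        {"type": "lmp-md", "temps": [50, 100], "conf_idx": [0]},
        {"type": "lmp-template", "lmp": "template.lammps",
         "revisions": {"V_NSTEPS": [40], "V_TEMP": [150, 200]},
         "conf_idx": [0]},
    ],
    [
        {"type": "lmp-md", "temps": [50, 100, 200], "press": [1e0],
         "conf_idx": [1]},
    ],
]


def expand_group(group):
    """Yield (configuration type, parameters) pairs for one task group."""
    if group["type"] == "lmp-template":
        params = group["revisions"]
    else:
        params = {key: group[key] for key in ("temps", "press") if key in group}
    names = list(params)
    for idx in group["conf_idx"]:
        conf = configurations[idx]  # conf_idx points into `configurations`
        for values in product(*(params[n] for n in names)):
            yield conf["type"], dict(zip(names, values))


for stage_no, stage in enumerate(stages):
    for group_no, group in enumerate(stage):
        for conf_type, params in expand_group(group):
            print(stage_no, group_no, conf_type, params)
```

In this reading, the template file is chosen per task group, and a group reaches a given set of structures only through `conf_idx`.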

link89 commented 6 months ago

Hi @zjgemi, system-wise configuration is required not only for LAMMPS but also for CP2K. You can check the details in the pseudo configuration.
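
A minimal sketch of what system-wise configuration for both LAMMPS and CP2K could mean in practice: each dataset carries optional `attrs`, and every stage looks up the settings for the system it is processing. The `resolve` helper, the `sys-other` dataset, and the rule that dataset attrs override stage-level defaults are assumptions made for illustration; the strings stand in for the contents loaded by `!load_text`.

```python
datasets = {
    "sys-Ni13Pd12": {
        "url": "/path/to/data",
        "attrs": {
            "lammps": {"plumed_config": "<plumed.inp for Ni13Pd12>"},
            "cp2k": {"input_template": "<cp2k.inp for Ni13Pd12>"},
        },
    },
    # Hypothetical second dataset that declares no attrs of its own.
    "sys-other": {"url": "/path/to/other"},
}
workflow = {
    "explore": {"lammps": {"nsteps": 10, "plumed_config": "<default plumed.inp>"}},
    "label": {"cp2k": {"input_template": "<default cp2k.inp>"}},
}


def resolve(stage, software, dataset_key):
    """Merge stage-level defaults with the attrs declared on one dataset."""
    defaults = workflow[stage][software]
    overrides = datasets[dataset_key].get("attrs", {}).get(software, {})
    # Assumed precedence: dataset attrs win over stage defaults.
    return {**defaults, **overrides}


print(resolve("label", "cp2k", "sys-Ni13Pd12")["input_template"])
print(resolve("label", "cp2k", "sys-other")["input_template"])
```

The same lookup serves the explore and label stages, so one iteration can mix systems that need different PLUMED and CP2K inputs.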

deepmodeling / dpgen2

RFC: Refactor DPGEN2 with a new design #185