deepmodeling / dpgen

The deep potential generator: a tool for building deep-learning based models of interatomic potential energy and force fields
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

[BUG] Issue with DP-GEN workflow for DPA-2 model using PyTorch backend #1654

Open chenggoj opened 2 weeks ago

chenggoj commented 2 weeks ago

Bug summary

Dear DeePMD community,

I'm encountering an issue while using the DP-GEN workflow with the DPA-2 model and PyTorch backend. Here are the details:

Environment:

DeePMD-kit version: 3.0.0b4-GPU-py3.9-cuda120
Model: DPA-2
Backend: PyTorch
Workflow control: DP-GEN
Issue Description:

In my machine.json file, I'm using parallel training with the following command:

"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt"

The training phase completes successfully for all four models. Each model directory contains the expected output files, including "*_task_tag_finished" and "frozen_model.pth":

├── 000
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── f74eaa2be2cab187505b354f787e5e5530d141f4_task_tag_finished
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/000/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 001
│   ├── 84f1c8acd2f9dc640b2fea97f8aad68396a0fc93_task_tag_finished
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/001/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 002
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── dpdispatcher.log
│   ├── e193485d0db3952cdb32f6406c9580c43f010989_task_tag_finished
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/002/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 003
│   ├── 19f28cb5828301f7434aaed206c3956f6890eb78_task_tag_finished
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/003/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log

However, the workflow stops at the model_devi stage with the following error:

FileNotFoundError: cannot find download file frozen_model.pb

I believe DP-GEN looks for "frozen_model.pb" (the TensorFlow format) by default, which is not compatible with the PyTorch model "frozen_model.pth". When I manually attempt to convert the format using:

dp convert-backend frozen_model.pth frozen_model.pb

I receive another error:

RuntimeError: Unknown descriptor type: dpa2. Did you mean: dpa1?

Analysis: It appears that the DPA-2 model currently only supports the PyTorch backend and cannot be converted to the TensorFlow format (frozen_model.pb). This prevents me from proceeding with the subsequent DP-GEN steps for the DPA-2 model.

Questions:

1. Is there a way to configure DP-GEN to work with PyTorch's "frozen_model.pth" for the DPA-2 model?
2. Are there plans to support the TensorFlow backend or format conversion for the DPA-2 model in future releases?
3. Is there an alternative workflow or workaround to use the DPA-2 model with DP-GEN?

Any guidance or suggestions would be greatly appreciated. Thank you for your time and assistance.

DeePMD-kit Version

3.0.0b4

Backend and its version

PyTorch 2.1.2

How did you download the software?

conda

Input Files, Running Commands, Error Log, etc.

machine.json

"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt",

Steps to Reproduce

Use DPA-2 model in DP-GEN.

Further Information, Files, and Links

No response

njzjz commented 2 weeks ago

Have you set train_backend to pytorch? Note this option has not been released in a stable version.

chenggoj commented 2 weeks ago

> Have you set train_backend to pytorch? Note this option has not been released in a stable version.

Oh, I did not notice that before. Now I see:

def _get_model_suffix(jdata) -> str:
    """Return the model suffix based on the backend."""
    mlp_engine = jdata.get("mlp_engine", "dp")
    if mlp_engine == "dp":
        suffix_map = {"tensorflow": ".pb", "pytorch": ".pth"}
        backend = jdata.get("train_backend", "tensorflow")
        if backend in suffix_map:
            suffix = suffix_map[backend]
        else:
            raise ValueError(
                f"The backend {backend} is not available. Supported backends are: 'tensorflow', 'pytorch'."
            )
        return suffix
    else:
        raise ValueError(f"Unsupported engine: {mlp_engine}")

Now I have set it:

{
    "type_map": ["Al", "O", "Pt"],
    "mass_map": [27, 16, 195],
    "init_data_prefix": "../",
    "init_data_sys": [
        "init/data/data_SA",
        "init/data/data_NP",
        "init/data/data_mix",
        "init/data/data_NP_gamma-Al2O3_001"
    ],
    "sys_configs_prefix": "../",
    "sys_configs": [
        ["init/model_devi/POSCAR_SA"],
        ["init/model_devi/POSCAR_NP"],
        ["init/model_devi/POSCAR_mix"],
        ["init/model_devi/POSCAR_gamma-Al2O3_001"]
    ],
    "_comment": " that's all ",
    "numb_models": 4,
    "train_backend": "pytorch",
    "default_training_param": {
        .......

But it is still not working. I get the same error: FileNotFoundError: cannot find download file ........frozen_model.pb

njzjz commented 1 week ago

Which commit of DP-GEN do you use?
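
(For reference, one way to check which DP-GEN snapshot is actually installed; this is just a sketch using the standard importlib.metadata lookup, and for an install made from a git checkout the reported version string may encode the commit.)

from importlib.metadata import version

# Reads the metadata of the installed "dpgen" package; a development install
# built from a git checkout typically reports a dev version that includes
# commit information.
print(version("dpgen"))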