Open chenggoj opened 2 weeks ago
Have you set `train_backend` to `pytorch`? Note that this option has not been released in a stable version.
Oh, I did not notice that before. Now I see how it is handled in the code:
def _get_model_suffix(jdata) -> str:
    """Return the model suffix based on the backend."""
    mlp_engine = jdata.get("mlp_engine", "dp")
    if mlp_engine == "dp":
        suffix_map = {"tensorflow": ".pb", "pytorch": ".pth"}
        backend = jdata.get("train_backend", "tensorflow")
        if backend in suffix_map:
            suffix = suffix_map[backend]
        else:
            raise ValueError(
                f"The backend {backend} is not available. Supported backends are: 'tensorflow', 'pytorch'."
            )
        return suffix
    else:
        raise ValueError(f"Unsupported engine: {mlp_engine}")
Now, I set it.
{
    "type_map": ["Al", "O", "Pt"],
    "mass_map": [27, 16, 195],
    "init_data_prefix": "../",
    "init_data_sys": [
        "init/data/data_SA",
        "init/data/data_NP",
        "init/data/data_mix",
        "init/data/data_NP_gamma-Al2O3_001"
    ],
    "sys_configs_prefix": "../",
    "sys_configs": [
        ["init/model_devi/POSCAR_SA"],
        ["init/model_devi/POSCAR_NP"],
        ["init/model_devi/POSCAR_mix"],
        ["init/model_devi/POSCAR_gamma-Al2O3_001"]
    ],
    "_comment": " that's all ",
    "numb_models": 4,
    "train_backend": "pytorch",
    "default_training_param": {
    .......
But it is still not working. I get the same error: `FileNotFoundError: cannot find download file ........frozen_model.pb`
Which commit of DP-GEN are you using?
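One way to answer this from inside the Python environment is to read the installed package metadata. This is only a sketch: it assumes the distributions are named `dpgen` and `deepmd-kit`, and the version string will only contain a commit hint if the package was built in a way that embeds one.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# The distribution names below are assumptions, not taken from this thread.
for name in ("dpgen", "deepmd-kit"):
    print(name, installed_version(name))
```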
Bug summary
Dear DeePMD community,
I'm encountering an issue while using the DP-GEN workflow with the DPA-2 model and PyTorch backend. Here are the details:
Environment:
In my machine.json file, I'm using parallel training with the following command: `"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt"`
The training phase completes successfully for all four models. Each model directory contains the expected output files, including `*_task_tag_finished` and `frozen_model.pth`.
├── 000
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── f74eaa2be2cab187505b354f787e5e5530d141f4_task_tag_finished
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/000/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 001
│   ├── 84f1c8acd2f9dc640b2fea97f8aad68396a0fc93_task_tag_finished
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/001/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 002
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── dpdispatcher.log
│   ├── e193485d0db3952cdb32f6406c9580c43f010989_task_tag_finished
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/002/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
├── 003
│   ├── 19f28cb5828301f7434aaed206c3956f6890eb78_task_tag_finished
│   ├── checkpoint
│   ├── dpa2.hdf5
│   ├── frozen_model.pth
│   ├── input.json -> /expanse/projects/qstore/mia344/cjiang1/NNPs-traing-DeepMD-kit/DP-GEN/run_test_DPA2_2/iter.000000/00.train/003/input.json
│   ├── input_v2_compat.json
│   ├── lcurve.out
│   ├── model.ckpt-100.pt
│   ├── model.ckpt-200.pt
│   ├── model.ckpt-300.pt
│   ├── model.ckpt.pt -> model.ckpt-300.pt
│   ├── out.json
│   └── train.log
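For anyone reproducing this, a small sketch like the following can list which `frozen_model.*` files each training task actually produced. The helper name `frozen_models` is hypothetical, and the relative path `iter.000000/00.train` is taken from the symlink targets in the listing and assumes the script is run from the DP-GEN work directory.

```python
from pathlib import Path
from typing import Dict, List


def frozen_models(train_dir: str) -> Dict[str, List[str]]:
    """Map each task directory (000, 001, ...) to its frozen_model.* file names."""
    found = {}
    for task in sorted(Path(train_dir).iterdir()):
        if task.is_dir():
            found[task.name] = sorted(p.name for p in task.glob("frozen_model.*"))
    return found


if __name__ == "__main__":
    # Assumed layout: run from the DP-GEN work directory.
    train_dir = Path("iter.000000/00.train")
    if train_dir.is_dir():
        for task, files in frozen_models(str(train_dir)).items():
            print(task, files)
```

If every task reports only `frozen_model.pth`, a later step that asks for `frozen_model.pb` cannot find its file.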
However, the workflow stops at the model_devi stage with the following error:
`FileNotFoundError: cannot find download file frozen_model.pb`
I believe DP-GEN is looking for `frozen_model.pb` (TensorFlow format) by default, but it's not compatible with the PyTorch model `frozen_model.pth`. When I manually attempt to convert the format using `dp convert-backend frozen_model.pth frozen_model.pb`, I receive another error:
`RuntimeError: Unknown descriptor type: dpa2. Did you mean: dpa1?`
Analysis: It appears that the DPA-2 model currently only supports PyTorch and cannot be converted to the TensorFlow format (`frozen_model.pb`). This prevents me from proceeding with subsequent DP-GEN operations for the DPA-2 model.
Questions:
1. Is there a way to configure DP-GEN to work with PyTorch's `frozen_model.pth` for the DPA-2 model?
2. Are there plans to support the TensorFlow backend or format conversion for the DPA-2 model in future releases?
3. Is there an alternative workflow or workaround to use the DPA-2 model with DP-GEN?
Any guidance or suggestions would be greatly appreciated. Thank you for your time and assistance.
DeePMD-kit Version
3.0.0b4
Backend and its version
PyTorch 2.1.2
How did you download the software?
conda
Input Files, Running Commands, Error Log, etc.
machine.json
"command": "torchrun --nnodes=1 --nproc_per_node=auto dp --pt",
Steps to Reproduce
Use DPA-2 model in DP-GEN.
Further Information, Files, and Links
No response