deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] `dp change-bias` will give a much larger model #4348

Open QuantumMisaka opened 2 weeks ago

QuantumMisaka commented 2 weeks ago

Bug summary

Starting from a pre-trained multi-head model, dp --pt change-bias gives a model with a much larger file size. However, fine-tuning with numb_steps: 0 has no such problem:

(base) [2201110432@wm2-login01 fine2]$ ll -h
total 465M
-rw-rw-r-- 1 2201110432 2201110432   24 Nov 13 15:46 checkpoint
lrwxrwxrwx 1 2201110432 2201110432   27 Nov 13 15:36 dpa230m.pt -> DPA2_medium_28_10M_beta4.pt
-rw-rw-r-- 1 2201110432 2201110432 338M Nov 13 15:45 dpa230m_updated.pt
-rw-rw-r-- 1 2201110432 2201110432  800 Nov 13 15:46 dpa2.hdf5
-rw-rw-r-- 1 2201110432 2201110432 119M Nov 13 15:35 DPA2_medium_28_10M_beta4.pt
-rw-rw-r-- 1 2201110432 2201110432 108K Nov 13 15:46 dpfine_4279321.err
-rw-rw-r-- 1 2201110432 2201110432    0 Nov 13 15:43 dpfine_4279321.out
-rw-r--r-- 1 2201110432 2201110432  692 Nov 13 15:43 fine.slurm
-rw-rw-r-- 1 2201110432 2201110432 2.4K Nov 13 15:36 input.json
-rw-rw-r-- 1 2201110432 2201110432 3.0K Nov 13 15:45 input_v2_compat.json
-rw-rw-r-- 1 2201110432 2201110432    0 Nov 13 15:46 lcurve.out
-rw-rw-r-- 1 2201110432 2201110432 7.9M Nov 13 15:46 model_finetune.ckpt-0.pt
lrwxrwxrwx 1 2201110432 2201110432   24 Nov 13 15:46 model_finetune.ckpt.pt -> model_finetune.ckpt-0.pt
-rw-rw-r-- 1 2201110432 2201110432 4.8K Nov 13 15:45 out.json

The model after change-bias, dpa230m_updated.pt, is much larger than even the original model, while the 0-step fine-tuned model model_finetune.ckpt-0.pt is much smaller, which is what I want.

Also, when trying to load the model after change-bias, a head has to be selected, which is also not what I want:

In [1]: from deepmd.infer.deep_pot import DeepPot

In [2]: model = DeepPot("dpa230m_updated.pt")
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
/data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py:110: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(model_file, map_location=env.DEVICE)
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 model = DeepPot("dpa230m_updated.pt")

File /data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/infer/deep_eval.py:334, in DeepEval.__init__(self, model_file, auto_batch_size, neighbor_list, *args, **kwargs)
    326 def __init__(
    327     self,
    328     model_file: str,
   (...)
    332     **kwargs: Any,
    333 ) -> None:
--> 334     self.deep_eval = DeepEvalBackend(
    335         model_file,
    336         self.output_def,
    337         *args,
    338         auto_batch_size=auto_batch_size,
    339         neighbor_list=neighbor_list,
    340         **kwargs,
    341     )
    342     if self.deep_eval.get_has_spin() and hasattr(self, "output_def_mag"):
    343         self.deep_eval.output_def = self.output_def_mag

File /data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py:121, in DeepEval.__init__(self, model_file, output_def, auto_batch_size, neighbor_list, head, *args, **kwargs)
    118 if isinstance(head, int):
    119     head = model_keys[0]
    120 assert (
--> 121     head is not None
    122 ), f"Head must be set for multitask model! Available heads are: {model_keys}"
    123 assert (
    124     head in model_keys
    125 ), f"No head named {head} in model! Available heads are: {model_keys}"
    126 self.input_param = self.input_param["model_dict"][head]

AssertionError: Head must be set for multitask model! Available heads are: ['Domains_Alloy', 'Domains_Anode', 'Domains_Cluster', 'Domains_Drug', 'Domains_FerroEle', 'Domains_OC2M', 'Domains_SSE-PBE', 'Domains_SemiCond', 'H2O_H2O-PD', 'Metals_AgAu-PBE', 'Metals_AlMgCu', 'Metals_Cu', 'Metals_Sn', 'Metals_Ti', 'Metals_V', 'Metals_W', 'Others_C12H26', 'Others_HfO2', 'Domains_ANI', 'Domains_SSE-PBESol', 'Domains_Transition1x', 'H2O_H2O-DPLR', 'H2O_H2O-PBE0TS-MD', 'H2O_H2O-PBE0TS', 'H2O_H2O-SCAN0', 'Metals_AgAu-PBED3', 'Others_In2Se3', 'MP_traj_v024_alldata_mixu']

whereas the 0-step fine-tuned model loads with no problem:

In [3]: model = DeepPot("model_finetune.ckpt-0.pt")
/data/softwares/miniconda3/envs/deepmd-3b4/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py:110: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(model_file, map_location=env.DEVICE)
You can use the environment variable DP_INFER_BATCH_SIZE tocontrol the inference batch size (nframes * natoms). The default value is 1024.

DeePMD-kit Version

v3.0.0b4

Backend and its version

pytorch 2.5.1

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

command for change-bias:

dp --pt change-bias dpa230m.pt -s ../../data-clean4_radsp/train --model-branch Domains_OC2M

command for 0-step finetune:

dp --pt train input.json --finetune dpa230m.pt --model-branch Domains_OC2M

corresponding input.json:

{
  "_comment": "that's all",
  "model": {
    "type_map": [
      "C",
      "Fe",
      "H",
      "O"
    ],
    "descriptor": {
      "type": "dpa2",
      "repinit": {
        "tebd_dim": 8,
        "rcut": 6.0,
        "rcut_smth": 0.5,
        "nsel": 120,
        "neuron": [
          25,
          50,
          100
        ],
        "axis_neuron": 12,
        "activation_function": "tanh",
        "three_body_sel": 40,
        "three_body_rcut": 4.0,
        "three_body_rcut_smth": 3.5,
        "use_three_body": true
      },
      "repformer": {
        "rcut": 4.0,
        "rcut_smth": 3.5,
        "nsel": 40,
        "nlayers": 6,
        "g1_dim": 128,
        "g2_dim": 32,
        "attn2_hidden": 32,
        "attn2_nhead": 4,
        "attn1_hidden": 128,
        "attn1_nhead": 4,
        "axis_neuron": 4,
        "update_h2": false,
        "update_g1_has_conv": true,
        "update_g1_has_grrg": true,
        "update_g1_has_drrd": true,
        "update_g1_has_attn": false,
        "update_g2_has_g1g1": false,
        "update_g2_has_attn": true,
        "update_style": "res_residual",
        "update_residual": 0.01,
        "update_residual_init": "norm",
        "attn2_has_gate": true,
        "use_sqrt_nnei": true,
        "g1_out_conv": true,
        "g1_out_mlp": true
      },
      "add_tebd_to_repinit_out": false
    },
    "fitting_net": {
      "neuron": [
        240,
        240,
        240
      ],
      "resnet_dt": true,
      "seed": 19090,
      "_comment": " that's all"
    },
    "_comment": " that's all"
  },
  "learning_rate": {
    "type": "exp",
    "decay_steps": 2000,
    "start_lr": 0.001,
    "stop_lr": 3.51e-08,
    "_comment": "that's all"
  },
  "loss": {
    "type": "ener",
    "start_pref_e": 0.02,
    "limit_pref_e": 1,
    "start_pref_f": 1000,
    "limit_pref_f": 1,
    "start_pref_v": 0,
    "limit_pref_v": 0,
    "_comment": " that's all"
  },
  "training": {
    "stat_file": "./dpa2.hdf5",
    "training_data": {
      "systems": "../../data-clean4_radsp/train/",
      "batch_size": "auto",
      "_comment": "that's all"
    },
    "numb_steps": 0,
    "warmup_steps": 0,
    "gradient_max_norm": 5.0,
    "max_ckpt_keep":20,
    "seed": 19090,
    "save_ckpt": "model_finetune.ckpt",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "save_freq": 20000,
    "_comment": "that's all"
  }
}

Steps to Reproduce

Run these commands on any dataset.

Further Information, Files, and Links

No response

iProzd commented 2 weeks ago

Finetuning with numb_steps: 0 will save only a single-head model, while change-bias will keep the multi-head model, which is expected.
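
For reference, judging from the assertion shown in deep_eval.py in the traceback above, the multi-head checkpoint can presumably still be loaded by selecting a branch explicitly. A minimal sketch, assuming the head keyword is forwarded from DeepPot to the PyTorch DeepEval as the traceback suggests:

from deepmd.infer.deep_pot import DeepPot

# assumption: `head` is passed through **kwargs to deepmd.pt.infer.deep_eval.DeepEval,
# as suggested by the traceback; pick one of the heads listed in the error message
model = DeepPot("dpa230m_updated.pt", head="Domains_OC2M")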

QuantumMisaka commented 2 weeks ago

Finetuning with numb_steps: 0 will save only a single-head model, while change-bias will keep the multi-head model, which is expected.

Thanks! However, the oversized change-bias model is still a problem.

njzjz commented 2 weeks ago

@QuantumMisaka could you post all keys in the checkpoint?

import torch

def get_all_keys(d, prefix=""):
    """Gets all keys from a nested dictionary with slash-separated paths."""
    keys = []
    for k, v in d.items():
        if isinstance(v, dict):
            keys.extend(get_all_keys(v, prefix + str(k) + "/"))
        else:
            keys.append(prefix + str(k))
    return keys

print(get_all_keys(torch.load("dpa230m.pt")))
print(get_all_keys(torch.load("dpa230m_updated.pt")))

QuantumMisaka commented 2 weeks ago

@njzjz They print the same results:

(base) [2201110432@wm2-data01 fine2]$ diff allkeys_base.txt allkeys_cbias.txt 
(base) [2201110432@wm2-data01 fine2]$

allkeys_base.txt allkeys_cbias.txt

njzjz commented 2 weeks ago

The reason should be the abuse of deepcopy:

https://github.com/deepmodeling/deepmd-kit/blob/058e0665c6da87577a5f69b9eff2057088934cd6/deepmd/pt/entrypoints/main.py#L394

(or the copy of tensors that happens in other places)
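
If so, the duplication should be visible in the checkpoints themselves. A minimal diagnostic sketch (my own helper, not part of deepmd-kit; it only assumes the two file names used above) that counts distinct tensor storages and their total size in each file; if the change-bias checkpoint reports noticeably more storage bytes for the same set of keys, shared tensors were materialized into independent copies:

import torch

def storage_stats(obj, seen=None):
    """Collect distinct tensor storages (keyed by data pointer) and their byte sizes."""
    if seen is None:
        seen = {}
    if isinstance(obj, torch.Tensor):
        storage = obj.untyped_storage()
        seen[storage.data_ptr()] = storage.nbytes()
    elif isinstance(obj, dict):
        for v in obj.values():
            storage_stats(v, seen)
    elif isinstance(obj, (list, tuple)):
        for v in obj:
            storage_stats(v, seen)
    return seen

for fname in ("dpa230m.pt", "dpa230m_updated.pt"):
    stats = storage_stats(torch.load(fname, map_location="cpu"))
    print(fname, len(stats), "storages,", sum(stats.values()) / 2**20, "MiB")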