deepmodeling / deepmd-kit

A deep learning package for many-body potential energy representation and molecular dynamics
https://docs.deepmodeling.com/projects/deepmd/
GNU Lesser General Public License v3.0

[BUG] the model cannot be restarted after finetuning #3986

Closed kaiafeng0815 closed 2 weeks ago

kaiafeng0815 commented 1 month ago

Bug summary

I am using the 2024Q1 version to train the DPA2 model and found that the finetuned model cannot be restarted for further training, whereas the model trained from scratch can be restarted. This limits the use of finetuning.

DeePMD-kit Version

2024q1

Backend and its version

TensorFlow v2.15.0.

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Input file: input_torch_multitask_test.json

Running Commands:

    dp --pt train input_torch_multitask_easy.json --finetune 2024Q1.pt
    dp --pt train input_torch_multitask_easy.json --restart model.ckpt.pt

Error log: Error Log.md

Steps to Reproduce

Finetune the DPA2 model with multitask training and then restart from the checkpoint. The data is arbitrary and can be swapped out for any data set.

Further Information, Files, and Links

No response

njzjz commented 1 month ago

In the 2024Q1 branch, the parameters from 2024Q1.pt will override those in input_torch_multitask_easy.json. Please use the generated out.json instead of input_torch_multitask_easy.json:

dp --pt train out.json --restart model.ckpt.pt
kaiafeng0815 commented 1 month ago

After running dp --pt train out.json --restart model.ckpt.pt, I still encounter errors related to parameters. error log2.md

njzjz commented 1 month ago

After running dp --pt train out.json --restart model.ckpt.pt, I still encounter errors related to parameters.

error log2.md

cc @iProzd

iProzd commented 1 month ago

@kaiafeng0815 It's kind of confusing but let me try to explain the situation:

  1. Please make sure you are aware that you are doing multi-task finetuning, which is different from traditional (single-task) finetuning. Also, you did not add finetune_head to pureknachar in model_dict, so a randomly initialized fitting net will be used for that branch. See here for details.

  2. For the finetuning process in the 2024Q1 branch, the model parameters are indeed overridden by those in the pretrained model and are not saved in out.json. This has been fixed in the devel branch. For now, you can use the following code to extract the model params from pretrained.pt:

    import torch

    # load the pretrained checkpoint and read the model params stored in its extra state
    model_state = torch.load('pretrained.pt')
    model_param = model_state['model']['_extra_state']['model_params']['shared_dict']
    print(model_param)

    Then you can manually replace the model params in the shared_dict of your input_torch_multitask_test.json and restart from the checkpoint (a sketch of this replacement is given after this list).

  3. You can also use the devel branch once the pretrained model is released (soon), with one caveat: in the devel branch, if you do not know the model params of the pretrained model, you must add --use-pretrain-script to the finetuning command (see here for details), and the model params will then be saved into out.json. After that, when you are doing traditional (single-task) finetuning, you can use out.json for restarting. But note that when you are doing multi-task finetuning like this, you cannot use out.json directly. Instead, you must copy the same model params in shared_dict from out.json into your input, because the multi-task input must keep the unfilled model_dict rather than the filled one in out.json, which cannot be used to generate the sharing rules for multi-task training.
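
A minimal sketch of the shared_dict replacement described in point 2, assuming the multi-task input keeps shared_dict directly under model (as in the multi-task examples); the output file name input_torch_multitask_restart.json is only illustrative:

    import json

    import torch

    # load the pretrained checkpoint on CPU and pull out the shared model params,
    # exactly as in the snippet above
    model_state = torch.load("pretrained.pt", map_location="cpu")
    shared_dict = model_state["model"]["_extra_state"]["model_params"]["shared_dict"]

    # read the multi-task input, overwrite its shared_dict with the pretrained one,
    # and write the result to a new file used for restarting
    with open("input_torch_multitask_test.json") as f:
        data = json.load(f)
    data["model"]["shared_dict"] = shared_dict

    with open("input_torch_multitask_restart.json", "w") as f:
        json.dump(data, f, indent=4)

The same copy applies to the devel-branch workflow in point 3, with the shared_dict in out.json used as the source instead of the checkpoint.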