ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Matrix shape mismatch for `Multihead` fine-tune #615

Closed MengnanCui closed 2 weeks ago

MengnanCui commented 1 month ago

Describe the bug

Hi, I want to do multihead finetuning on a personal pre-trained model (ptbp_model.model, based on version 0.3.7, main branch). After editing these commands

I got the following error messages. Do you have any idea what causes this problem? Do I have to pre-train a model with the latest code and then fine-tune it with the multihead approach? For your information, the pretrained model was trained on the dftb_ keys, and the finetuning uses the dft_ keys.

Start Training with MACE:{'seed': 2747, 'training': 'training.xyz', 'validation': '../../fixed_validation.xyz', 'test': '../../fixed_test.xyz', 'config_type_weights': '{"Default":1.0}', 'E0s': {74: -11.022250868182281}, 'model': 'MACE', 'hidden_irreps': '128x0e + 128x1o', 'r_max': 6.0, 'batch_size': 16, 'valid_batch_size': 16, 'max_num_epochs': 1000, 'start_swa': 750, 'energy_key': 'dft_energy', 'forces_key': 'dft_forces', 'default_dtype': 'float64', 'patience': 500, 'device': 'cuda', 'multitask': False, 'distributed': False, 'exc_path': 'mace_run_train', 'foundation_model': False}
2024-10-01 07:30:05.662 INFO: ===========VERIFYING SETTINGS===========
2024-10-01 07:30:05.662 INFO: MACE version: 0.3.7
2024-10-01 07:30:05.723 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-01 07:30:06.318 INFO: Using foundation model ../ptbp_model.model as initial checkpoint.
2024-10-01 07:30:06.319 INFO: ===========LOADING INPUT DATA===========
2024-10-01 07:30:06.319 INFO: Using heads: ['default']
2024-10-01 07:30:06.319 INFO: =============    Processing head default     ===========
2024-10-01 07:30:06.381 INFO: Training set [100 configs, 100 energy, 4761 forces] loaded from 'training.xyz'
2024-10-01 07:30:06.686 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../fixed_validation.xyz'
2024-10-01 07:30:06.990 INFO: Test set (1000 configs) loaded from '../../fixed_test.xyz':
2024-10-01 07:30:06.991 INFO: Default_Default: 1000 configs, 1000 energy, 46560 forces
2024-10-01 07:30:06.991 INFO: Total number of configurations: train=100, valid=1000, tests=[Default_Default: 1000],
2024-10-01 07:30:06.991 INFO: ==================Using multiheads finetuning mode==================
2024-10-01 07:30:06.991 INFO: Using foundation model for multiheads finetuning with ../../../transferability7k/training.xyz
2024-10-01 07:30:09.239 INFO: Training set [7642 configs, 7642 energy, 380589 forces] loaded from '../../../transferability7k/training.xyz'
2024-10-01 07:30:09.722 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../../transferability7k/validation.xyz'
2024-10-01 07:30:09.722 INFO: Total number of configurations: train=7642, valid=1000
2024-10-01 07:30:09.755 INFO: Atomic Numbers used: [74]
2024-10-01 07:30:09.756 INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
2024-10-01 07:30:09.757 INFO: Atomic Energies used (z: eV) for head default: {74: -11.022250868182281}
2024-10-01 07:30:09.760 INFO: Atomic Energies used (z: eV) for head pt_head: {74: -29.330717613489064}
2024-10-01 07:30:17.415 INFO: Average number of neighbors: 57.10205972318726
2024-10-01 07:30:17.416 INFO: During training the following quantities will be reported: energy, forces, virials, stress
2024-10-01 07:30:17.416 INFO: ===========MODEL DETAILS===========
Traceback (most recent call last):
  File "/home/mncui/software/miniconda3/envs/mace_foundation/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/work/home/mncui/software/mace_main09_2024/mace/cli/run_train.py", line 62, in main
    run(args)
  File "/work/home/mncui/software/mace_main09_2024/mace/cli/run_train.py", line 501, in run
    model, output_args = configure_model(args, train_loader, atomic_energies, model_foundation, heads, z_table)
  File "/work/home/mncui/software/mace_main09_2024/mace/tools/model_script_utils.py", line 37, in configure_model
    args.mean, args.std = modules.scaling_classes[args.scaling](
  File "/work/home/mncui/software/mace_main09_2024/mace/modules/utils.py", line 312, in compute_mean_rms_energy_forces
    node_e0 = atomic_energies_fn(batch.node_attrs)
  File "/home/mncui/software/miniconda3/envs/mace_foundation/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/home/mncui/software/mace_main09_2024/mace/modules/blocks.py", line 160, in forward
    return torch.matmul(x, torch.atleast_2d(self.atomic_energies).T)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (229x1 and 2x1)
Training finished!

Here is the log file MACE_model_run-2747_debug.log
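
For readers following along, the shape bookkeeping behind `mat1 and mat2 shapes cannot be multiplied (229x1 and 2x1)` can be traced in a few lines of plain Python. This is only a sketch: the helper names `atleast_2d_t_shape` and `matmul_shape` are hypothetical, only shapes are tracked (no tensors), and the idea that the E0s of the two heads ended up in a flat length-2 tensor is one layout consistent with the reported shapes, not a confirmed diagnosis of the MACE code.

```python
# Shape-only sketch of the failing line in the traceback:
#   torch.matmul(x, torch.atleast_2d(self.atomic_energies).T)
# Helper names are hypothetical; only tuple shapes are tracked.

def atleast_2d_t_shape(shape):
    """Shape of atleast_2d(t).T for a tensor t with 0, 1 or 2 dimensions."""
    if len(shape) == 0:
        shape = (1, 1)
    elif len(shape) == 1:
        shape = (1, shape[0])
    rows, cols = shape
    return (cols, rows)

def matmul_shape(a, b):
    """Result shape of a 2-D matrix product, or ValueError if inner dims differ."""
    if a[1] != b[0]:
        raise ValueError(
            f"shapes cannot be multiplied ({a[0]}x{a[1]} and {b[0]}x{b[1]})"
        )
    return (a[0], b[1])

node_attrs = (229, 1)  # 229 rows, one-hot over a single element (Z = 74)

# Assumed layout A: E0s of both heads flattened into a length-2 vector.
flat = atleast_2d_t_shape((2,))  # (1, 2) transposed gives (2, 1)
try:
    matmul_shape(node_attrs, flat)
    message = None
except ValueError as err:
    message = str(err)  # same pair of shapes as in the traceback

# Assumed layout B: E0s stored as (n_heads, n_elements) = (2, 1).
per_head = atleast_2d_t_shape((2, 1))        # (1, 2)
result = matmul_shape(node_attrs, per_head)  # one E0 column per head
```

Under layout B the product is well defined and yields one column of node energies per head, which is why the way the per-head E0s are stacked matters for a single-element dataset.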

ilyes319 commented 1 month ago

Hello, multihead finetuning is not yet supported for models other than the MP pretrained models. Hopefully I can fix that soon. For now, please use normal finetuning.

MengnanCui commented 1 month ago

OK, thanks for your reply. Looking forward to updates.

ilyes319 commented 1 month ago

@MengnanCui Can you test again with the latest main? I should have fixed that.

MengnanCui commented 1 month ago

Great, Thank you! I will try it!

MengnanCui commented 1 month ago

Hi @ilyes319, thanks so much for your efforts.

(1) I tried the latest main branch with the same settings as above, and it still outputs this error during finetuning.

2024-10-03 08:32:13.385 INFO: ===========VERIFYING SETTINGS===========
2024-10-03 08:32:13.386 INFO: MACE version: 0.3.7
2024-10-03 08:32:13.453 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-03 08:32:14.229 INFO: Using foundation model ../ptbp_model.model as initial checkpoint.
2024-10-03 08:32:14.230 INFO: ===========LOADING INPUT DATA===========
2024-10-03 08:32:14.230 INFO: Using heads: ['default']
2024-10-03 08:32:14.231 INFO: =============    Processing head default     ===========
2024-10-03 08:32:14.300 INFO: Training set [100 configs, 100 energy, 4761 forces] loaded from 'training.xyz'
2024-10-03 08:32:14.628 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../fixed_validation.xyz'
2024-10-03 08:32:14.946 INFO: Test set (1000 configs) loaded from '../../fixed_test.xyz':
2024-10-03 08:32:14.947 INFO: Default_Default: 1000 configs, 1000 energy, 46560 forces
2024-10-03 08:32:14.947 INFO: Total number of configurations: train=100, valid=1000, tests=[Default_Default: 1000],
2024-10-03 08:32:14.948 INFO: ==================Using multiheads finetuning mode==================
2024-10-03 08:32:14.948 INFO: Using foundation model for multiheads finetuning with ../../../transferability7k/training.xyz
2024-10-03 08:32:17.246 INFO: Training set [7642 configs, 7642 energy, 380589 forces] loaded from '../../../transferability7k/training.xyz'
2024-10-03 08:32:17.776 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../../transferability7k/validation.xyz'
2024-10-03 08:32:17.776 INFO: Total number of configurations: train=7642, valid=1000
2024-10-03 08:32:17.817 INFO: Atomic Numbers used: [74]
2024-10-03 08:32:17.817 INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
2024-10-03 08:32:17.823 INFO: Atomic Energies used (z: eV) for head default: {74: -11.022250868182281}
2024-10-03 08:32:17.823 INFO: Atomic Energies used (z: eV) for head pt_head: {74: -29.330717613489064}
2024-10-03 08:32:26.050 INFO: Average number of neighbors: 57.10205972318726
2024-10-03 08:32:26.051 INFO: During training the following quantities will be reported: energy, forces, virials, stress
2024-10-03 08:32:26.051 INFO: ===========MODEL DETAILS===========
Traceback (most recent call last):
  File "/home/mncui/software/miniconda3/envs/mace_foundation/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 63, in main
    run(args)
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 505, in run
    model, output_args = configure_model(args, train_loader, atomic_energies, model_foundation, heads, z_table)
  File "/work/home/mncui/software/mace_main10_2024/mace/tools/model_script_utils.py", line 37, in configure_model
    args.mean, args.std = modules.scaling_classes[args.scaling](
  File "/work/home/mncui/software/mace_main10_2024/mace/modules/utils.py", line 312, in compute_mean_rms_energy_forces
    node_e0 = atomic_energies_fn(batch.node_attrs)
  File "/home/mncui/software/miniconda3/envs/mace_foundation/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/home/mncui/software/mace_main10_2024/mace/modules/blocks.py", line 160, in forward
    return torch.matmul(x, torch.atleast_2d(self.atomic_energies).T)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (421x1 and 2x1)

MACE_model_run-2024_debug.log

(2) On the other hand, and for your information, I wanted to exclude any effect of the MACE version (the ../ptbp_model.model I used above was based on code from at least 3/4 months ago). I therefore trained a new model with the latest main branch, which gave ./train_main/MACE_model.model, and then ran multihead finetuning based on it. This produces a different error message:

2024-10-03 10:05:17.508 INFO: ===========VERIFYING SETTINGS===========
2024-10-03 10:05:17.508 INFO: MACE version: 0.3.7
2024-10-03 10:05:17.570 INFO: CUDA version: 11.8, CUDA device: 0
2024-10-03 10:05:18.256 INFO: Using foundation model ./train_main/MACE_model.model as initial checkpoint.
2024-10-03 10:05:18.257 INFO: ===========LOADING INPUT DATA===========
2024-10-03 10:05:18.257 INFO: Using heads: ['default']
2024-10-03 10:05:18.257 INFO: =============    Processing head default     ===========
2024-10-03 10:05:18.320 INFO: Training set [100 configs, 100 energy, 4761 forces] loaded from 'training.xyz'
2024-10-03 10:05:18.624 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../fixed_validation.xyz'
2024-10-03 10:05:18.925 INFO: Test set (1000 configs) loaded from '../../fixed_test.xyz':
2024-10-03 10:05:18.926 INFO: Default_Default: 1000 configs, 1000 energy, 46560 forces
2024-10-03 10:05:18.927 INFO: Total number of configurations: train=100, valid=1000, tests=[Default_Default: 1000],
2024-10-03 10:05:18.927 INFO: ==================Using multiheads finetuning mode==================
2024-10-03 10:05:18.928 INFO: Using foundation model for multiheads finetuning with ../../../transferability7k/training.xyz
2024-10-03 10:05:21.214 INFO: Training set [7642 configs, 7642 energy, 380589 forces] loaded from '../../../transferability7k/training.xyz'
2024-10-03 10:05:21.689 INFO: Validation set [1000 configs, 1000 energy, 46593 forces] loaded from '../../../transferability7k/validation.xyz'
2024-10-03 10:05:21.689 INFO: Total number of configurations: train=7642, valid=1000
2024-10-03 10:05:21.719 INFO: Atomic Numbers used: [74]
2024-10-03 10:05:21.720 INFO: Isolated Atomic Energies (E0s) not in training file, using command line argument
Traceback (most recent call last):
  File "/home/mncui/software/miniconda3/envs/mace_foundation/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 63, in main
    run(args)
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 356, in run
    atomic_energies_dict[head_config.head_name] = {
  File "/work/home/mncui/software/mace_main10_2024/mace/cli/run_train.py", line 357, in <dictcomp>
    z: model_foundation.atomic_energies_fn.atomic_energies[
IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

MACE_model_newrun-2024_debug.log
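
The second error, `IndexError: invalid index of a 0-dim tensor`, is what indexing a scalar (0-dimensional) tensor produces. Below is a minimal sketch under two assumptions: NumPy is used as a stand-in for the torch tensor (it raises the same exception type, with a different message), and the single-element E0 of the foundation model is assumed to have been stored as a scalar rather than a length-1 vector.

```python
import numpy as np

# Stand-in for atomic_energies when a single E0 is stored as a 0-dim array.
e0_scalar = np.array(-29.330717613489064)

try:
    e0_scalar[0]  # indexing a 0-dim array is not allowed
    indexing_failed = False
except IndexError:
    indexing_failed = True

# Promoting to at least 1-D makes the per-element lookup valid again.
e0_vector = np.atleast_1d(e0_scalar)
z_table = [74]  # atomic numbers in the dataset (tungsten only)
e0_by_z = {z: float(e0_vector[i]) for i, z in enumerate(z_table)}
```

Whether the actual fix in MACE promotes the tensor this way is not stated in this thread; the sketch only shows why a one-element model can trip an index-based lookup.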

Thanks again, and I hope this information helps.

ilyes319 commented 1 month ago

Could you send your input script, a small sample of your data, and your model to ib467@cam.ac.uk so I can reproduce this myself? Also, how are you parsing your E0s?

MengnanCui commented 1 month ago

Hi, I hope the email found you. The E0s were set inside the script; there is only one element in the datasets, tungsten.

MengnanCui commented 1 month ago

By the way, the E0s for the pretrained model are set in the same way in the input script, but come from a DFTB calculation: {74: -29.330717613489064}. As you can see, all the data also contain dftb_-tagged energies and forces.

gabor1 commented 1 month ago

the E0s need to be calculated with the same method as the data you are fitting

MengnanCui commented 1 month ago

Yes, that's why I have DFTB E0s for pretraining on DFTB and DFT E0s for finetuning on DFT.
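
To put a number on why the reference has to match the method: the two E0 values in the logs above differ by about 18.3 eV per tungsten atom, so subtracting the wrong one shifts the energy the model is asked to fit by that amount times the number of atoms. A small arithmetic sketch; the function name `reference_shift` is hypothetical and the 48-atom cell is a made-up example size, not taken from the dataset.

```python
E0_DFT = -11.022250868182281   # head "default" in the log (eV)
E0_DFTB = -29.330717613489064  # head "pt_head" in the log (eV)

def reference_shift(n_atoms, e0_used, e0_correct):
    """Offset (eV) in E_total - n_atoms * E0 when e0_used replaces e0_correct."""
    return n_atoms * (e0_correct - e0_used)

per_atom = E0_DFT - E0_DFTB                      # about 18.31 eV per atom
shift_48 = reference_shift(48, E0_DFTB, E0_DFT)  # about 878.8 eV for 48 atoms
```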

ilyes319 commented 3 weeks ago

@MengnanCui I should have fixed that in the main branch. Could you try it and tell me whether it is indeed fixed?

MengnanCui commented 2 weeks ago

Hi Ilyes, I submitted a job and it has worked very well so far. The bug is fixed. Thank you very much!