cannot turn off multihead finetuning in multi-head-interface branch

bernstei commented 3 weeks ago

There are at least two things wrong with --multiheads_finetuning in the multi-head-interface branch. The first is that type=bool doesn't work by itself: argparse calls bool() on the raw string, and any non-empty string (including "False") is truthy, so you can't actually turn it off. This could be fixed by adding a special parser, or by changing the logic to --no_multiheads_finetuning combined with dest="multiheads_finetuning" and action="store_false" (and dropping the type and default args).
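A minimal sketch of the first problem and the two fixes suggested above, using only the standard-library argparse. The parser-building functions and the str2bool helper are hypothetical names for illustration, not MACE's actual code:

```python
import argparse


def str2bool(value: str) -> bool:
    """Hypothetical 'special parser': map common spellings to a real bool."""
    lowered = value.lower()
    if lowered in ("true", "t", "yes", "y", "1"):
        return True
    if lowered in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_broken_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # type=bool calls bool("False"), and any non-empty string is truthy.
    parser.add_argument("--multiheads_finetuning", type=bool, default=True)
    return parser


def build_store_false_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # No type/default needed: store_false implies a default of True.
    parser.add_argument(
        "--no_multiheads_finetuning",
        dest="multiheads_finetuning",
        action="store_false",
    )
    return parser


def build_str2bool_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # The custom converter makes "--multiheads_finetuning=f" mean False.
    parser.add_argument("--multiheads_finetuning", type=str2bool, default=True)
    return parser
```

With the broken parser, passing "--multiheads_finetuning False" still yields True. The store_false variant changes the command-line interface but needs no custom parsing; the str2bool variant keeps the existing flag name.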

The second is that even once the rest of the code receives args.multiheads_finetuning == False, it fails with:

Traceback (most recent call last):
  File "/home/cluster/bernstei/.local/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/cli/run_train.py", line 357, in main
    atomic_energies = dict_to_array(atomic_energies_dict, args.heads)
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/tools/scripts_utils.py", line 257, in dict_to_array
    head_index = heads.index(head_name)
AttributeError: 'NoneType' object has no attribute 'index'

bernstei commented 3 weeks ago

It's easy to fix the issue in dict_to_array that causes the error above (just add if heads is not None else 0), but there are also other places that rely on heads not being None, e.g.

Traceback (most recent call last):
  File "/home/cluster/bernstei/.local/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/cli/run_train.py", line 630, in main
    model = modules.ScaleShiftMACE(**model_config)
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 321, in __init__
    super().__init__(**kwargs)
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 137, in __init__
    LinearReadoutBlock(hidden_irreps, o3.Irreps(f"{len(heads)}x0e"))
TypeError: object of type 'NoneType' has no len()
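
Both tracebacks come from code that assumes heads is a list. One way to avoid scattering None checks is to normalize heads once, up front. The sketch below is hypothetical: normalize_heads, head_index, and the "Default" head name are illustrative assumptions, not MACE's actual helpers.

```python
from typing import List, Optional

DEFAULT_HEAD = "Default"  # assumed placeholder name, not taken from MACE


def normalize_heads(heads: Optional[List[str]]) -> List[str]:
    """Return a non-empty list of head names, so len() and .index() are safe."""
    if heads is None or len(heads) == 0:
        return [DEFAULT_HEAD]
    return list(heads)


def head_index(head_name: str, heads: Optional[List[str]]) -> int:
    """Index of head_name, falling back to 0 when no heads were given.

    This mirrors the 'if heads is not None else 0' guard suggested above.
    """
    return heads.index(head_name) if heads is not None else 0
```

Normalizing once would cover both the heads.index(...) call and the len(heads) call; the per-call guard only covers the first.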

bernstei commented 3 weeks ago

I initially thought what was missing was that args.heads needs to be set to the value of heads when the former is None but the latter is read from the xyz file (right after https://github.com/ACEsuit/mace/blob/a386d997ae5675a9129d87cd53e2ce435871510a/mace/cli/run_train.py#L183), because the rest of the code is inconsistent about whether it uses args.heads or heads.

However, while this fixes some issues (like the dict_to_array error above), it then crashes with an error that's probably also caused by some wrong data structure, but is much more opaque:

Traceback (most recent call last):
  File "/home/cluster/bernstei/.local/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/cli/run_train.py", line 631, in main
    model = modules.ScaleShiftMACE(**model_config)
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 322, in __init__
    self.scale_shift = ScaleShiftBlock(
  File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/blocks.py", line 767, in __init__
    torch.atleast_1d(torch.tensor(shift, dtype=torch.get_default_dtype())),
TypeError: len() of unsized object

bernstei commented 3 weeks ago

It appears that you cannot pass a list of length one to torch.tensor. You can pass a plain float (as the main branch does), or a list with more than one element (as this branch does when there's more than one head). But if there's only one head, you have to extract the float from the length-1 list before it reaches the torch.tensor call in modules/blocks.py (line ~767 in the traceback above).

bernstei commented 3 weeks ago

Pending someone else looking at whether it makes sense, this patch seems to get things running for me

diff --git a/mace/cli/run_train.py b/mace/cli/run_train.py
index e29120b..31ff28f 100644
--- a/mace/cli/run_train.py
+++ b/mace/cli/run_train.py
@@ -181,6 +181,7 @@ def main() -> None:
             logging.info(
                 "Using heads extracted from data files," f" heads used: {heads}"
             )
+            args.heads = heads

         if args.multiheads_finetuning:
             logging.info("Using multiheads finetuning mode")
@@ -627,6 +628,20 @@ def main() -> None:
             heads=heads,
         )
     elif args.model == "FoundationMACE":
+        try:
+            # torch.tensor seems to object to list of length 1 that contains a float
+            if len(model_config["atomic_inter_scale"]) == 1: #NB
+                model_config["atomic_inter_scale"] = model_config["atomic_inter_scale"][0]
+        except TypeError:
+            # len(float) fails
+            pass
+        try:
+            # torch.tensor seems to object to list of length 1 that contains a float
+            if len(model_config["atomic_inter_shift"]) == 1: #NB
+                model_config["atomic_inter_shift"] = model_config["atomic_inter_shift"][0]
+        except TypeError:
+            # len(float) fails
+            pass
         model = modules.ScaleShiftMACE(**model_config)
     elif args.model == "ScaleShiftBOTNet":
         model = modules.ScaleShiftBOTNet(
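
The two try/except blocks in the patch are identical apart from the key, so they could be folded into one helper. A minimal sketch, assuming model_config is a plain dict; unwrap_singleton is a hypothetical name, not part of MACE, and the example values are made up:

```python
from typing import Any


def unwrap_singleton(value: Any) -> Any:
    """Return value[0] for a length-1 sequence, otherwise value unchanged."""
    try:
        if len(value) == 1:
            return value[0]
    except TypeError:
        # Plain floats have no len(); leave them as they are.
        pass
    return value


# Example config; the numbers are made up for illustration.
model_config = {"atomic_inter_scale": [0.8], "atomic_inter_shift": 0.0}
for key in ("atomic_inter_scale", "atomic_inter_shift"):
    model_config[key] = unwrap_singleton(model_config[key])
```

Lists with more than one element pass through untouched, so the multi-head case behaves as before.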

ilyes319 commented 3 weeks ago

I am reworking the whole branch and fixing things so it can be merged into main. Hopefully all of these will be fixed there.

ilyes319 commented 3 weeks ago

@bernstei Can you retry what you were trying, using the multihead-merge branch?

bernstei commented 3 weeks ago

Yes, I'll do that today

ilyes319 commented 3 weeks ago

Also, if you could try loading a model in lammps using the same branch, that would be nice.

bernstei commented 3 weeks ago

I can try the full cycle – fit a non-fine-tuned model, convert it to LAMMPS format, and run it in lammps. Should a multihead fine-tuned model also export to a lammps-compatible form in this branch, or only non-multihead models?

ilyes319 commented 3 weeks ago

You should be able to export the final model directly to a lammps-compatible one. I will probably have to fix some small bugs once you tell me how it crashed.

bernstei commented 3 weeks ago

Issues so far

tinyurl is blocked here, please use standard URLs. Standard URLs also let the user know what they're getting.

git?

2024-08-23 08:57:03.422 INFO: Error accessing Git repository: Failed to initialize: Bad git executable.
The git executable must be specified in one of the following ways:
    - be included in your $PATH
    - be set via $GIT_PYTHON_GIT_EXECUTABLE
    - explicitly set via git.refresh(<full-path-to-git-executable>)

All git commands will error until this is rectified.

This initial message can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
    - quiet|q|silence|s|silent|none|n|0: for no message or exception
    - warn|w|warning|log|l|1: for a warning message (logging level CRITICAL, displayed by default)
    - error|e|exception|raise|r|2: for a raised exception

Example:
    export GIT_PYTHON_REFRESH=quiet
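
The message above is what GitPython prints when it cannot find a git executable. As a sketch using only the standard library, a script could check for git on PATH and, if it is missing, set GIT_PYTHON_REFRESH (the variable named in the message) before GitPython is imported. The function name is hypothetical, and this only silences the message; git commands would still fail:

```python
import os
import shutil


def prepare_gitpython_env() -> bool:
    """Return True if git is on PATH; otherwise ask GitPython to stay quiet."""
    if shutil.which("git") is not None:
        return True
    # Respect a value the user has already set.
    os.environ.setdefault("GIT_PYTHON_REFRESH", "quiet")
    return False


# Call this before any 'import git' for the setting to take effect.
git_found = prepare_gitpython_env()
```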

Whether or not I disable multihead, I get an error related to the E0s:

Traceback (most recent call last):
  File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/test_multihead_merge/../mace/cli/run_train.py", line 987, in <module>
    main()
  File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/test_multihead_merge/../mace/cli/run_train.py", line 60, in main
    run(args)
  File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/test_multihead_merge/../mace/cli/run_train.py", line 491, in run
    args.mean, args.std = modules.scaling_classes[args.scaling](
  File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/mace/modules/utils.py", line 303, in compute_mean_rms_energy_forces
    atomic_energies_fn = AtomicEnergiesBlock(atomic_energies=atomic_energies)
  File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/mace/modules/blocks.py", line 153, in __init__
    torch.tensor(atomic_energies, dtype=torch.get_default_dtype()),
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

This is consistent with the fact that stdout reports no atomic energies, even though I have the correct configurations with config_type=IsolatedAtom in the --train_file.

I can't easily go any further until the E0 issue is solved.
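
The numpy.object_ error is the kind of failure one gets when an array is built from missing or non-numeric entries. A hypothetical pre-flight check could fail earlier with a clearer message. check_atomic_energies is not a MACE function, and the dict layout (atomic number to energy) is an assumption:

```python
from numbers import Real
from typing import Dict, Optional


def check_atomic_energies(e0s: Optional[Dict[int, object]]) -> Dict[int, float]:
    """Validate a {atomic_number: E0} mapping and return it with float values."""
    if not e0s:
        raise ValueError(
            "no atomic energies (E0s) found; check that the training file "
            "contains IsolatedAtom configurations"
        )
    # bool is a subclass of int, so exclude it explicitly.
    bad = {
        z: e
        for z, e in e0s.items()
        if isinstance(e, bool) or not isinstance(e, Real)
    }
    if bad:
        raise ValueError(f"non-numeric E0 entries: {bad}")
    return {z: float(e) for z, e in e0s.items()}
```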

ilyes319 commented 3 weeks ago

Can you show me what kind of input you are using?

bernstei commented 3 weeks ago

What do you mean by "what kind of input"? It's an xyz file, same as always, with REF_.... keys.

2024-08-23 08:57:02.954 INFO: Configuration: Namespace(config=None, name='MACE_no_multihead', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cuda', default_dtype='float64', distributed=False, log_level='INFO', error_table='PerAtomRMSEstressvirials', model='MACE', r_max=5.0, radial_type='bessel', num_radial_basis=8, num_cutoff_basis=5, pair_repulsion=False, distance_transform='None', interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='128x0e + 128x1o', num_channels=None, max_L=None, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=True, compute_forces=True, train_file='_no_multihead.fit.xyz', valid_file='_no_multihead.valid.xyz', valid_fraction=0.1, test_file=None, test_dir=None, multi_processed_test=False, num_workers=0, pin_memory=True, atomic_numbers=None, mean=None, std=None, statistics_file=None, E0s=None, foundation_filter_elements=True, heads=None, multiheads_finetuning=False, weight_pt_head=1.0, num_samples_pt=1000, subselect_pt='random', keep_isolated_atoms=False, energy_key='REF_energy', forces_key='REF_forces', virials_key='virials', stress_key='REF_stress', dipole_key='dipole', charges_key='charges', loss='universal', forces_weight=156.25, swa_forces_weight=156.25, energy_weight=2500.0, swa_energy_weight=25000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=44.44444444444445, swa_stress_weight=444.4444444444445, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', huber_delta=0.01, optimizer='adam', beta=0.9, batch_size=16, valid_batch_size=32, lr=0.001, swa_lr=0.0002, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=True, 
start_swa=30, ema=True, ema_decay=0.99, max_num_epochs=60, patience=2048, foundation_model='small', foundation_model_readout=True, eval_interval=1, keep_checkpoints=False, save_all_checkpoints=False, restart_latest=True, save_cpu=True, clip_grad=10.0, wandb=False, wandb_dir=None, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])

bernstei commented 3 weeks ago

Explicitly specified input flags are

python3 ../mace/cli/run_train.py \\
    --foundation_model="{args.foundation_model}" \\
    --batch_size=16 \\
    --valid_batch_size=32 \\
    --lr=0.001 \\
    --loss={args.loss} \\
    --compute_stress=True \\
    --max_num_epochs=60 \\
    --energy_weight={args.weights[0]} \\
    --forces_weight={args.weights[1]} \\
    --stress_weight={args.weights[2]} \\
    --swa \\
    --start_swa=30 \\
    --swa_lr=0.0002 \\
    --swa_energy_weight={10 * args.weights[0]} \\
    --swa_forces_weight={1  * args.weights[1]} \\
    --swa_stress_weight={10 * args.weights[2]} \\
    --ema \\
    --ema_decay=0.99 \\
    --amsgrad \\
    --error_table=PerAtomRMSEstressvirials \\
    --default_dtype=float64 \\
    --restart_latest \\
    --device={args.device} \\
    --eval_interval=1 \\
    --save_cpu \\
    --energy_key=REF_energy --forces_key=REF_forces --stress_key=REF_stress \\
    --train_file=_{args.label}.fit.xyz \\
    --valid_file=_{args.label}.valid.xyz \\
    --name="MACE_{args.label}" """

plus --multiheads_finetuning=f

ilyes319 commented 3 weeks ago

I wanted to see the inputs you used so I can reproduce it.

bernstei commented 3 weeks ago

Disabling multihead and exporting the resulting model to lammps works fine on multihead-merge. Still waiting for the multihead fit to finish so I can test lammps export.

bernstei commented 3 weeks ago

Exporting the multihead model to lammps and running it works (multihead-merge) without error. I haven't confirmed that it gives the right answer, though.

Also, note that the errors reported during fitting with this branch seem unreasonable. The discussion is offline right now; I'll come back here if it's useful.

ilyes319 commented 1 week ago

Closing this as it is supported in develop now.

ACEsuit / mace

cannot turn off multihead finetuning in multi-head-interface branch #559