It's easy to fix the issue in `dict_to_array` that causes the error above (just add `if heads is not None else 0`), but there are also other places that rely on `heads` not being `None`, e.g.
Traceback (most recent call last):
File "/home/cluster/bernstei/.local/bin/mace_run_train", line 8, in <module>
sys.exit(main())
File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/cli/run_train.py", line 630, in main
model = modules.ScaleShiftMACE(**model_config)
File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 321, in __init__
super().__init__(**kwargs)
File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 137, in __init__
LinearReadoutBlock(hidden_irreps, o3.Irreps(f"{len(heads)}x0e"))
TypeError: object of type 'NoneType' has no len()
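For illustration, the same kind of `None` guard would be needed at the readout construction shown in this traceback; a minimal sketch (the default-to-one-head choice here is my guess, not a tested fix from the branch):

```python
from e3nn import o3

# Sketch only, not a tested patch: the kind of None guard needed wherever
# len(heads) is taken, falling back to a single implicit head when no heads
# were configured.
heads = None
num_heads = len(heads) if heads is not None else 1
readout_irreps = o3.Irreps(f"{num_heads}x0e")  # -> "1x0e" for a single head
```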
I initially thought what was missing was that `args.heads` needs to be set to the value of `heads` when the former is `None` but the latter is read from the xyz file (right after https://github.com/ACEsuit/mace/blob/a386d997ae5675a9129d87cd53e2ce435871510a/mace/cli/run_train.py#L183), because the rest of the code is inconsistent about whether it uses `args.heads` or `heads`. However, while this fixes some issues (like the `dict_to_array` error above), it then crashes with an error that is probably also caused by a wrong data structure, but it's much more opaque:
Traceback (most recent call last):
File "/home/cluster/bernstei/.local/bin/mace_run_train", line 8, in <module>
sys.exit(main())
File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/cli/run_train.py", line 631, in main
model = modules.ScaleShiftMACE(**model_config)
File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/models.py", line 322, in __init__
self.scale_shift = ScaleShiftBlock(
File "/home/cluster/bernstei/.local/lib/python3.9/site-packages/mace/modules/blocks.py", line 767, in __init__
torch.atleast_1d(torch.tensor(shift, dtype=torch.get_default_dtype())),
TypeError: len() of unsized object
It appears that you cannot pass a list of length one to `torch.tensor`. You can pass a plain float (like the `main` branch does), or a list with more than one element (like this branch does when there's more than one head). But if there's exactly one head, you have to extract the float from the length-1 list before passing it on in `modules/blocks.py` (line ~767 in the error message above).
Pending someone else looking at whether it makes sense, this patch seems to get things running for me:
diff --git a/mace/cli/run_train.py b/mace/cli/run_train.py
index e29120b..31ff28f 100644
--- a/mace/cli/run_train.py
+++ b/mace/cli/run_train.py
@@ -181,6 +181,7 @@ def main() -> None:
logging.info(
"Using heads extracted from data files," f" heads used: {heads}"
)
+ args.heads = heads
if args.multiheads_finetuning:
logging.info("Using multiheads finetuning mode")
@@ -627,6 +628,20 @@ def main() -> None:
heads=heads,
)
elif args.model == "FoundationMACE":
+ try:
+ # torch.tensor seems to object to list of length 1 that contains a float
+ if len(model_config["atomic_inter_scale"]) == 1: #NB
+ model_config["atomic_inter_scale"] = model_config["atomic_inter_scale"][0]
+ except TypeError:
+ # len(float) fails
+ pass
+ try:
+ # torch.tensor seems to object to list of length 1 that contains a float
+ if len(model_config["atomic_inter_shift"]) == 1: #NB
+ model_config["atomic_inter_shift"] = model_config["atomic_inter_shift"][0]
+ except TypeError:
+ # len(float) fails
+ pass
model = modules.ScaleShiftMACE(**model_config)
elif args.model == "ScaleShiftBOTNet":
model = modules.ScaleShiftBOTNet(
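The duplicated try/except in the two hunks above could also be factored into a small helper, something along these lines (just a sketch with a hypothetical name, not part of the actual patch):

```python
def unwrap_singleton(value):
    """Return the lone element of a length-1 sequence; otherwise return value unchanged."""
    try:
        if len(value) == 1:
            return value[0]
    except TypeError:
        # plain floats (and other unsized objects) have no len(); leave them as-is
        pass
    return value


# equivalent to the two hunks above:
#   model_config["atomic_inter_scale"] = unwrap_singleton(model_config["atomic_inter_scale"])
#   model_config["atomic_inter_shift"] = unwrap_singleton(model_config["atomic_inter_shift"])
assert unwrap_singleton([1.23]) == 1.23
assert unwrap_singleton(2.0) == 2.0
assert unwrap_singleton([1.0, 2.0]) == [1.0, 2.0]
```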
I am reworking the whole branch and fixing things so it can be merged into main. Hopefully all of these will be fixed there.
@bernstei Can you retry what you were trying using the `multihead-merge` branch?
Yes, I'll do that today
Also, if you could try loading a model in lammps using the same branch, that would be nice.
I can try the full cycle – fit a non-fine-tuned model, convert it to LAMMPS format, and run it in lammps. Should a multihead fine-tuned model also export to a lammps-compatible form in this branch, or only non-multihead models?
You should be able to directly export the final model to a lammps-compatible one. I will probably have to fix some small bugs once you tell me how it crashed.
Issues so far
git?
2024-08-23 08:57:03.422 INFO: Error accessing Git repository: Failed to initialize: Bad git executable.
The git executable must be specified in one of the following ways:
- be included in your $PATH
- be set via $GIT_PYTHON_GIT_EXECUTABLE
- explicitly set via git.refresh(<full-path-to-git-executable>)
All git commands will error until this is rectified.
This initial message can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
- quiet|q|silence|s|silent|none|n|0: for no message or exception
- warn|w|warning|log|l|1: for a warning message (logging level CRITICAL, displayed by default)
- error|e|exception|raise|r|2: for a raised exception
Example:
export GIT_PYTHON_REFRESH=quiet
Traceback (most recent call last):
File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/test_multihead_merge/../mace/cli/run_train.py", line 987, in <module>
main()
File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/test_multihead_merge/../mace/cli/run_train.py", line 60, in main
run(args)
File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/test_multihead_merge/../mace/cli/run_train.py", line 491, in run
args.mean, args.std = modules.scaling_classes[args.scaling](
File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/mace/modules/utils.py", line 303, in compute_mean_rms_energy_forces
atomic_energies_fn = AtomicEnergiesBlock(atomic_energies=atomic_energies)
File "/home/cluster/bernstei/src/work/MACE/mace_github_develop/mace/modules/blocks.py", line 153, in __init__
torch.tensor(atomic_energies, dtype=torch.get_default_dtype()),
TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.
This is consistent with the fact that stdout reports no atomic energies, despite the fact that I have the correct configurations with `config_type=IsolatedAtom` in the `--train_file`. I can't easily go any further until the E0 issue is solved.
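For what it's worth, that particular TypeError is what `torch.tensor` raises for any object-dtype NumPy array; a minimal illustration (the ragged per-head E0 arrays here are only an assumption about how such an array could arise, not a confirmed diagnosis):

```python
import numpy as np
import torch

# A ragged collection of atomic energies (e.g. different element sets per head)
# can only be held as an object-dtype array, which torch.tensor refuses to convert.
atomic_energies = np.array([np.array([-13.6]), np.array([-75.0, -2039.0])], dtype=object)
torch.tensor(atomic_energies, dtype=torch.get_default_dtype())
# TypeError: can't convert np.ndarray of type numpy.object_. ...
```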
Can you show me what kind of input are you using?
What do you mean by "what kind of input"? It's an xyz file, same as always, with `REF_...` keys.
2024-08-23 08:57:02.954 INFO: Configuration: Namespace(config=None, name='MACE_no_multihead', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cuda', default_dtype='float64', distributed=False, log_level='INFO', error_table='PerAtomRMSEstressvirials', model='MACE', r_max=5.0, radial_type='bessel', num_radial_basis=8, num_cutoff_basis=5, pair_repulsion=False, distance_transform='None', interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='128x0e + 128x1o', num_channels=None, max_L=None, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=True, compute_forces=True, train_file='_no_multihead.fit.xyz', valid_file='_no_multihead.valid.xyz', valid_fraction=0.1, test_file=None, test_dir=None, multi_processed_test=False, num_workers=0, pin_memory=True, atomic_numbers=None, mean=None, std=None, statistics_file=None, E0s=None, foundation_filter_elements=True, heads=None, multiheads_finetuning=False, weight_pt_head=1.0, num_samples_pt=1000, subselect_pt='random', keep_isolated_atoms=False, energy_key='REF_energy', forces_key='REF_forces', virials_key='virials', stress_key='REF_stress', dipole_key='dipole', charges_key='charges', loss='universal', forces_weight=156.25, swa_forces_weight=156.25, energy_weight=2500.0, swa_energy_weight=25000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=44.44444444444445, swa_stress_weight=444.4444444444445, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', huber_delta=0.01, optimizer='adam', beta=0.9, batch_size=16, valid_batch_size=32, lr=0.001, swa_lr=0.0002, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=True, start_swa=30, ema=True, ema_decay=0.99, max_num_epochs=60, patience=2048, foundation_model='small', foundation_model_readout=True, eval_interval=1, keep_checkpoints=False, save_all_checkpoints=False, restart_latest=True, save_cpu=True, clip_grad=10.0, wandb=False, wandb_dir=None, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])
Explicitly specified input flags are
python3 ../mace/cli/run_train.py \\
--foundation_model="{args.foundation_model}" \\
--batch_size=16 \\
--valid_batch_size=32 \\
--lr=0.001 \\
--loss={args.loss} \\
--compute_stress=True \\
--max_num_epochs=60 \\
--energy_weight={args.weights[0]} \\
--forces_weight={args.weights[1]} \\
--stress_weight={args.weights[2]} \\
--swa \\
--start_swa=30 \\
--swa_lr=0.0002 \\
--swa_energy_weight={10 * args.weights[0]} \\
--swa_forces_weight={1 * args.weights[1]} \\
--swa_stress_weight={10 * args.weights[2]} \\
--ema \\
--ema_decay=0.99 \\
--amsgrad \\
--error_table=PerAtomRMSEstressvirials \\
--default_dtype=float64 \\
--restart_latest \\
--device={args.device} \\
--eval_interval=1 \\
--save_cpu \\
--energy_key=REF_energy --forces_key=REF_forces --stress_key=REF_stress \\
--train_file=_{args.label}.fit.xyz \\
--valid_file=_{args.label}.valid.xyz \\
--name="MACE_{args.label}" """
plus --multiheads_finetuning=f
I wanted to see the inputs you used so I can reproduce it.
Disabling multihead and exporting the resulting model to lammps works fine on `multihead-merge`. Still waiting for the multihead fit to finish so I can test lammps export.
Exporting the multihead model to lammps and running it works (`multihead-merge`) without error. I haven't confirmed that it gives the right answer, though.
Also, note that the errors reported during fitting with this branch seem unreasonable. Discussion is offline right now; I will come back here if it's useful.
Closing this as it is supported in develop now.
There are at least two things wrong with `--multiheads_finetuning` in the `multi-head-interface` branch. The first is that having `type=bool` doesn't work by itself, so you can't actually turn it off. It could be fixed by adding a special parser, or by changing the logic to `--no_multiheads_finetuning` combined with `dest="multiheads_finetuning"` and `action="store_false"` (and dropping the `type` and `default` args).

The other is that even after you do something so that the rest of the code receives `args.multiheads_finetuning == False`, it fails with