Hi,
We are trying to follow the README instructions for on-line data loading of large datasets: https://github.com/ACEsuit/mace?tab=readme-ov-file#on-line-data-loading-for-large-datasets
The command we ran is:
python run_train.py \
    --name="H5-test" \
    --train_file="./processed_data/train/" \
    --valid_file="./processed_data/val/" \
    --test_dir="./processed_data/test/" \
    --statistics_file="./processed_data/statistics.json" \
    --model="MACE" \
    --num_interactions=2 \
    --num_channels=128 \
    --max_L=2 \
    --correlation=3 \
    --num_workers=4 \
    --batch_size=4 \
    --valid_batch_size=4 \
    --max_num_epochs=1500 \
    --swa \
    --start_swa=1200 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --error_table='PerAtomMAE' \
    --loss='huber' \
    --E0s="average" \
    --scaling='rms_forces_scaling' \
    --lr=0.01 \
    --device=cuda \
    --seed=123 \
    --keep_checkpoints \
    --save_cpu
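For reference, `--num_channels=128` together with `--max_L=2` should correspond to hidden irreps `128x0e+128x1o+128x2e`, which is what the `Hidden irreps` line of the log reports (the `hidden_irreps='128x0e + 128x1o'` entry in the printed configuration appears to be only the default value). A minimal sketch of that mapping, assuming each degree l up to `max_L` gets `num_channels` copies with parity (-1)^l; the helper name `hidden_irreps_from` is hypothetical and is not part of MACE:

```python
def hidden_irreps_from(num_channels: int, max_L: int) -> str:
    """Build an irreps string such as '128x0e+128x1o+128x2e'.

    Hypothetical helper for illustration only; not a MACE function.
    """
    terms = []
    for l in range(max_L + 1):
        # Assumed parity convention: even l -> 'e', odd l -> 'o'.
        parity = "e" if l % 2 == 0 else "o"
        terms.append(f"{num_channels}x{l}{parity}")
    return "+".join(terms)


print(hidden_irreps_from(128, 2))  # 128x0e+128x1o+128x2e
```

This is only a consistency check on the settings; it does not touch the data loading or the CUDA error.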
However, we encountered the following error:
2024-06-12 11:35:41.818 INFO: MACE version: 0.3.5
2024-06-12 11:35:41.818 INFO: Configuration: Namespace(config=None, name='H5-test', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cuda', default_dtype='float64', distributed=False, log_level='INFO', error_table='PerAtomMAE', model='MACE', r_max=5.0, radial_type='bessel', num_radial_basis=8, num_cutoff_basis=5, pair_repulsion=False, distance_transform='None', interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='128x0e + 128x1o', num_channels=128, max_L=2, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=False, compute_forces=True, train_file='./processed_data/train/', valid_file='./processed_data/val/', valid_fraction=0.1, test_file=None, test_dir='./processed_data/test/', multi_processed_test=False, num_workers=4, pin_memory=True, atomic_numbers=None, mean=None, std=None, statistics_file='./processed_data/statistics.json', E0s='average', keep_isolated_atoms=False, energy_key='energy', forces_key='forces', virials_key='virials', stress_key='stress', dipole_key='dipole', charges_key='charges', loss='huber', forces_weight=100.0, swa_forces_weight=100.0, energy_weight=1.0, swa_energy_weight=1000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=1.0, swa_stress_weight=10.0, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', huber_delta=0.01, optimizer='adam', batch_size=4, valid_batch_size=4, lr=0.01, swa_lr=0.001, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=True, start_swa=1200, ema=True, ema_decay=0.99, max_num_epochs=1500, patience=2048, foundation_model=None, foundation_model_readout=True, eval_interval=2, keep_checkpoints=True, save_all_checkpoints=False, restart_latest=False, save_cpu=True, clip_grad=10.0, wandb=False, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])
2024-06-12 11:35:42.076 INFO: CUDA version: 11.8, CUDA device: 0
2024-06-12 11:35:42.145 INFO: Error accessing Git repository: /home/xqliu/mace-Na-Na3SbS4-20240612
2024-06-12 11:35:42.145 INFO: Using statistics json file
2024-06-12 11:35:42.146 INFO: Using atomic numbers from statistics file
2024-06-12 11:35:42.146 INFO: AtomicNumberTable: (11, 16, 51)
2024-06-12 11:35:42.146 INFO: Atomic Energies not in training file, using command line argument E0s
2024-06-12 11:35:42.146 INFO: Atomic energies: [-1.0677830771638828, -4.123435243331502, -8.679403149310547]
2024-06-12 11:35:42.154 INFO: WeightedHuberEnergyForcesStressLoss(energy_weight=1.000, forces_weight=100.000, stress_weight=1.000)
2024-06-12 11:35:42.155 INFO: Average number of neighbors: 14.610021909077963
2024-06-12 11:35:42.155 INFO: Selected the following outputs: {'energy': True, 'forces': True, 'virials': True, 'stress': True, 'dipoles': False}
2024-06-12 11:35:42.155 INFO: Building model
2024-06-12 11:35:42.156 INFO: Hidden irreps: 128x0e+128x1o+128x2e
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
  warnings.warn(
2024-06-12 11:35:44.544 INFO: Using stochastic weight averaging (after 1200 epochs) with energy weight : 1000.0, forces weight : 100.0 and learning rate : 0.001
2024-06-12 11:35:44.614 INFO: ScaleShiftMACE(
  (node_embedding): LinearNodeEmbeddingBlock(
    (linear): Linear(3x0e -> 128x0e | 384 weights)
  )
  (radial_embedding): RadialEmbeddingBlock(
    (bessel_fn): BesselBasis(r_max=5.0, num_basis=8, trainable=False)
    (cutoff_fn): PolynomialCutoff(p=5.0, r_max=5.0)
  )
  (spherical_harmonics): SphericalHarmonics()
  (atomic_energies_fn): AtomicEnergiesBlock(energies=[-1.0678, -4.1234, -8.6794])
  (interactions): ModuleList(
    (0): RealAgnosticInteractionBlock(
      (linear_up): Linear(128x0e -> 128x0e | 16384 weights)
      (conv_tp): TensorProduct(128x0e x 1x0e+1x1o+1x2e+1x3o -> 128x0e+128x1o+128x2e+128x3o | 512 paths | 512 weights)
      (conv_tp_weights): FullyConnectedNet[8, 64, 64, 64, 512]
      (linear): Linear(128x0e+128x1o+128x2e+128x3o -> 128x0e+128x1o+128x2e+128x3o | 65536 weights)
      (skip_tp): FullyConnectedTensorProduct(128x0e+128x1o+128x2e+128x3o x 3x0e -> 128x0e+128x1o+128x2e+128x3o | 196608 paths | 196608 weights)
      (reshape): reshape_irreps()
    )
    (1): RealAgnosticResidualInteractionBlock(
      (linear_up): Linear(128x0e+128x1o+128x2e -> 128x0e+128x1o+128x2e | 49152 weights)
      (conv_tp): TensorProduct(128x0e+128x1o+128x2e x 1x0e+1x1o+1x2e+1x3o -> 384x0e+640x1o+640x2e+512x3o | 2176 paths | 2176 weights)
      (conv_tp_weights): FullyConnectedNet[8, 64, 64, 64, 2176]
      (linear): Linear(384x0e+640x1o+640x2e+512x3o -> 128x0e+128x1o+128x2e+128x3o | 278528 weights)
      (skip_tp): FullyConnectedTensorProduct(128x0e+128x1o+128x2e x 3x0e -> 128x0e | 49152 paths | 49152 weights)
      (reshape): reshape_irreps()
    )
  )
  (products): ModuleList(
    (0): EquivariantProductBasisBlock(
      (symmetric_contractions): SymmetricContraction(
        (contractions): ModuleList(
          (0): Contraction(
            (contractions_weighting): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (contractions_features): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (weights): ParameterList(
              (0): Parameter containing: [torch.float64 of size 3x4x128 (cuda:0)]
              (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
            )
            (graph_opt_main): GraphModule()
          )
          (1): Contraction(
            (contractions_weighting): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (contractions_features): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (weights): ParameterList(
              (0): Parameter containing: [torch.float64 of size 3x6x128 (cuda:0)]
              (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
            )
            (graph_opt_main): GraphModule()
          )
          (2): Contraction(
            (contractions_weighting): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (contractions_features): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (weights): ParameterList(
              (0): Parameter containing: [torch.float64 of size 3x7x128 (cuda:0)]
              (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
            )
            (graph_opt_main): GraphModule()
          )
        )
      )
      (linear): Linear(128x0e+128x1o+128x2e -> 128x0e+128x1o+128x2e | 49152 weights)
    )
    (1): EquivariantProductBasisBlock(
      (symmetric_contractions): SymmetricContraction(
        (contractions): ModuleList(
          (0): Contraction(
            (contractions_weighting): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (contractions_features): ModuleList(
              (0-1): 2 x GraphModule()
            )
            (weights): ParameterList(
              (0): Parameter containing: [torch.float64 of size 3x4x128 (cuda:0)]
              (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
            )
            (graph_opt_main): GraphModule()
          )
        )
      )
      (linear): Linear(128x0e -> 128x0e | 16384 weights)
    )
  )
  (readouts): ModuleList(
    (0): LinearReadoutBlock(
      (linear): Linear(128x0e+128x1o+128x2e -> 1x0e | 128 weights)
    )
    (1): NonLinearReadoutBlock(
      (linear_1): Linear(128x0e -> 16x0e | 2048 weights)
      (non_linearity): Activation [x] (16x0e -> 16x0e)
      (linear_2): Linear(16x0e -> 1x0e | 16 weights)
    )
  )
  (scale_shift): ScaleShiftBlock(scale=0.432809, shift=0.000000)
)
2024-06-12 11:35:44.619 INFO: Number of parameters: 984720
2024-06-12 11:35:44.620 INFO: Optimizer: Adam (
Parameter Group 0
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    initial_lr: 0.01
    lr: 0.01
    maximize: False
    name: embedding
    swa_lr: 0.001
    weight_decay: 0.0
Parameter Group 1
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    initial_lr: 0.01
    lr: 0.01
    maximize: False
    name: interactions_decay
    swa_lr: 0.001
    weight_decay: 5e-07
Parameter Group 2
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    initial_lr: 0.01
    lr: 0.01
    maximize: False
    name: interactions_no_decay
    swa_lr: 0.001
    weight_decay: 0.0
Parameter Group 3
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    initial_lr: 0.01
    lr: 0.01
    maximize: False
    name: products
    swa_lr: 0.001
    weight_decay: 5e-07
Parameter Group 4
    amsgrad: True
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    initial_lr: 0.01
    lr: 0.01
    maximize: False
    name: readouts
    swa_lr: 0.001
    weight_decay: 0.0
)
2024-06-12 11:35:44.620 INFO: Using gradient clipping with tolerance=10.000
2024-06-12 11:35:44.620 INFO: Started training
2024-06-12 11:35:58.874 INFO: Epoch None: loss=0.0737, RMSE_E_per_atom=117.9 meV, RMSE_F=420.1 meV / A, RMSE_stress_per_atom=0.0 meV / A^3
Traceback (most recent call last):
File "/home/xqliu/mace-Na-Na3SbS4-20240612/run_train.py", line 6, in <module>
main()
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/cli/run_train.py", line 51, in main
run(args)
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/cli/run_train.py", line 687, in run
tools.train(
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/tools/train.py", line 179, in train
train_one_epoch(
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/tools/train.py", line 289, in train_one_epoch
_, opt_metrics = take_step(
^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/tools/train.py", line 319, in take_step
output = model(
^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/models.py", line 369, in forward
node_feats = product(
^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/blocks.py", line 226, in forward
node_feats = self.symmetric_contractions(node_feats, node_attrs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/symmetric_contraction.py", line 82, in forward
outs = [contraction(x, y) for contraction in self.contractions]
^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/symmetric_contraction.py", line 222, in forward
c_tensor = contract_weights(
^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/fx/graph_module.py", line 737, in call_wrapped
return self._wrapped_call(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/fx/graph_module.py", line 317, in __call__
raise e
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/fx/graph_module.py", line 304, in __call__
return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<eval_with_key>.111", line 6, in forward
File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/functional.py", line 1213, in tensordot
return _VF.tensordot(a, b, dims_a, dims_b) # type: ignore[attr-defined]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

We are not sure what might be causing this. If you have any idea, we would be grateful to hear it.
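
Following the hint in the error message, one way to narrow this down is to re-run the same training command with `CUDA_LAUNCH_BLOCKING=1`, so that the kernel error is reported at the call that actually failed rather than at a later API call. Below is a minimal sketch of such a wrapper. The helper names `debug_env` and `rerun_synchronously` are hypothetical and not part of MACE or PyTorch; only the environment variable itself comes from the message above.

```python
import os
import subprocess
import sys


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # Makes kernel launches blocking, so the traceback points at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_synchronously(cmd):
    """Run `cmd` (a list of arguments) in the debug environment; return its exit code."""
    return subprocess.run(cmd, env=debug_env(), check=False).returncode


if __name__ == "__main__":
    # Hypothetical usage: pass the original training command as arguments,
    # e.g. the interpreter followed by run_train.py and its usual options.
    if len(sys.argv) > 1:
        sys.exit(rerun_synchronously(sys.argv[1:]))
```

The variable has to be set before the process initializes CUDA, which is why the sketch launches a fresh subprocess instead of setting it inside an already running session. Expect the run to be slower, since kernel launches no longer overlap.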