Error in online data loading for large datasets

Hi,

We are trying to follow the README instructions for on-line data loading for large datasets: https://github.com/ACEsuit/mace?tab=readme-ov-file#on-line-data-loading-for-large-datasets

The command we ran is:

    python run_train.py \
        --name="H5-test" \
        --train_file="./processed_data/train/" \
        --valid_file="./processed_data/val/" \
        --test_dir="./processed_data/test/" \
        --statistics_file="./processed_data/statistics.json" \
        --model="MACE" \
        --num_interactions=2 \
        --num_channels=128 \
        --max_L=2 \
        --correlation=3 \
        --num_workers=4 \
        --batch_size=4 \
        --valid_batch_size=4 \
        --max_num_epochs=1500 \
        --swa \
        --start_swa=1200 \
        --ema \
        --ema_decay=0.99 \
        --amsgrad \
        --error_table='PerAtomMAE' \
        --loss='huber' \
        --E0s="average" \
        --scaling='rms_forces_scaling' \
        --lr=0.01 \
        --device=cuda \
        --seed=123 \
        --keep_checkpoints \
        --save_cpu
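
As a sanity check before launching, here is a minimal sketch (not part of MACE) that counts the HDF5 files under each split directory passed above and checks that the statistics file exists. It assumes the preprocessing step wrote files with an `.h5` extension into `train/`, `val/` and `test/`; the function name `check_processed_data` is hypothetical.

```python
from pathlib import Path


def check_processed_data(root: str) -> dict:
    """Count *.h5 files per split and report whether statistics.json exists.

    Assumption: preprocessed shards use the .h5 extension and live under
    <root>/train, <root>/val and <root>/test.
    """
    base = Path(root)
    report = {}
    for split in ("train", "val", "test"):
        split_dir = base / split
        # rglob also picks up files in nested sub-directories, if there are any.
        report[split] = len(list(split_dir.rglob("*.h5"))) if split_dir.is_dir() else 0
    report["statistics"] = (base / "statistics.json").is_file()
    return report


# Usage: print(check_processed_data("./processed_data"))
```

A count of zero for a split, or a missing statistics file, would point to a data-preparation problem rather than a GPU problem.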

However, we hit the following error:

    2024-06-12 11:35:41.818 INFO: MACE version: 0.3.5
    2024-06-12 11:35:41.818 INFO: Configuration: Namespace(config=None, name='H5-test', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cuda', default_dtype='float64', distributed=False, log_level='INFO', error_table='PerAtomMAE', model='MACE', r_max=5.0, radial_type='bessel', num_radial_basis=8, num_cutoff_basis=5, pair_repulsion=False, distance_transform='None', interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='128x0e + 128x1o', num_channels=128, max_L=2, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=False, compute_forces=True, train_file='./processed_data/train/', valid_file='./processed_data/val/', valid_fraction=0.1, test_file=None, test_dir='./processed_data/test/', multi_processed_test=False, num_workers=4, pin_memory=True, atomic_numbers=None, mean=None, std=None, statistics_file='./processed_data/statistics.json', E0s='average', keep_isolated_atoms=False, energy_key='energy', forces_key='forces', virials_key='virials', stress_key='stress', dipole_key='dipole', charges_key='charges', loss='huber', forces_weight=100.0, swa_forces_weight=100.0, energy_weight=1.0, swa_energy_weight=1000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=1.0, swa_stress_weight=10.0, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', huber_delta=0.01, optimizer='adam', batch_size=4, valid_batch_size=4, lr=0.01, swa_lr=0.001, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=True, start_swa=1200, ema=True, ema_decay=0.99, max_num_epochs=1500, patience=2048, foundation_model=None, foundation_model_readout=True, eval_interval=2, keep_checkpoints=True, save_all_checkpoints=False, restart_latest=False, save_cpu=True, clip_grad=10.0, wandb=False, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])
    2024-06-12 11:35:42.076 INFO: CUDA version: 11.8, CUDA device: 0
    2024-06-12 11:35:42.145 INFO: Error accessing Git repository: /home/xqliu/mace-Na-Na3SbS4-20240612
    2024-06-12 11:35:42.145 INFO: Using statistics json file
    2024-06-12 11:35:42.146 INFO: Using atomic numbers from statistics file
    2024-06-12 11:35:42.146 INFO: AtomicNumberTable: (11, 16, 51)
    2024-06-12 11:35:42.146 INFO: Atomic Energies not in training file, using command line argument E0s
    2024-06-12 11:35:42.146 INFO: Atomic energies: [-1.0677830771638828, -4.123435243331502, -8.679403149310547]
    2024-06-12 11:35:42.154 INFO: WeightedHuberEnergyForcesStressLoss(energy_weight=1.000, forces_weight=100.000, stress_weight=1.000)
    2024-06-12 11:35:42.155 INFO: Average number of neighbors: 14.610021909077963
    2024-06-12 11:35:42.155 INFO: Selected the following outputs: {'energy': True, 'forces': True, 'virials': True, 'stress': True, 'dipoles': False}
    2024-06-12 11:35:42.155 INFO: Building model
    2024-06-12 11:35:42.156 INFO: Hidden irreps: 128x0e+128x1o+128x2e
    /home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/jit/_check.py:177: UserWarning: The TorchScript type system doesn't support instance-level annotations on empty non-base types in `__init__`. Instead, either 1) use a type annotation in the class body, or 2) wrap the type in `torch.jit.Attribute`.
      warnings.warn(
    [the same UserWarning is printed several more times]
    2024-06-12 11:35:44.544 INFO: Using stochastic weight averaging (after 1200 epochs) with energy weight : 1000.0, forces weight : 100.0 and learning rate : 0.001
    2024-06-12 11:35:44.614 INFO: ScaleShiftMACE(
      (node_embedding): LinearNodeEmbeddingBlock(
        (linear): Linear(3x0e -> 128x0e | 384 weights)
      )
      (radial_embedding): RadialEmbeddingBlock(
        (bessel_fn): BesselBasis(r_max=5.0, num_basis=8, trainable=False)
        (cutoff_fn): PolynomialCutoff(p=5.0, r_max=5.0)
      )
      (spherical_harmonics): SphericalHarmonics()
      (atomic_energies_fn): AtomicEnergiesBlock(energies=[-1.0678, -4.1234, -8.6794])
      (interactions): ModuleList(
        (0): RealAgnosticInteractionBlock(
          (linear_up): Linear(128x0e -> 128x0e | 16384 weights)
          (conv_tp): TensorProduct(128x0e x 1x0e+1x1o+1x2e+1x3o -> 128x0e+128x1o+128x2e+128x3o | 512 paths | 512 weights)
          (conv_tp_weights): FullyConnectedNet[8, 64, 64, 64, 512]
          (linear): Linear(128x0e+128x1o+128x2e+128x3o -> 128x0e+128x1o+128x2e+128x3o | 65536 weights)
          (skip_tp): FullyConnectedTensorProduct(128x0e+128x1o+128x2e+128x3o x 3x0e -> 128x0e+128x1o+128x2e+128x3o | 196608 paths | 196608 weights)
          (reshape): reshape_irreps()
        )
        (1): RealAgnosticResidualInteractionBlock(
          (linear_up): Linear(128x0e+128x1o+128x2e -> 128x0e+128x1o+128x2e | 49152 weights)
          (conv_tp): TensorProduct(128x0e+128x1o+128x2e x 1x0e+1x1o+1x2e+1x3o -> 384x0e+640x1o+640x2e+512x3o | 2176 paths | 2176 weights)
          (conv_tp_weights): FullyConnectedNet[8, 64, 64, 64, 2176]
          (linear): Linear(384x0e+640x1o+640x2e+512x3o -> 128x0e+128x1o+128x2e+128x3o | 278528 weights)
          (skip_tp): FullyConnectedTensorProduct(128x0e+128x1o+128x2e x 3x0e -> 128x0e | 49152 paths | 49152 weights)
          (reshape): reshape_irreps()
        )
      )
      (products): ModuleList(
        (0): EquivariantProductBasisBlock(
          (symmetric_contractions): SymmetricContraction(
            (contractions): ModuleList(
              (0): Contraction(
                (contractions_weighting): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (contractions_features): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (weights): ParameterList(
                  (0): Parameter containing: [torch.float64 of size 3x4x128 (cuda:0)]
                  (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
                )
                (graph_opt_main): GraphModule()
              )
              (1): Contraction(
                (contractions_weighting): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (contractions_features): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (weights): ParameterList(
                  (0): Parameter containing: [torch.float64 of size 3x6x128 (cuda:0)]
                  (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
                )
                (graph_opt_main): GraphModule()
              )
              (2): Contraction(
                (contractions_weighting): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (contractions_features): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (weights): ParameterList(
                  (0): Parameter containing: [torch.float64 of size 3x7x128 (cuda:0)]
                  (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
                )
                (graph_opt_main): GraphModule()
              )
            )
          )
          (linear): Linear(128x0e+128x1o+128x2e -> 128x0e+128x1o+128x2e | 49152 weights)
        )
        (1): EquivariantProductBasisBlock(
          (symmetric_contractions): SymmetricContraction(
            (contractions): ModuleList(
              (0): Contraction(
                (contractions_weighting): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (contractions_features): ModuleList(
                  (0-1): 2 x GraphModule()
                )
                (weights): ParameterList(
                  (0): Parameter containing: [torch.float64 of size 3x4x128 (cuda:0)]
                  (1): Parameter containing: [torch.float64 of size 3x1x128 (cuda:0)]
                )
                (graph_opt_main): GraphModule()
              )
            )
          )
          (linear): Linear(128x0e -> 128x0e | 16384 weights)
        )
      )
      (readouts): ModuleList(
        (0): LinearReadoutBlock(
          (linear): Linear(128x0e+128x1o+128x2e -> 1x0e | 128 weights)
        )
        (1): NonLinearReadoutBlock(
          (linear_1): Linear(128x0e -> 16x0e | 2048 weights)
          (non_linearity): Activation [x] (16x0e -> 16x0e)
          (linear_2): Linear(16x0e -> 1x0e | 16 weights)
        )
      )
      (scale_shift): ScaleShiftBlock(scale=0.432809, shift=0.000000)
    )
    2024-06-12 11:35:44.619 INFO: Number of parameters: 984720
    2024-06-12 11:35:44.620 INFO: Optimizer: Adam (
    Parameter Group 0
        amsgrad: True
        betas: (0.9, 0.999)
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        initial_lr: 0.01
        lr: 0.01
        maximize: False
        name: embedding
        swa_lr: 0.001
        weight_decay: 0.0

    Parameter Group 1
        amsgrad: True
        betas: (0.9, 0.999)
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        initial_lr: 0.01
        lr: 0.01
        maximize: False
        name: interactions_decay
        swa_lr: 0.001
        weight_decay: 5e-07

    Parameter Group 2
        amsgrad: True
        betas: (0.9, 0.999)
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        initial_lr: 0.01
        lr: 0.01
        maximize: False
        name: interactions_no_decay
        swa_lr: 0.001
        weight_decay: 0.0

    Parameter Group 3
        amsgrad: True
        betas: (0.9, 0.999)
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        initial_lr: 0.01
        lr: 0.01
        maximize: False
        name: products
        swa_lr: 0.001
        weight_decay: 5e-07

    Parameter Group 4
        amsgrad: True
        betas: (0.9, 0.999)
        capturable: False
        differentiable: False
        eps: 1e-08
        foreach: None
        fused: None
        initial_lr: 0.01
        lr: 0.01
        maximize: False
        name: readouts
        swa_lr: 0.001
        weight_decay: 0.0
    )
    2024-06-12 11:35:44.620 INFO: Using gradient clipping with tolerance=10.000
    2024-06-12 11:35:44.620 INFO: Started training
    2024-06-12 11:35:58.874 INFO: Epoch None: loss=0.0737, RMSE_E_per_atom=117.9 meV, RMSE_F=420.1 meV / A, RMSE_stress_per_atom=0.0 meV / A^3
    Traceback (most recent call last):
      File "/home/xqliu/mace-Na-Na3SbS4-20240612/run_train.py", line 6, in <module>
        main()
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/cli/run_train.py", line 51, in main
        run(args)
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/cli/run_train.py", line 687, in run
        tools.train(
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/tools/train.py", line 179, in train
        train_one_epoch(
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/tools/train.py", line 289, in train_one_epoch
        _, opt_metrics = take_step(
                         ^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/tools/train.py", line 319, in take_step
        output = model(
                 ^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/models.py", line 369, in forward
        node_feats = product(
                     ^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/blocks.py", line 226, in forward
        node_feats = self.symmetric_contractions(node_feats, node_attrs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/symmetric_contraction.py", line 82, in forward
        outs = [contraction(x, y) for contraction in self.contractions]
                ^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/mace/modules/symmetric_contraction.py", line 222, in forward
        c_tensor = contract_weights(
                   ^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/fx/graph_module.py", line 737, in call_wrapped
        return self._wrapped_call(self, *args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/fx/graph_module.py", line 317, in __call__
        raise e
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/fx/graph_module.py", line 304, in __call__
        return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
        return forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "<eval_with_key>.111", line 6, in forward
      File "/home/xqliu/anaconda3/envs/mace_gpu/lib/python3.12/site-packages/torch/functional.py", line 1213, in tensordot
        return _VF.tensordot(a, b, dims_a, dims_b)  # type: ignore[attr-defined]
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: CUDA error: unknown error
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

We are not sure what is causing this. If you have any ideas, we would be grateful for your help.
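
The error message itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1`, so that kernel launches are synchronous and the traceback points at the call that actually failed. Here is a minimal sketch of such a rerun. It assumes the same `run_train.py` is in the working directory; `blocking_env` and `rerun` are hypothetical helper names, and we have not confirmed that this changes the outcome.

```python
import os
import subprocess
import sys


def blocking_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base is None else base)
    # Must be set before the training process initialises CUDA.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun(train_args):
    """Re-run run_train.py with the given flags under CUDA_LAUNCH_BLOCKING=1.

    Assumption: run_train.py sits in the current working directory.
    """
    cmd = [sys.executable, "run_train.py", *train_args]
    return subprocess.run(cmd, env=blocking_env(), check=False).returncode


# Usage: rerun(['--name=H5-test', '--device=cuda', ...])  # same flags as before
```

Setting the variable in a child process environment, rather than inside an already running interpreter, avoids the case where CUDA has been initialised before the variable takes effect.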

ACEsuit / mace

Error in online data loading for large datasets #465