ACEsuit / mace

MACE - Fast and accurate machine learning interatomic potentials with higher order equivariant message passing.

Error accessing Git repository #483

Open turbosonics opened 1 week ago

turbosonics commented 1 week ago

Describe the bug
I installed MACE on a local GPU server cluster in a virtual environment with Python 3.9, CUDA 11.8, and PyTorch 2.3.0.
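
Roughly, the environment was set up like this (the exact commands below are approximate, not the ones I ran verbatim):

    # approximate setup; package versions and index URL are my best recollection
    python3.9 -m venv /home/venv_mace_gpu_cuda118_pytorch230
    source /home/venv_mace_gpu_cuda118_pytorch230/bin/activate
    pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu118
    pip install mace-torch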

My input script looks like this:

mace_run_train \
    --name="test5" \
    --seed=123 \
    --device=cuda \
    --default_dtype="float32" \
    --error_table="PerAtomRMSEstressvirials" \
    --model="MACE" \
    --r_max=6.0 \
    --hidden_irreps='128x0e + 128x1o + 128x2e' \
    --num_channels=32 \
    --max_L=2 \
    --compute_stress=True \
    --compute_forces=True \
    --train_file="${GEOMETRY_DIR}/${GEOMETRY_TRAIN_NAME}" \
    --valid_fraction=0.2 \
    --test_file="${GEOMETRY_DIR}/${GEOMETRY_TEST_NAME}" \
    --E0s='{3:-0.1133, 8:-0.3142, 40:-1.6277, 57:-0.4746}' \
    --energy_key="energy" \
    --forces_key="forces" \
    --stress_key="stress" \
    --virials_key="virial" \
    --loss="weighted" \
    --forces_weight=1.0 \
    --energy_weight=1.0 \
    --stress_weight=1.0 \
    --virials_weight=1.0 \
    --config_type_weights='{"Default":1.0}' \
    --optimizer="adam" \
    --batch_size=5 \
    --valid_batch_size=5 \
    --lr=0.005 \
    --amsgrad \
    --ema \
    --ema_decay=0.99 \
    --max_num_epochs=2000 \
    --patience=50 \
    --keep_checkpoints \
    --restart_latest \
    --save_cpu \
    --clip_grad=100.0

The training job then fails almost immediately, and the Slurm system writes the following error file:

Traceback (most recent call last):
  File "/home/venv_mace_gpu_cuda118_pytorch230/bin/mace_run_train", line 8, in <module>
    sys.exit(main())
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 51, in main
    run(args)
  File "/home/venv_mace_gpu_cuda118_pytorch230/lib/python3.9/site-packages/mace/cli/run_train.py", line 168, in run
    assert args.train_file.endswith(".xyz"), "Must specify atomic_numbers when using .h5 train_file input"
AssertionError: Must specify atomic_numbers when using .h5 train_file input

The log file also prints the following:

2024-06-21 14:47:55.515 INFO: MACE version: 0.3.5
2024-06-21 14:47:55.516 INFO: Configuration: Namespace(config=None, name='test5', seed=123, log_dir='logs', model_dir='.', checkpoints_dir='checkpoints', results_dir='results', downloads_dir='downloads', device='cuda', default_dtype='float32', distributed=False, log_level='INFO', error_table='PerAtomRMSEstressvirials', model='MACE', r_max=6.0, radial_type='bessel', num_radial_basis=8, num_cutoff_basis=5, pair_repulsion=False, distance_transform='None', interaction='RealAgnosticResidualInteractionBlock', interaction_first='RealAgnosticResidualInteractionBlock', max_ell=3, correlation=3, num_interactions=2, MLP_irreps='16x0e', radial_MLP='[64, 64, 64]', hidden_irreps='128x0e + 128x1o + 128x2e', num_channels=32, max_L=2, gate='silu', scaling='rms_forces_scaling', avg_num_neighbors=1, compute_avg_num_neighbors=True, compute_stress=True, compute_forces=True, train_file='testinput.extxyz', valid_file=None, valid_fraction=0.2, test_file='testinput.extxyz', test_dir=None, multi_processed_test=False, num_workers=0, pin_memory=True, atomic_numbers=None, mean=None, std=None, statistics_file=None, E0s='{3:-0.1133, 8:-0.3142, 40:-1.6277, 57:-0.4746}', keep_isolated_atoms=False, energy_key='energy', forces_key='forces', virials_key='virial', stress_key='stress', dipole_key='dipole', charges_key='charges', loss='weighted', forces_weight=1.0, swa_forces_weight=100.0, energy_weight=1.0, swa_energy_weight=1000.0, virials_weight=1.0, swa_virials_weight=10.0, stress_weight=1.0, swa_stress_weight=10.0, dipole_weight=1.0, swa_dipole_weight=1.0, config_type_weights='{"Default":1.0}', huber_delta=0.01, optimizer='adam', beta=0.9, batch_size=5, valid_batch_size=5, lr=0.005, swa_lr=0.001, weight_decay=5e-07, amsgrad=True, scheduler='ReduceLROnPlateau', lr_factor=0.8, scheduler_patience=50, lr_scheduler_gamma=0.9993, swa=False, start_swa=None, ema=True, ema_decay=0.99, max_num_epochs=2000, patience=50, foundation_model=None, foundation_model_readout=True, eval_interval=2, keep_checkpoints=True, save_all_checkpoints=False, restart_latest=True, save_cpu=True, clip_grad=100.0, wandb=False, wandb_dir=None, wandb_project='', wandb_entity='', wandb_name='', wandb_log_hypers=['num_channels', 'max_L', 'correlation', 'lr', 'swa_lr', 'weight_decay', 'batch_size', 'max_num_epochs', 'start_swa', 'energy_weight', 'forces_weight'])
2024-06-21 14:47:55.546 INFO: CUDA version: 11.8, CUDA device: 0
2024-06-21 14:47:55.897 INFO: Error accessing Git repository: /mnt/work/MACE_training/20240621_test/01a5_test5

Is this happening because the training set is huge?

bernstei commented 1 week ago

No, it's because the suffix .xyz is hardwired in your version of the code and you're using .extxyz. This has been fixed in #462. @ilyes319, what's the status of that PR?
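
In other words, the assertion in your traceback is just a literal suffix test on the train_file path, and a .extxyz name fails it. Illustrative check, using the file name from your log:

    # the same suffix test run_train.py applies to train_file
    python -c 'print("testinput.extxyz".endswith(".xyz"))'   # False -> assumed to be .h5, so atomic_numbers is required
    python -c 'print("testinput.xyz".endswith(".xyz"))'      # True  -> parsed as (ext)xyz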

turbosonics commented 1 week ago

> No, it's because the suffix .xyz is hardwired in your version of the code and you're using .extxyz. This has been fixed in #462. @ilyes319, what's the status of that PR?

Ha, my habit of using different extensions to distinguish the two xyz formats caused this error.

I renamed the geometry files to *.xyz and resubmitted. The training job now starts and the crash no longer occurs.
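
For anyone else hitting this before #462 is merged, the rename was essentially the following (file name taken from my log; a symlink should also satisfy the suffix check, though I only tested the rename):

    # rename so the hardwired .xyz suffix check passes; the file content is unchanged
    mv testinput.extxyz testinput.xyz
    # alternatively, keep the original file and point a .xyz symlink at it
    # ln -s testinput.extxyz testinput.xyz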

Thank you.