Closed: mjhong0708 closed this issue 9 months ago
Hey,
I see that you are not using the official MACE training script. I highly recommend that you do, as we use several regularizations that are essential to MACE's performance.
We are not seeing the same behavior with our trainer. As for the slowdown, I suspect you are caching information on your GPU and it is accumulating over epochs; I don't think it is coming from MACE.
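One way to check this hypothesis is to log GPU memory at the end of each epoch; if the numbers keep climbing, something (e.g. a loss tensor accumulated without .item() or .detach()) is being retained across epochs. This is a generic sketch using standard torch.cuda calls, not part of the MACE codebase:

```python
import torch

def log_gpu_memory(epoch: int) -> None:
    # Memory currently held by live tensors
    allocated = torch.cuda.memory_allocated() / 1024**2
    # Memory held by PyTorch's caching allocator (includes cached blocks)
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"epoch {epoch}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")
```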
First of all, thanks to the developers for sharing such a great model with the community.
Description
When I tried to train a MACE model on a large bulk dataset (containing DFT optimization trajectories from Materials Project), I experienced a slowdown of training over epochs. The slowdown occurs regardless of batch size, dataloader shuffling, etc. It does not happen with small-molecule datasets (e.g. MD17).
Code and data for reproduction
Below is the full script to reproduce my problem. To exclude any possible source of error other than the model itself, I wrote a minimal energy-only training loop. I also attach the training dataset (train_dataset.traj) and atomic_energies.json.
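For context, a minimal sketch of what such an energy-only loop looks like (this is illustrative only, not the attached script; the model, the dataloader, and the batch attributes such as batch.energy are placeholders):

```python
import time
import torch

def train(model: torch.nn.Module, loader, n_epochs: int = 1000, device: str = "cuda"):
    # Plain energy-only training loop: no forces, no EMA, no LR schedule.
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(n_epochs):
        t0 = time.perf_counter()
        for batch in loader:
            batch = batch.to(device)            # assumes a PyG-style batch object
            optimizer.zero_grad()
            pred_energy = model(batch)          # energy-only forward pass
            loss = loss_fn(pred_energy, batch.energy)
            loss.backward()
            optimizer.step()
        # Wall time per epoch, used to observe the slowdown reported below
        print(f"epoch {epoch}: {time.perf_counter() - t0:.2f} s")
```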
Screenshots
After a few hundred epochs, the time per epoch increases by more than 2x.
Running environment:
torch=2.0.1 (CUDA 11.7)
e3nn=0.4.4
ase=3.22.1
numpy=1.25.1
mace=0.2.0 (commit hash: 42659c3aa84f3318e568343261b4a8635fce0166)
What could be the reason for this problem? I'm curious whether it originates in e3nn or in mace. Thanks in advance!