Closed mhsiron closed 4 months ago
Hi - can you try with
atom_modify map yes
as mentioned here https://mace-docs.readthedocs.io/en/latest/guide/lammps.html#using-the-model-in-lammps?
I'm not sure that's it, but worth checking.
Encountering a similar issue; I've attached the error file and relevant sections from the lammps input script.
It seems to fail consistently during an NVT run, but has no issue with an NPT-style run. I've tried restarting, using different random seeds and starting structures, but the simulation consistently fails about 2000 steps into the NVT section of the simulation. Array.25022699_1.err.txt md_LPK.lmp.txt
Apologies for the somewhat badly commented LAMMPS script; I was just modifying an old script to quickly get some data.
@owen-rett I'm trying to understand this error message
/var/spool/slurmd/job25022700/slurm_script: line 54: 3599961 Aborted /home/gridsan/orettenmaier/Lammps_MACE/lammps/build/lmp -k on g 1 -sf kk -in md_LPK.lmp
Traceback (most recent call last):
File "MSD_K_Det.py", line 29, in <module>
K_Zr, K_Ce, K_O = msd_K_trans(Temp, Mean_msdZr, Mean_msdO)
File "MSD_K_Det.py", line 10, in msd_K_trans
if msdCe < 0.00001:
UnboundLocalError: local variable 'msdCe' referenced before assignment
Is this the root cause or something that's happening after LAMMPS fails?
Ah, sorry, the slurm script I'm using runs an initial simulation to determine lattice constants and mean squared displacements, calls a Python script to prepare subsequent simulations, and then performs those simulations. The error there is from a typo in the Python script (shifting from having 3 species to 2 species). That's my fault. The quoted error is entirely unrelated to LAMMPS.
The initial simulation does still exhibit the same error as mhsiron's during an NVT (with Langevin integration) section, however.
Edit: For Clarity
I don't see anything wrong with your input on first pass. If it's true that the problem is happening long into an NVT simulation, I'm worried it will be challenging to debug. Can you try to reduce it to a minimal example (as few LAMMPS commands as possible) that fails reliably? I can try to reproduce from that; you can email your model if you don't want to post it. Sorry, not sure what else to suggest.
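A stripped-down input in this spirit might look something like the following sketch (pair_style/pair_coeff syntax per the MACE docs; the data file name, element list, and thermostat parameters are placeholders, not from this thread):

```
units        metal
atom_style   atomic
atom_modify  map yes
newton       on
read_data    system.data               # placeholder structure
pair_style   mace no_domain_decomposition
pair_coeff   * * model-lammps.pt Zr O  # placeholder model and elements
velocity     all create 1400.0 12345
fix          1 all nve
fix          2 all langevin 1400.0 1400.0 0.1 12345
timestep     0.001
thermo       250
run          10000
```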
Hi @wcwitt, same error with atom_modify map yes. However, I may have found out a bit more about what might cause the error:
Models trained with L>0 appear to crash. It turns out my training command had max_L=1; only models trained with max_L=0 seem to work for my simulation. Could this be a memory issue? My A100 GPU has 80GB of memory, and this seems like a relatively small simulation (~1300 atoms).
For the MACE-MP-0 medium and large models I get an explicit CUDA memory stacktrace, which I do not get for the L0 model, but for my own trained models I get the more ambiguous stack trace above.
Hi @mhsiron thanks for this. That does seem a bit small for a memory issue, but could you try with ~500 atoms just to see?
Hi @wcwitt,
I ran a ~150 atom simulation with an L1 model, and it indeed works. GPU utilization is in the range of ~60-90%, and memory usage seems to be around 22GB. From this it makes sense that ~10x more atoms would make the GPU run out of memory.
My question: are the L1 models really that memory intensive? Or does this point to a potential memory leak somewhere?
Certainly memory requirements go up when you go from L=0 to L=1, and even more if you go to L=2.
What's the density of your system (in terms of the average number of neighbors during the simulation)? The L=1 model should run 1K atoms on 80GB quite easily if the density is below 50.
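For a rough sense of that figure of merit, here is a back-of-envelope NumPy sketch. The ~1300 atoms and ~24560 A^3 volume are taken from numbers appearing elsewhere in this thread; the 6 A model cutoff is an assumption, not confirmed for this model:

```python
import numpy as np

def avg_neighbors(n_atoms, volume, r_cut):
    """Estimate the average neighbor count as (number density) * (cutoff-sphere volume)."""
    density = n_atoms / volume
    return density * (4.0 / 3.0) * np.pi * r_cut**3

# ~1300 atoms in ~24560 A^3 (from the thermo log in this thread),
# with an assumed 6 A model cutoff (the 14 A in the neighbor-list
# output includes ghost/skin contributions)
est = avg_neighbors(1300, 24560.0, 6.0)
print(f"estimated average neighbors per atom: {est:.1f}")
```

With these assumed numbers the estimate lands just under 50, i.e. right at the edge of the density regime mentioned above.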
Hi @mhsiron thanks for sticking with this - we definitely appreciate the detailed reports.
My question: are the L1 models really that memory intensive? Or does this point to a potential memory leak somewhere?
I'm not sure. Like @ilyes319, I wouldn't normally expect problems on that machine with L=1 and <2000 atoms. But use of the LAMMPS interface has been fairly low until recently, so I'm open to all options.
If you have time, you could try launching analogous calculations from Python/ASE, just to see if the memory limitations are similar.
Hi all,
Per LAMMPS output for L1 on the <150 atom simulation:
Neighbor list info ...
update: every = 1 steps, delay = 0 steps, check = yes
max neighbors/atom: 2000, page size: 100000
master list distance cutoff = 14
ghost atom cutoff = 14
binsize = 14, bins = 2 1 2
1 neighbor lists, perpetual/occasional/extra = 1 0 0
(1) pair mace/kk, perpetual
attributes: full, newton on, kokkos_device
pair build: full/bin/kk/device
stencil: full/bin/3d
bin: kk/device
These appear to be the defaults, as I have not set any command that would change them. I have tried adding:
neighbor 10.0 bin
But it appears to be overridden? At least I get the same LAMMPS output.
I added:
neigh_modify one 50 page 2500
Will report back!
Does not seem to help. Actually, the L0 simulation also proves unstable. The only stable model I can run is the pre-trained MACE-MP-0 small (L0) model.
Another peculiar thing I noticed is that the temperature in the log suddenly drops to quite a low value prior to the crash; example below. The columns are: timestep (ps), temperature (K), total energy (eV), pressure (bar), length in x (A), volume (A^3), density.
11.5 1458.80157385003 -8173.93823005146 -3148.14775245486 29.49934686 24560.4591488324 1.4538743316293
11.75 1375.7611038842 -8186.71536429021 2634.97363948781 29.49934686 24560.4591488324 1.4538743316293
12 1425.31323684682 -8181.87041265065 -666.217691988384 29.49934686 24560.4591488324 1.4538743316293
12.25 1431.21964335045 -8191.86786235401 -2823.12112557144 29.49934686 24560.4591488324 1.4538743316293
12.5 0.00157387697505881 -8344.35923152783 -8366.67115057049 29.49934686 24560.4591488324 1.4538743316293
12.75 1.22775118971289e-17 -8326.09875594358 -6949.76472390582 29.49934686 24560.4591488324 1.4538743316293
13 2.28493309030338e-08 -8371.61500883076 -11399.9341038866 29.49934686 24560.4591488324 1.4538743316293
13.25 8.77094472954126e-07 -8391.24708921418 -13039.2885264807 29.49934686 24560.4591488324 1.4538743316293
13.5 5.24128354736295e-06 -8404.93236755041 -13464.1034805214 29.49934686 24560.4591488324 1.4538743316293
13.75 1.67908974908941e-05 -8416.07696635189 -13365.7167324098 29.49934686 24560.4591488324 1.4538743316293
14 3.92758414519387e-05 -8426.00378684878 -13130.2069142645 29.49934686 24560.4591488324 1.4538743316293
14.25 7.64073353514439e-05 -8435.0017582836 -12967.8196410895 29.49934686 24560.4591488324 1.4538743316293
14.5 0.00013213828234377 -8443.90439332092 -12188.7756332344 29.49934686 24560.4591488324 1.4538743316293
14.75 0.000265051994964286 -8454.90567015758 -11823.8413844585 29.49934686 24560.4591488324 1.4538743316293
15 0.000267630919262236 -8462.56120861329 -11693.4464379121 29.49934686 24560.4591488324 1.4538743316293
15.25 0.000345222275526478 -8468.46583212703 -11477.1279654604 29.49934686 24560.4591488324 1.4538743316293
15.5 0.00043740866282505 -8473.63262891532 -11208.4617558759 29.49934686 24560.4591488324 1.4538743316293
15.75 0.000547741723244913 -8478.19656335514 -10935.5283521724 29.49934686 24560.4591488324 1.4538743316293
16 0.0006980055037539 -8482.3480800735 -10696.3698486076 29.49934686 24560.4591488324 1.4538743316293
That just sounds like an unstable model, and when the atoms explode the cell or some neighbor list or something becomes huge and it crashes. I've had similar problems fine-tuning MP0, although a colleague here is having better luck. In our case it seems to depend on how close our DFT parameters are to the ones that MPtrj used.
I see, thanks @bernstei. I get similar results for a DFT-trained L0 model, without starting from MP-0. Any recommendations for what kind of data to include to help make the model more stable? Should I include some dimer vs. distance data in my dataset? Or is it a problem of not training enough?
As for an L>0 model with ~1000 atoms, is that just infeasible with an 80GB graphics card from a memory standpoint?
I should also add that, in the previous example, the T drop does not necessarily lead to the simulation crashing; and for the same network, the crash does not necessarily come with a sudden T drop either. The structure doesn't indicate anything peculiar either: there are no super-close atoms, and no volume/force implosion. The same input script can lead to crashing at different timesteps.
12 1425.31323684682 -8181.87041265065 -666.217691988384 29.49934686 24560.4591488324 1.4538743316293
12.25 1431.21964335045 -8191.86786235401 -2823.12112557144 29.49934686 24560.4591488324 1.4538743316293
12.5 0.00157387697505881 -8344.35923152783 -8366.67115057049 29.49934686 24560.4591488324 1.4538743316293
12.75 1.22775118971289e-17 -8326.09875594358 -6949.76472390582 29.49934686 24560.4591488324 1.4538743316293
13 2.28493309030338e-08 -8371.61500883076 -11399.9341038866 29.49934686 24560.4591488324 1.4538743316293
13.25 8.77094472954126e-07 -8391.24708921418 -13039.2885264807 29.49934686 24560.4591488324 1.4538743316293
13.5 5.24128354736295e-06 -8404.93236755041 -13464.1034805214 29.49934686 24560.4591488324 1.4538743316293
13.75 1.67908974908941e-05 -8416.07696635189 -13365.7167324098 29.49934686 24560.4591488324 1.4538743316293
14 3.92758414519387e-05 -8426.00378684878 -13130.2069142645 29.49934686 24560.4591488324 1.4538743316293
14.25 7.64073353514439e-05 -8435.0017582836 -12967.8196410895 29.49934686 24560.4591488324 1.4538743316293
14.5 0.00013213828234377 -8443.90439332092 -12188.7756332344 29.49934686 24560.4591488324 1.4538743316293
14.75 0.000265051994964286 -8454.90567015758 -11823.8413844585 29.49934686 24560.4591488324 1.4538743316293
15 0.000267630919262236 -8462.56120861329 -11693.4464379121 29.49934686 24560.4591488324 1.4538743316293
15.25 0.000345222275526478 -8468.46583212703 -11477.1279654604 29.49934686 24560.4591488324 1.4538743316293
15.5 0.00043740866282505 -8473.63262891532 -11208.4617558759 29.49934686 24560.4591488324 1.4538743316293
15.75 0.000547741723244913 -8478.19656335514 -10935.5283521724 29.49934686 24560.4591488324 1.4538743316293
16 0.0006980055037539 -8482.3480800735 -10696.3698486076 29.49934686 24560.4591488324 1.4538743316293
16.25 0.000901826930070773 -8486.26240725677 -10498.6184681285 29.49934686 24560.4591488324 1.4538743316293
16.5 0.00117301642049218 -8490.03795700827 -10331.5334883053 29.49934686 24560.4591488324 1.4538743316293
16.75 0.00154166627110766 -8493.73999867024 -10180.7036119053 29.49934686 24560.4591488324 1.4538743316293
17 0.00206806583224074 -8497.43861375988 -10033.2779155224 29.49934686 24560.4591488324 1.4538743316293
17.25 0.00287840323482185 -8501.22983009579 -9877.7939081323 29.49934686 24560.4591488324 1.4538743316293
17.5 0.00427271873252964 -8505.27391749629 -9704.80646541264 29.49934686 24560.4591488324 1.4538743316293
17.75 0.00663896018741823 -8509.7471013976 -9471.37126467432 29.49934686 24560.4591488324 1.4538743316293
18 0.0112570350367217 -8514.67546145644 -9144.47861455316 29.49934686 24560.4591488324 1.4538743316293
18.25 0.0275419175723079 -8520.64779098596 -8747.62193215947 29.49934686 24560.4591488324 1.4538743316293
18.5 0.298230169317817 -8531.21976964688 -8155.84901776477 29.49934686 24560.4591488324 1.4538743316293
18.75 1377.42958346478 -8149.90292963912 3299.29676770027 29.49934686 24560.4591488324 1.4538743316293
19 1349.90744584849 -8148.02735171268 -3364.58806787548 29.49934686 24560.4591488324 1.4538743316293
19.25 1431.3736665173 -8147.73228027619 1283.91641343253 29.49934686 24560.4591488324 1.4538743316293
19.5 1388.11414468739 -8160.03829785877 -1504.08681731051 29.49934686 24560.4591488324 1.4538743316293
19.75 1404.50542270096 -8173.47840745554 -3058.49880175371 29.49934686 24560.4591488324 1.4538743316293
20 1457.27936083213 -8167.73656168868 2996.79414644363 29.49934686 24560.4591488324 1.4538743316293
"Same" ... "different time steps": same seed for things like random initial velocities, or are you not being quite that precise when you say "same"?
The initial drop in T happens very fast, and then reverses very fast, yet the total energy goes down gradually, then jumps up. You should plot T (or kinetic energy) and potential energy for every time step (and preferably also save the trajectory), to see in detail what's happening during that T drop. Dropping from T= 1400 K to << 1 K seems essentially impossible to me, just for thermodynamic/stat mech reasons. You can get a T drop if you get a phase transition to a higher E phase (higher potential, so lower kinetic, energy), but that's not the usual behavior anyway (basically an endothermic reaction, so has to be entropy driven), and even if that's what was happening I don't see how it can absorb 99.9% of the KE.
Understood -- I will generate additional training data and compare performance. Will report back if it fixes the problem.
I didn't mean additional training data (although presumably that'll help stability). I meant looking in more detail at this LAMMPS run test, to see how it changes during the weird trajectory. Maybe independently calculate the potential energies in the configuration before/after the T drop (which is presumably associated with a PE increase, assuming total energy is conserved, at least roughly).
Spent a while tinkering with lammps settings; it seems the issue appears when I perform Langevin dynamics without zeroing the random force (the default). That is, combining "fix nve" and "fix langevin zero no". This seems to happen regardless of system temperature; I've tried running the system at 70 K, at 800 K, and at 1800 K, and in all cases running Langevin dynamics without zeroing of the random force results in a crash within ~2000-10000 timesteps (with a 1 fs timestep).
That said, turning on zeroing of the random force (fix langevin zero yes) seems to get rid of the crashing issue completely. This is probably the smarter choice in general (running with zero no was a mistake I made when setting up the lammps script above), regardless of the crashing issue, but I wanted to note it here in case anyone has the same issue in the future. I've run a few different MACE-trained potentials, albeit trained on the same dataset, with different choices of L, cutoff, and number of irreps, and all seem to crash at around the same timestep. It does seem possible that the potentials I am using reach instability, and the crash results from that; however, I've not been able to throw the configurations into DFT yet. I'll try to perform some additional testing, specifically targeting systems where DFT is tractable, once I have free GPU resources next week.
I've been using the MACE potential with "no domain decomposition" on a single 32 GB VRAM GPU, and have checked system sizes between 384 and 2592 atoms. GPU RAM usage during "standard" molecular dynamics varies between 8 GB and 30 GB, depending on model parameter choice and system size.
Running dynamics with a Nose-Hoover thermostat also seems to completely remove the crashing issue, although thermostat choice is obviously dependent on the variables that are being measured, so may not be an acceptable solution in all cases.
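For reference, the thermostat setups compared above look roughly like this in a LAMMPS input (temperature, damping, and seed values are placeholders):

```
# Langevin: pair an integrator fix with the thermostat fix.
# "zero yes" constrains the random kicks to sum to zero each step,
# so the thermostat adds no net momentum (no center-of-mass drift).
fix integrate all nve
fix thermostat all langevin 800.0 800.0 0.1 48279 zero yes

# Nose-Hoover alternative (a single fix replaces both lines above):
# fix integrate all nvt temp 800.0 800.0 0.1
```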
If you don't zero the forces the system will presumably drift. I'm not sure how the positions are processed by LAMMPS before passing to torch, and where the neighbor list happens, but is it possible that the "raw" positions end up very large, and the neighbor list code is doing something silly, e.g. trying to create bins for a very large apparent box (even though if wrapped by the pbcs they'd all be reasonable)?
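To illustrate the wrapped-vs-raw distinction in that scenario, here is a toy NumPy sketch (not from the thread): unwrapped coordinates can grow without bound under drift, while their periodic images stay inside the box.

```python
import numpy as np

# An atom that has drifted roughly ten box lengths away: a binning scheme
# applied to the raw coordinates would see a huge apparent box, even though
# the wrapped image is perfectly ordinary.
box = np.array([29.5, 29.5, 29.5])          # orthorhombic box lengths (A)
raw = np.array([[305.2, -3.1, 14.8]])       # drifted, unwrapped position
wrapped = raw - box * np.floor(raw / box)   # map back into [0, box)
print(wrapped)
```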
To follow up all, and thanks for your help: adding additional training data (I added dimers vs. distance) did make the exact same input script work, with no sudden temperature drop. I have not had time to check the PE vs. TE yet for the run that did fail, but I did have time to notice that the crash occurred whenever two atoms got closer than ~0.5 A. In terms of L1, changing the page size and max neighbor settings does also help with memory usage.
To recap -
The drift is probably responsible for the weird temperature, no? the temperature calculation is based on the atomic velocities.
The drift is probably responsible for the weird temperature, no? the temperature calculation is based on the atomic velocities.
That can be an issue, but I don't see how it could lead to it dropping from 1400 K to 0.005 K ever (and especially not over a single print interval). And a couple of intervals later to 1e-17 K.
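To put rough numbers on this point, here is a self-contained NumPy sketch using the standard kinetic definition of temperature; the atomic mass and drift speed are illustrative assumptions:

```python
import numpy as np

# Sanity check: a uniform center-of-mass drift cannot collapse the kinetic
# temperature from ~1400 K to ~0 K.  T = m * sum(v^2) / (3 N kB).
kB = 1.380649e-23          # J/K
amu = 1.66053906660e-27    # kg
m = 91.224 * amu           # Zr mass (illustrative species choice)
N = 1000
T_target = 1400.0

rng = np.random.default_rng(0)
# Maxwell-Boltzmann velocity components at T_target
v = rng.normal(0.0, np.sqrt(kB * T_target / m), size=(N, 3))

def kinetic_T(vel):
    return m * np.sum(vel**2) / (3 * len(vel) * kB)

v_drift = v + np.array([200.0, 0.0, 0.0])   # add a 200 m/s uniform drift
v_fixed = v_drift - v_drift.mean(axis=0)    # remove the COM velocity

# All three values stay on the order of 1400 K; nothing here can
# produce the 1e-17 K seen in the log.
print(kinetic_T(v), kinetic_T(v_drift), kinetic_T(v_fixed))
```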
true
Sorry, I think the drift is only happening in my case, cannot speak for mhsiron's case where a major temperature drop occurs. I've seen major temperature drops in simulations where the system is blowing up due to model instability, and seemingly the lammps thermostat is desperately trying to get the atom velocities under control.
Sorry, I confused the two issues. So your crashes get resolved by zeroing the Langevin force sum? I would still like to understand what happens when you have drift, and why that leads to crashes.
I would still like to understand what happens when you have drift, and why that leads to crashes.
I agree. Does anyone know where the code called by LAMMPS gets its neighbor list? Is it from LAMMPS, or does it do its own (when domain decomposition is off, at least)? If the latter, that'd be my first suspect.
As far as I can tell zeroing the Langevin force sum has completely gotten rid of the crashing issue. I don't know the internals of MACE well enough to really speculate on why this is happening, but I've not seen any crashes yet.
Do you have a trajectory from a run that crashed, so we can check if the atoms are drifting?
I don't have one on hand; but can generate one by tomorrow.
Thanks. I think that'd be useful. Do we think it'd be simpler if we moved @owen-rett's problem to a new issue? [edited]
I'll make a new issue real quick; this seems to be getting a bit congested
I would still like to understand what happens when you have drift, and why that leads to crashes.
I agree. Does anyone know where the code called by LAMMPS gets its neighbor list? Is it from LAMMPS, or does it do its own (when domain decomposition is off, at least)? If the latter, that'd be my first suspect.
MACE gets its neighbourlist from lammps.
I was trying to reproduce the original error using a simplified script, and seem to be getting a separate one, "lost atoms", which makes a lot more sense. Regardless, ensuring that the random forces sum to zero seems to be best practice. I've attached a trajectory where this happens to this reply, but given that I can't rule out incompleteness of the training set as the root cause, I don't think I can necessarily call it an issue with system drift.
I think I'll just chalk this one up to a few mistakes in my input script, model instability, or a combination of both, and work on fixing both problems. If the same issue appears again, I'll make a proper issue about it and try to document it more fully. I don't have a good reason why setting the sum of random forces to zero seems to let trajectories run without issue, but given that I can't rule out an issue with my model, I'm inclined to put the blame on that.
EDIT:
Examining the trajectory, it seems that a zirconium ion got quite close to another zirconium ion, which is likely what is causing the blow-up; I'm now more inclined to blame model instability in this case. Again, I cannot say why zeroing the sum of random forces seems to prevent this issue; I ran an exact copy of this simulation with the forces zeroed and didn't see it. My apologies for not examining the trajectory more closely. The original problem happened in a thermodynamic integration simulation where I was not saving LAMMPS trajectories, in order to save disk space.
Edit 2: My current suspicion is that something to do with boundary crossing is going wrong and placing a zirconium near another zirconium, which then upsets my model, and this in turn caused the memory issues above.
Our experience so far is that if you start from a reasonable configuration (ambient pressure and temperature), then MD will not make things blow up. I'm very interested in cases where MD blows up. We are working on a fix that ensures correct atom-atom repulsion at close distances regardless of conditions. If the atoms got close because of some silliness to do with initial conditions, or you are doing random structure search or similar, you might cope with it better by doing a few steps of relaxation with a purely repulsive (or LJ) model before you turn on MACE-MP-0; it really depends on your application.
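One way to sketch that suggestion in a LAMMPS input (the soft-potential prefactor, cutoff, and minimize tolerances are illustrative, and the MACE pair_coeff line uses placeholder model/element names):

```
# pre-relax with a purely repulsive potential to push apart any
# unphysically close atoms, then switch to the MACE model
pair_style   soft 2.0
pair_coeff   * * 10.0
minimize     1.0e-4 1.0e-6 100 1000

pair_style   mace no_domain_decomposition
pair_coeff   * * my_model.model-lammps.pt Zr O   # placeholder model/elements
```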
If it's a generic atoms getting too close issue, I don't see how zeroing the total force could make a difference. Would you be able to put together a complete reproducing example (LAMMPS input files + model file) ? Even if we have to run it, I think it's important to figure out whether (and if so why) it's happening when the forces are raw but not when they are zeroed.
I should have some free GPU resources early next week. I'm going to try to put together an example using a potential I trained myself, and then see if the error reappears using a MACE-MP-0 model.
Thanks both @owen-rett and @mhsiron for sticking with this.
Hi all,
If it is of interest I am happy to provide an input script + trained MACE potential which starts from a stable atomic configuration and in the end has all atoms converge like so:
This was ultimately the structure that caused my network/simulation to exhibit the memory error above. It starts with two atoms getting too close during the simulation, and was fixed by adding a couple of additional structures with very close atoms when training my network.
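A minimal NumPy sketch of the kind of close-contact check relevant here (minimum-image convention, orthorhombic box; the coordinates are toy values, not from the attached structure):

```python
import numpy as np

def min_pair_distance(positions, box):
    """Minimum interatomic distance under the minimum-image convention
    (orthorhombic box only)."""
    n = len(positions)
    diff = positions[:, None, :] - positions[None, :, :]
    diff -= box * np.round(diff / box)   # wrap displacements into [-L/2, L/2)
    dist = np.linalg.norm(diff, axis=-1)
    dist[np.diag_indices(n)] = np.inf    # ignore self-distances
    return dist.min()

# toy frame: two atoms 0.4 A apart across the periodic boundary
box = np.array([29.5, 29.5, 29.5])
pos = np.array([[ 0.2,  5.0,  5.0],
                [29.3,  5.0,  5.0],
                [10.0, 10.0, 10.0]])
d = min_pair_distance(pos, box)
print(f"minimum pair distance: {d:.3f} A")
```

Running a check like this over each dump frame makes it easy to flag frames where any pair falls below a threshold such as 0.5 A.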
Hi @mhsiron I just read through everything again. This summary from you is very helpful
To follow up all, and thanks for your help: adding additional training data (I added dimers vs. distance) did make the exact same input script work, with no sudden temperature drop. I have not had time to check the PE vs. TE yet for the run that did fail, but I did have time to notice that the crash occurred whenever two atoms got closer than ~0.5 A. In terms of L1, changing the page size and max neighbor settings does also help with memory usage. To recap -
- L1 model crashed due to running out of memory on an A100 80GB GPU with system size > 1200 atoms at the default neighbors/atom and page size settings. Lowering the atom count or the page/neighbors-per-atom settings made L1 models run on 80GB.
- L0 model crashed due to unstable model when atoms got too close together. Adding additional dimer data to L0 model made the script successfully run.
and I don't think we need your trajectory. I'm still a bit surprised about the L1 failure with 1200 atoms, but helpful to know about your neighbor list experiments.
In contrast, I don't think we have a good explanation yet for @owen-rett's problem. We can move to a new issue or continue here - either way.
Ok, I've performed a number of runs, 4 each for the master branch and the repulsion branch of MACE. I have been using the repulsion branch for primary use, as I find it a bit more stable in general when performing NEB calculations; however, I am experiencing crashes regardless of which branch I use. I performed 2 runs using Langevin dynamics without zeroing of the random force, 1 run using Langevin dynamics with zeroing of the random force, and 1 using Nose-Hoover dynamics. The Langevin runs without zeroing of the random force all encounter a major error and crash between zero and 35k timesteps (at a 1 fs timestep); with zeroing of the random force I still get crashes, albeit typically around 180k timesteps. The Nose-Hoover dynamics does not seem to crash, although I only ran those simulations for 200k timesteps.
I've attached three tar files representing Langevin dynamics with zeroing turned off and turned on for the master branch, as well as a simulation run using Nose-Hoover dynamics. I can do so with all of the directories; however, GitHub is getting annoyed at me due to file sizes (I saved every 10 steps to try to catch where the failure happens). I've not included the MACE model in the uploads but can email it if necessary.
Langevin_Zero_No_A1.tar.gz Langevin_Zero_Yes_A1.tar.gz Nose_Hoover_A1.tar.gz
Edit: I still cannot rule out that the error springs from model quality; however, I've seen few issues running high-temperature (up to 2500 K) dynamics, even when going out to 0.5 ns, which I would expect to show model errors more clearly than relatively low-temperature fixed-cell dynamics. I've also seen similar crashing issues even when running at temperatures as low as 70 K, although only when using fixed unit cells (NVT-style dynamics).
Did you get the same failures using the repulsion branch as well?
Sorry, forgot to say, but I'm seeing the same errors on both branches, at similar timesteps. I've attached an example of a failing run from the repulsion branch below. Langevin_Zero_No_A1_Rep.tar.gz
I downloaded the langevin_zero file; there are only 504 frames, and no crash is visible (the atoms look perfectly normal in their positions).
That's what's confusing me. The atoms don't seem to be getting close enough to trigger a memory-related crash, and each run is only using ~8 GB out of 32 GB on the GPU during normal use, but e.g. in the langevin_zero simulation I am still getting the following: "an illegal memory access was encountered /home/gridsan/orettenmaier/Lammps_MACE/lammps/lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:161...."
Yes, I see the crash in the stderr, but I don't think it can be related to the physics of the simulation (like colliding atoms and such).
A part of me wonders if this is related to the LAMMPS build I'm using. I have a few extra packages turned on, so I'll quickly recompile with those turned off and report back.
Ok, after recompiling in a fresh build directory with more basic settings, the errors have disappeared, or at least haven't appeared 60k steps into a run that was crashing at 5k steps before recompilation. As such, I think the issue was down to the compilation. I'll begin adding packages back in and checking for instability.
My current suspicion is that this comes down to the fact that I had tried to turn on Kokkos CUDA UVM in the past, under the idea that I could squeeze a few more atoms into a simulation. The LAMMPS binary used for the above does not have the Kokkos UVM option turned on, but it was still compiled in the same directory as when I tried to do so (recompiling using cmake . -D Kokkos_ENABLE_CUDA_UVM=no ../cmake). I can't think of a reason for any of the other packages I used to cause memory issues, those being MISC and EXTRA-FIX.
If crashing issues reappear for either MISC or EXTRA-FIX, I'll report here, but I suspect this was down to the LAMMPS compilation flags I had used.
Edit: You have my apologies; this should have been one of the first things I checked.
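For anyone who hits a similar stale-cache problem, a hedged sketch of forcing a truly fresh Kokkos/CUDA build (directory layout per the standard LAMMPS CMake workflow; the flag list is illustrative and any extra packages should be added per the LAMMPS build docs):

```
# a stale CMakeCache.txt can silently keep old options (e.g. CUDA UVM) alive,
# so configure in a brand-new directory rather than reconfiguring in place
cd lammps
rm -rf build-fresh && mkdir build-fresh && cd build-fresh
cmake ../cmake \
  -D PKG_KOKKOS=yes -D Kokkos_ENABLE_CUDA=yes \
  -D Kokkos_ENABLE_CUDA_UVM=no
cmake --build . -j 8
```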
It's still baffling that the error only appeared when performing Langevin dynamics using specifically NVT-style dynamics, and never seems to appear using Nose-Hoover dynamics in NVT or NPT ensembles, nor when performing Langevin-style dynamics in an NPT ensemble.
Hello MACE developers, your help would be appreciated:
Describe the bug: LAMMPS with ML-MACE crashes at a different timestep each time the very same input script is submitted on the very same hosts, with the same stacktrace of an illegal memory access encountered.
The LAMMPS script works as expected with the MACE-MP-0 L0 trained model provided by this repository, but fails at a seemingly random timepoint with my own trained MACE model. I used the same training command as MACE-MP-0 L0 except for the distributed/num_workers options.
To Reproduce: Steps to reproduce the behavior:
LAMMPS Input script:
Input files below (initial structure, LAMMPS trained model): input_files.zip
Stacktrace:
System setup (please complete the following information):
Additional context: The simulation is completely stable with the pre-trained model. There is a possibility that the trained model file was trained on a different PyTorch version.