ACEsuit / ACEpotentials.jl

Machine Learning Interatomic Potentials with the Atomic Cluster Expansion
MIT License

Explosion of energy per atom on ideal lattice sites with increasing supercell size #208

Closed: ConnorSA closed this issue 2 months ago

ConnorSA commented 3 months ago

Hey all,

I noticed some strange behaviour with ACE potentials I developed and wondered whether it could be a bug. In particular, the energy per atom would sometimes explode as the supercell size increased, when the atoms sat exactly on the ideal lattice sites. Here is a quick Python snippet of the test:

import ase.io
import pyjulip

atoms = ase.io.read('path/to/Ti_bcc_primitive.castep')
ace_potential = "path/to/potential.json"
ACE = pyjulip.ACE1(ace_potential)
for i in range(1, 10):
    a = atoms.copy()
    a = a * [i, i, i]   # i x i x i supercell
    a.calc = ACE
    print(a.get_potential_energy() / len(a))

which would output the following atomic energies:

-1593.736219589828
-1593.7362195898277
-1593.7362195898286
-1593.7362195898268
-1593.7362195898286
-1477.8662321399227
7594.737651725763
33464.98197817054
433759.4730885219

If I rattle the atomic positions by a tiny amount (0.00001 Å), I recover something much closer to what I expect for the energy per atom at large supercell sizes (for reference, the DFT value is -1593.72718054 eV). Something strange seems to happen numerically when the atoms sit in particular high-symmetry configurations. Has this been seen before, and is it reproducible? I'm not sure how consequential it is, but I wanted to throw it out there :)
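
For reference, the rattle check looks roughly like this with ASE (a sketch reusing the atoms and ACE objects from the snippet above; the stdev value mirrors the 0.00001 Å perturbation mentioned here):

# same loop as before, but nudging the atoms off the ideal lattice sites
for i in range(1, 10):
    a = atoms.copy() * [i, i, i]
    a.rattle(stdev=1e-5)   # tiny Gaussian displacement, ~0.00001 Angstrom
    a.calc = ACE
    print(a.get_potential_energy() / len(a))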

I have managed to get this effect with ACE potentials of different complexity, but it appears inconsistently and only for particular crystal symmetries (I haven't seen it for an orthorhombic box yet).

As far as ACE parameters go, I've seen this behaviour with different correlation orders and polynomial degrees. The case above uses order=3, total_degree=15, and a cutoff of 6.0 Å.

julia: ACEpotentials v0.6.2

Many thanks,

Connor, University of Warwick

cortner commented 3 months ago

Thanks for reporting this. Feels like some array hasn't been initialized as expected.

EDIT: apologies, I didn't read carefully enough. It sounds like I should be able to recreate this with a randomly generated ACE potential?

cortner commented 3 months ago

On a first attempt I cannot reproduce this, here is my script:

using ACEpotentials, JuLIP   # JuLIP provides bulk() and energy()

model = acemodel(elements = [:Ti, ],
                      order = 3,
                      totaldegree = 15,
                      rcut = 6.0,
                      Eref = [:Ti => -1586.0195, ])

# generate random parameters 
params = randn(length(model.basis))
# params = params ./ (1:length(params)).^2    # (this is not needed)
ACEpotentials._set_params!(model, params)

function energy_per_at(pot, i) 
   at = bulk(:Ti) * i 
   return JuLIP.energy(pot, at) / length(at)
end

E_per_at = [ energy_per_at(model.potential, i) for i = 1:10 ]

output:

julia> E_per_at
10-element Vector{Float64}:
 -1587.2195904712455
 -1587.2195904712453
 -1587.2195904712453
 -1587.2195904712457
 -1587.2195904712444
 -1587.219590471243
 -1587.2195904712419
 -1587.2195904712412
 -1587.2195904712414
 -1587.2195904712414
cortner commented 3 months ago

@ConnorSA -- if you can reproduce your bug with a randomly generated potential (similar to mine), then the most likely explanation is that this is a Julia-Python interface bug.

CC @casv2 @tjjarvinen @wcwitt

ConnorSA commented 3 months ago

Thank you for sharing this code snippet, I will run it now. I haven't tried a randomly generated ACE1 potential; as far as I know this was just something specific I was seeing with a potential I trained.

I was also wondering whether others have any simple single-species ACE potentials they could use to try to recreate this supercell-size effect elsewhere? In my tests it appears quite inconsistently: in the case of HCP Ti I didn't get this explosion, but for BCC the above is what I got.

As far as sharing JSONs goes, I'm not sure I'm in a position to share this immediately (copyright constraints from my funding).

After running the above Julia code snippet I got this output:

10-element Vector{Float64}:
 -1589.9067641795925
 -1589.906764179592
 -1589.906764179592
 -1589.9067641795918
 58988.57939066435
 -1589.906764179594
 -1589.906764179592
 -1589.9067641795905
 -1589.9067641795923
     1.4333233269000164e8
cortner commented 3 months ago

Ok, that's scary. Please post your project status.

ConnorSA commented 3 months ago

Ah ok, so I might be running some outdated packages:


⌃ [e3f9bc04] ACE1 v0.11.15
⌃ [3b96b61c] ACEpotentials v0.6.2
  [51974c44] ASE v0.5.4
  [352459e4] ExtXYZ v0.1.14-DEV `~/.julia/dev/ExtXYZ`
  [7073ff75] IJulia v1.24.2
⌃ [945c410c] JuLIP v0.14.2
  [6f286f6a] MultivariateStats v0.10.2
  [91a5bcdd] Plots v1.39.0
  [438e738f] PyCall v1.96.1
  [37e2e46d] LinearAlgebra
cortner commented 3 months ago

Would you mind rerunning in an updated or a fresh environment?
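
For reference, a fresh environment can be set up roughly like this (a sketch; "fresh_ace_env" is a placeholder project directory, and the registry line follows the ACEpotentials installation instructions in case the ACE registry is not already added):

using Pkg
Pkg.Registry.add(Pkg.RegistrySpec(url = "https://github.com/ACEsuit/ACEregistry"))
Pkg.activate("fresh_ace_env")                  # new, empty project
Pkg.add(["ACEpotentials", "JuLIP", "ASE"])     # pulls in current compatible versions
Pkg.status()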

cortner commented 3 months ago

I tried to reproduce the bug with an environment similar to yours,

(bug_connor) pkg> status
Status `~/gits/scratch/bug_connor/Project.toml`
⌅ [e3f9bc04] ACE1 v0.11.15
⌅ [3b96b61c] ACEpotentials v0.6.2
⌅ [945c410c] JuLIP v0.14.2

but couldn't...

I also tried it with a BCC structure and still can't reproduce it.

ConnorSA commented 3 months ago

I have just updated the packages and re-run the Julia snippet, and I'm getting more consistent results now:

10-element Vector{Float64}:
 -1586.6366298294947
 -1586.6366298294943
 -1586.6366298294943
 -1586.636629829494
 -1586.6260240094525
 -1586.636629829496
 -1586.6366298294945
 -1586.6366298294927
 -1586.6366298294947
 -1586.6338695567263

Not sure what was going on; I am just double-checking that this is resolved in the Python environment too :)

cortner commented 3 months ago

This is still strange. Note how I get machine precision, whereas your precision deteriorates for very specific system sizes. I would still consider this a bug.

If I can't reproduce this on my own hardware / environment, you may have to give me access to yours to debug.
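
For concreteness, the scatter can be quantified from the E_per_at vector produced by the script above (a rough sketch; at machine precision the relative spread should be within a few multiples of eps() ≈ 2.2e-16):

spread = maximum(E_per_at) - minimum(E_per_at)
println("absolute spread: ", spread, " eV;  relative: ", spread / abs(E_per_at[1]))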

jameskermode commented 3 months ago

Where are you running this, Connor? If it's on SCRTP I can also try to reproduce.

ConnorSA commented 3 months ago

This is on the hetmathsys nodes

cortner commented 3 months ago

@ConnorSA please also check whether you get this bug for cubic computational cells.
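
For example, the Python loop from the first post can be run with both the primitive and the conventional cubic BCC cell (a sketch reusing the ACE calculator from above; the lattice parameter a = 3.32 Å is a placeholder):

from ase.build import bulk

prim = bulk('Ti', 'bcc', a=3.32)              # primitive cell, non-orthogonal vectors
cube = bulk('Ti', 'bcc', a=3.32, cubic=True)  # conventional cubic cell, 90 degree angles
for cell in (prim, cube):
    for i in range(1, 10):
        a = cell * [i, i, i]
        a.calc = ACE
        print(a.get_potential_energy() / len(a))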

jameskermode commented 3 months ago

What version of Julia are you using?

ConnorSA commented 3 months ago

julia version 1.9.3

jameskermode commented 3 months ago

Testing now on the same machine with 1.9.2 (which I had to hand).

@cortner If you still have an SCRTP login (probably unlikely) then you can access these machines at hetmathsys1.scrtp.warwick.ac.uk. If not I will see if I can reproduce on a standalone Linux machine that I could give you access to.

ConnorSA commented 3 months ago

So after re-running in Python again I'm still getting some variation with supercell size (far less significant than before, but still indicating something is wrong). For orthorhombic cells it still appears to be fine (the same value across supercells to machine precision).

jameskermode commented 3 months ago

Reproduced on hetmathsys1 with similar sensitivity, using a clean environment and ACEpotentials v0.6.7. Trying on another machine with Julia 1.10.2.

10-element Vector{Float64}:
 -19550.32896019295
 -19550.328960192986
 -19550.328960193
 -19550.32896019301
 -18629.440850654497
 -19550.328960192863
 -19550.328960192768
 -19550.328960192823
 -19550.328960193307
 -19286.572770616513
jameskermode commented 3 months ago

Upgrading to Julia v1.10.4 on hetmathsys1 seems to remove the instability. Can you confirm this @ConnorSA? I saw the same on my other machine. No clue why this would be, however.

10-element Vector{Float64}:
 -6380.619862204434
 -6380.619862204448
 -6380.619862204453
 -6380.6198622044385
 -6380.619862204455
 -6380.619862204479
 -6380.619862204406
 -6380.619862204358
 -6380.61986220433
 -6380.619862204427
cortner commented 3 months ago

I can't reproduce it with 1.9 either:

(base) ortner@CosmicCrisp bug_connor % j19 --project=. reproduce1.jl
10-element Vector{Float64}:
 -1591.4878094100202
 -1591.48780941002
 -1591.48780941002
 -1591.4878094100204
 -1591.487809410019
 -1591.4878094100177
 -1591.4878094100166
 -1591.4878094100159
 -1591.4878094100159
 -1591.4878094100159
cortner commented 3 months ago

So the big question is what we do, given that updating everything removes the problem. Do we still try to track down the bug, or do we just post a warning and set new version bounds for everything?

jameskermode commented 3 months ago

Reverting to Julia 1.9.4 on my other Linux machine also brings back the instability, so I think there must have been a change somewhere, perhaps in the linear algebra libraries. It might be a macOS vs Linux thing.

I'd be OK with adding a regression test to check this doesn't come back and setting version bounds of Julia 1.10 onwards.

10-element Vector{Float64}:
 -6832.340515552775
 -6832.3405155528
 -6832.340515552792
 -6832.3405155528
 -6592.010255233377
 -6832.340515552779
 -6832.340515552786
 -6832.340515552833
 -6832.340515552818
 -6756.199673696969

julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 1 on 32 virtual cores
Environment:
  JULIA_IMAGE_THREADS = 1
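
To help narrow down whether the linear algebra backend differs between the two Julia builds, something like this could be compared on both (a diagnostic sketch; BLAS.get_config() requires Julia 1.7 or later):

using InteractiveUtils, LinearAlgebra
versioninfo()                                     # Julia build, OS, libm / LLVM details
println(BLAS.get_config())                        # which BLAS / LAPACK libraries are loaded
println("BLAS threads: ", BLAS.get_num_threads())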
ConnorSA commented 3 months ago

I've just updated to the latest Julia (fresh install), checked in the Julia environment, and I get this:

-9998.238035635983,
 -9998.238035635995,
 -9998.238035635984,
 -9998.238035636,
 -9998.238035636003,
 -9998.238035636034,
 -9998.23803563606,
 -9998.238035636054,
 -9998.238035635795,
 -9998.238035635635
cortner commented 3 months ago

"I'd be OK with adding a regression test to check this doesn't come back and setting version bounds of Julia 1.10 onwards."

Ok, this makes sense. I will first try to add a test that fails on Julia 1.9 in CI but passes on 1.10, and then we add the version bounds.

@ConnorSA -- for now does this solve your problem?
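
A regression test along these lines should catch it, since it fails under the affected setups (a minimal sketch based on the reproduction script above; the tolerance is a guess):

using ACEpotentials, JuLIP, Test

# randomly parameterised model, as in the reproduction script above
model = acemodel(elements = [:Ti,], order = 3, totaldegree = 15,
                 rcut = 6.0, Eref = [:Ti => -1586.0195,])
ACEpotentials._set_params!(model, randn(length(model.basis)))

energy_per_at(pot, i) = ( at = bulk(:Ti) * i; JuLIP.energy(pot, at) / length(at) )
E = [ energy_per_at(model.potential, i) for i = 1:10 ]

# the per-atom energy of a perfect crystal must not depend on the supercell size
@test maximum(abs.(E .- E[1])) <= 1e-8 * abs(E[1])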

ConnorSA commented 3 months ago

Yes, for now using the most recent version of Julia with fresh packages etc. has managed to resolve this bug.

This might be related: another colleague of mine had a potentially similar issue, although on the potential-fitting side. I can't remember the specifics, but it had something to do with using a different representation of a crystal cell (i.e. going from an orthorhombic description to a cell with acute angles, same symmetry), which resulted in some terrible fits when that particular geometry was included in the training data. I will ask him to try to recreate that scenario with updated versions of everything :)

Thanks again for all your help!

Connor

cortner commented 3 months ago

That sounds more like a neighbourlist issue. But either way, please ask him to file an issue.

cortner commented 2 months ago

Closed by #213 and #214