WIP: Performance 1 - Fixing States and 1pBasis Implementations

cortner commented 2 years ago

First steps towards making ACE2 performant. There are a few cleanups and the beginnings of some benchmarks, but the main contributions here are:

- rewrite the State and DState implementations to get around a Julia 1.7 bug
- fix a nasty bug in the product1p basis gradient implementation

The 1p basis evaluation now seems reasonably performant. But unfortunately there is a segfault left that I cannot track down yet. It occurs only with Julia 1.7, not with 1.6, when testing the B1pMultiplier:

[ Info: Testing B1pMultiplier
[ Info: some basic tests

signal (11): Segmentation fault: 11
in expression starting at /Users/ortner/gits/ACE.jl/test/test_multiplier.jl:35
ntuple at ./ntuple.jl:0
unknown function (ip: 0x280d16cdf)
_jl_invoke at /Users/ortner/gits/julia17/src/gf.c:0 [inlined]
jl_apply_generic at /Users/ortner/gits/julia17/src/gf.c:2429
set_spec! at /Users/ortner/gits/ACE.jl/src/product_1pbasis.jl:197
init1pspec! at /Users/ortner/gits/ACE.jl/src/sparsegrids.jl:27
unknown function (ip: 0x280d1599b)
_jl_invoke at /Users/ortner/gits/julia17/src/gf.c:0 [inlined]
jl_apply_generic at /Users/ortner/gits/julia17/src/gf.c:2429
jl_apply at /Users/ortner/gits/julia17/src/./julia.h:1788 [inlined]
do_call at /Users/ortner/gits/julia17/src/interpreter.c:126
eval_body at /Users/ortner/gits/julia17/src/interpreter.c:0
....

and the rest is not so interesting. It seems reproducible.

cortner commented 2 years ago

Update ... I'm running into a horrible issue:

- If I profile with 1.6 instead of 1.7, I lose more than a factor of 2 in performance.
- If I profile on our server, I lose a factor of 4-5, even on 1.7.

It seems that somehow my optimizations work really well but only on my M1 processor?

cortner commented 2 years ago

Performance on 1.6 is now fixed on my laptop, but still horrendous on our server. This is very strange...

cortner commented 2 years ago

~~@andresrossb If you have a free moment, would you be willing to pull this branch and run `profile/profile_basis.jl` on your laptop? @zhanglw0521 as well?~~

Ignore this, I messed up my test. The performance is now comparable on the server and on my M1.

cortner commented 2 years ago

(but seriously, the M1pro is a beast - I get a factor 3 faster performance for the gradients than on the EPYC, which isn't exactly a slouch either...)

zhanglw0521 commented 2 years ago

> (but seriously, the M1pro is a beast - I get a factor 3 faster performance for the gradients than on the EPYC, which isn't exactly a slouch either...)

Sounds really really attractive...

cortner commented 2 years ago

yes, but you'll never get enough M1pro cores to make it a serious contender....

cortner commented 2 years ago

Looks like PkgBenchmark is worth integrating into our workflow... judge.pdf

cortner commented 2 years ago

This brings significant improvements to basis evaluation, so I'll merge and tag before we move on to LinearACEModel.

ACEsuit / ACE.jl

WIP: Performance 1 - Fixing States and 1pBasis Implementations #93