SunnySuite / Sunny.jl

Spin dynamics and generalization to SU(N) coherent states

SU(N) integration and sundry associated to-dos #14

Closed ddahlbom closed 2 years ago

kbarros commented 2 years ago

As we've discussed, another action item is to allow the spin dipoles to be arbitrarily normalized. I did a scan through the project, and I think these are necessary changes (but not necessarily sufficient):

Remove these rescalings with kappa: https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Ewald.jl#L470 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/FourierAccel.jl#L46 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Interactions.jl#L313 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/PairInteractions.jl#L130 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/StructureFactors.jl#L154

Remove spin_mags field here: https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Hamiltonian.jl#L163

Must preserve spin magnitude in multiple places for the integrators/samplers. Here are some examples (may be missing some, testing needed): https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Metropolis.jl#L144 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Integrators.jl#L175 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Integrators.jl#L201 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Integrators.jl#L223 https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Integrators.jl#L238

The scaling of Langevin integrator with |s| needs to be checked carefully. This may require working through some math: https://github.com/SunnySuite/Sunny.jl/blob/refactor-todos/src/Integrators.jl#L190

kbarros commented 2 years ago

The scaling of Langevin integrator with |s| needs to be checked carefully. This may require working through some math:

Actually, we have a better way to handle this now -- expressing the noise and damping terms entirely like this: $s \times (\eta + s \times \partial E / \partial s)$ seems to be the thing that properly generalizes to SU(N), and avoids complicated questions about the spin scaling. It's not how Mentink et al. wrote it, but I think we should adopt this improved convention.
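As a rough illustration of this convention (not Sunny's actual integrator), a single explicit Euler-Maruyama step could look like the following sketch; the noise amplitude, the damping constant $\lambda$, and the final projection step are all assumptions here:

```julia
using LinearAlgebra, Random

# Hypothetical one-step Langevin update with damping and noise both wrapped
# inside s × (η + s × ∂E/∂s), so the update is always tangent to the sphere.
function langevin_step!(s, ∇E; λ=0.1, kT=0.1, Δt=0.01, rng=Random.default_rng())
    S = norm(s)                           # spin magnitude to preserve
    η = sqrt(2λ*kT/Δt) .* randn(rng, 3)   # white noise (amplitude is an assumption)
    ds = -(s × ∇E) .- λ .* (s × (η .+ (s × ∇E)))
    s .+= Δt .* ds
    s .*= S / norm(s)                     # project back onto |s| = S
    return s
end
```

Because both terms are wrapped in an outer cross product with $s$, the drift never changes $|s|$ to first order, which is what sidesteps the spin-scaling questions.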

BTW, note that GitHub very recently started allowing LaTeX. You can get it with the $ ... $ notation.

ddahlbom commented 2 years ago

Thanks for making that list, Kipton. I'll go through the list and start making those modifications.

kbarros commented 2 years ago

Just checking in -- what's the status of this pull request? I ask partly because my recent work on anisotropy analysis should be merged into this.

ddahlbom commented 2 years ago

Got caught up adding a new sampling method for FeI2 and didn't get to the spin scaling revisions.

As far as I can tell, everything works, but scaling is still done the old way for LL, and there's no rescaling for SU(N).

I'm away for the next week, but feel free to do whatever you'd like to with it. The scaling stuff can be added later.


kbarros commented 2 years ago

Xiaojian and Matt pointed out that indexing into a lattice to get a coordinate should not use 1-based indexing. In other words, the first unit cell should give an offset of (0,0,0). In practice, this means we should use `brav .- 1` rather than `brav` directly inside Lattice.jl
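A minimal sketch of the intended convention (the lattice vectors below are hypothetical, and `cell_position` is not a Sunny function):

```julia
# Site coordinates should use a zero-based cell offset, so the first unit cell
# (brav = (1,1,1) in Julia's 1-based indexing) maps to the origin (0,0,0).
lat_vecs = [2.0 0.0 0.0;    # columns are (hypothetical) lattice vectors
            0.0 2.0 0.0;
            0.0 0.0 2.0]
cell_position(brav) = lat_vecs * collect(brav .- 1)
```

With this convention, `cell_position((1,1,1))` returns `[0.0, 0.0, 0.0]` and `cell_position((2,1,1))` returns `[2.0, 0.0, 0.0]`.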

ddahlbom commented 2 years ago

I just pushed the spin rescaling material. A few notes:

Original spin material (not SU(N))

SU(N) material

Other notes

kbarros commented 2 years ago

This is all shaping up very nicely! My inclination is to merge as soon as all these tests pass.

kbarros commented 2 years ago

Could you elaborate on how this spin rescaling has resulted in non-optimal loop structures? Naively, it seemed like it wouldn't require much extra computational cost. We could do a Zoom call.

ddahlbom commented 2 years ago

I just ended up writing more loops like this,

https://github.com/SunnySuite/Sunny.jl/blob/bc2df7f897f45e9b32a276747bdeddecfd3fed60/src/Systems.jl#L88-L96

I haven't tried just iterating over the first index on the inner loop. I was just following the typical iteration pattern that has been used, for example, in the energy and gradient calculations. If the site_infos are small enough (or we make them a struct of arrays), there may be no need to redo the indexing, i.e. we can just iterate over sites first and then cells.

Also, happy to chat on Zoom at some point. I finished writing a first pass of the tests we discussed, but I have a few things I'd like to run by you.

kbarros commented 2 years ago

So the fundamental problem here is that the fast index should be the index over unit cells, not the index of one atom within a unit cell? I agree that the other way sounds better, e.g., to fit well with simd
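A sketch of that ordering, assuming spins are stored with the unit-cell index first (the `rescale!` function and the per-site `κ` factors here are hypothetical, not Sunny's code):

```julia
# With s stored as s[cell, site], Julia's column-major layout makes the cell
# index contiguous in memory, so putting it on the inner loop is SIMD-friendly.
function rescale!(s::Matrix{Float64}, κ::Vector{Float64})
    @inbounds for site in axes(s, 2)      # outer: atoms within the unit cell
        κ_site = κ[site]
        @simd for cell in axes(s, 1)      # inner: unit cells, contiguous stride-1
            s[cell, site] *= κ_site
        end
    end
    return s
end
```

The per-site lookup is hoisted out of the inner loop, so the hot loop is a pure stride-1 scale that the compiler can vectorize.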


ddahlbom commented 2 years ago

Yes, exactly.

kbarros commented 2 years ago

Perhaps that overlaps with what Cole was doing in his experimental GPU branch?


kbarros commented 2 years ago

David and I were discussing on Slack that there are currently performance issues associated with allocations for the N×N matrix operations. He will test whether they can be resolved by using an appropriate `reinterpret` command, which would enable non-allocating `mul!` operations.
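The simplified sketch below conveys the idea under the assumption that coherent states live in one flat buffer; it uses `reshape` rather than `reinterpret` (which would be needed for static-vector storage), but either way the point is a zero-copy matrix view that permits in-place `mul!`:

```julia
using LinearAlgebra

# A flat buffer of SU(N) coherent states viewed as an N × nsites matrix
# (no copy), so applying a local N×N operator is one non-allocating multiply.
N, nsites = 3, 1000
Z  = rand(ComplexF64, N * nsites)    # all coherent states in one flat buffer
dZ = similar(Z)
H  = rand(ComplexF64, N, N)          # hypothetical local N×N operator

Zm  = reshape(Z,  N, nsites)         # views sharing memory with Z, dZ
dZm = reshape(dZ, N, nsites)
mul!(dZm, H, Zm)                     # fills dZ without allocating temporaries
```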

ddahlbom commented 2 years ago

The reinterpret + mul! approach worked fine after a bit of fiddling. SU(N) evolve! no longer causes any allocations (even for reinterpret). The speedup was about 4X. I'm hoping changing the fast index will get us more, but perhaps this is enough optimization for the moment. Let me know your thoughts.

I'm happy to merge at any point. It would be good to begin tying in the symmetry stuff.

kbarros commented 2 years ago

I say we merge, and keep polishing in another PR.

The speedup was about 4X

On Slack you mentioned an initial 10x slowdown relative to a non-allocating version, and a 25x slowdown relative to your optimized version. Does this mean there is still a 6x factor on the table? (Presumably it can be unlocked by changing the indexing?)

Also, for those not following the whole conversation, I assume these slowdowns are specific to the SU(N) dynamics. The spin-dipole stuff should not have changed in performance... perhaps we should benchmark that explicitly?

ddahlbom commented 2 years ago

I say we merge, and keep polishing in another PR.

Sounds like a good approach to me.

Does this mean there is still a 6x factor on the table? (Presumably it can be unlocked by changing the indexing?)

Yes, I think something like that. Of course my optimized code was much less general than Sunny, but I think we should aim for speedups in the 5X ballpark.

The spin-dipole stuff should not have changed in performance... perhaps we should benchmark that explicitly?

Any performance changes to the spin-dipole material would be due to the change in spin scaling, which involved adding loop structures like the one referenced above (for normalization operations). We can take a look at this before merging, if you'd like.

ddahlbom commented 2 years ago

I found that allocations were being caused by calls to the field! function, which had an N type parameter. I eliminated this parameter (which was unnecessary, since field! doesn't involve any SU(N) calculations). This change eliminated the allocations.

There is still some performance discrepancy for the SchrodingerMidpoint method. For a large system, it's a couple of percent. For a small system, it's 10-15%. I think this is entirely due to the increased complexity of the normalization function (modified to include site-specific scaling). This function ends up getting called much more often in SchrodingerMidpoint. I tried multiple loop structures for the normalization, and the results were always similar.

Here are some typical profiling results. The 1-step timings were taken with @btime, the 1,000-step timings with @time.

MAIN BRANCH, TYPICAL RESULT

Dims: (4, 4, 4)

Langevin integration
1 step:       282.236 μs (0 allocations: 0 bytes)
1_000 steps:      0.308247 seconds

Spherical midpoint integration
1 step:       682.701 μs (0 allocations: 0 bytes)
1_000 steps:      0.740743 seconds

Dims: (8, 8, 8)

Langevin integration
1 step:       2.164 ms (0 allocations: 0 bytes)
1_000 steps:      2.355835 seconds

Spherical midpoint integration
1 step:       6.574 ms (0 allocations: 0 bytes)
1_000 steps:      6.718775 seconds

SU(N) BRANCH, TYPICAL RESULT

Dims: (4, 4, 4)

Langevin integration
1 step:       289.213 μs (0 allocations: 0 bytes)
1_000 steps:      0.296489 seconds

Spherical midpoint integration
1 step:       767.111 μs (0 allocations: 0 bytes)
1_000 steps:      0.834345 seconds

Dims: (8, 8, 8)

Langevin integration
1 step:       2.201 ms (0 allocations: 0 bytes)
1_000 steps:      2.251839 seconds

Spherical midpoint integration
1 step:       6.594 ms (0 allocations: 0 bytes)
1_000 steps:      6.869419 seconds

ddahlbom commented 2 years ago

After much help from Kipton, it became clear that the speed difference had nothing to do with the new normalize! function; it was caused by my not running equivalent models in the two versions of Sunny. Specifically, I hadn't realized that the default spin magnitude in the main branch is 0.5 (implemented as a rescaling of interactions), so my parameters were effectively twice as large in the SU(N) version. This was equivalent to using a time step twice as large while running the same number of steps, so the midpoint method in SU(N) Sunny required more iterations to converge. This was the origin of the slowdown.
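The diagnosis can be sketched in a few lines: for pure precession $ds/dt = s \times B$, doubling the effective field at fixed step size advances the spin exactly as far as doubling the step size. (The explicit Euler step below is a toy illustration, not Sunny's integrator.)

```julia
using LinearAlgebra

# One Euler step of precession ds/dt = s × B. Doubling B at fixed Δt produces
# the same update as doubling Δt at fixed B, so over-scaled couplings behave
# like an over-large time step.
euler_step(s, B, Δt) = s .+ Δt .* (s × B)

s, B, Δt = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0], 0.01
```

Here `euler_step(s, 2 .* B, Δt)` and `euler_step(s, B, 2Δt)` agree, which is why the mismatched spin magnitude looked like a convergence problem in the midpoint solver.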

After controlling for spin magnitude correctly, I get identical trajectories and identical convergence characteristics from both versions. The speeds are now nearly indistinguishable. See below.

There's actually nothing new to commit at this point. The benchmarking tests were not being tracked.

Now doing some final checks of the thermostat (and its interaction with $\kappa$ scaling) for the generalized dynamics.

MAIN BRANCH, TYPICAL RESULT

Dims: (4, 4, 4)

Langevin integration
1 step:           287.301 μs (0 allocations: 0 bytes)
1_000 steps:      0.295145 seconds

Spherical midpoint integration
1 step:           726.629 μs (0 allocations: 0 bytes)
1_000 steps:      0.889243 seconds (1 allocation: 16 bytes)
Avg number of steps: 5.693

Dims: (8, 8, 8)

Langevin integration
1 step:           2.207 ms (0 allocations: 0 bytes)
1_000 steps:      2.385613 seconds

Spherical midpoint integration
1 step:           6.724 ms (0 allocations: 0 bytes)
1_000 steps:      6.919099 seconds (1 allocation: 16 bytes)
Avg number of steps: 5.989

SU(N) BRANCH, TYPICAL RESULT

Dims: (4, 4, 4)

Langevin integration
1 step:           286.830 μs (0 allocations: 0 bytes)
1_000 steps:      0.292789 seconds

Spherical midpoint integration
1 step:           706.295 μs (0 allocations: 0 bytes)
1_000 steps:      0.811342 seconds (1 allocation: 16 bytes)
Avg number of steps: 5.693

Dims: (8, 8, 8)

Langevin integration
1 step:           2.178 ms (0 allocations: 0 bytes)
1_000 steps:      2.296884 seconds

Spherical midpoint integration
1 step:           6.747 ms (0 allocations: 0 bytes)
1_000 steps:      6.797048 seconds (1 allocation: 16 bytes)
Avg number of steps: 5.989

kbarros commented 2 years ago

Can you please verify that the number of iterations in spherical midpoint is actually the same? A difference in effective convergence tolerance could explain this.


ddahlbom commented 2 years ago

Not sure if this was somehow sent by accident, but I believe this is from the middle of the discussion that was resolved before merging back in June. That is, we didn't merge before confirming we had identical behavior in both cases (in terms of numbers of iterations and identical trajectories). This is in fact the content of the message immediately following the one you quoted.

What is the context for bringing this back up?


kbarros commented 2 years ago

It was somehow sent by accident. I don't know how though.