Crosscheck with Paper, complex matrix optimizations

ffreyer commented 2 years ago

This pr will add a crosscheck/example comparing to https://arxiv.org/pdf/1912.08848.pdf.

TODO

[x] finish complex linalg optimizations
[x] implement model
[x] get matching observables (t5 = 0, L = 8 for simplicity)
- [x] Z - spin susceptibility (figure 1d red)
- [x] Superfluid Stiffness (figure 2b)
- [x] s-wave pairing correlation (figure 3a, solid blue)
- [x] charge susceptibility (figure 3a, solid red)
[x] document the crosscheck as an example
[x] update other documentation

General Changes:

switch to $\Delta(i) \Delta^\dagger(k) + \Delta^\dagger(i) \Delta(k)$ for pairing correlation. This matches the paper above (removing terms that are 0 in DQMC) and some others (e.g. the triangular crosscheck).
fix some issues with loading moved BufferedConfigRecorder files and saving uninitialized BufferedConfigRecorder
better heuristic for counting the number of NNs (actually counting equal length distances)
add warning if T is not (approximately) hermitian
rename dagger to swapop (see #135)
add superfluid_stiffness based on this paper

Complex Linear Algebra

The model from the reference paper uses complex hoppings. To be able to run it efficiently this pr implements all the neglected complex linear algebra. It also cleans up BlockDiagonal so that it works with ComplexF64 and can be combined with complex StructArrays. Closes #78

cleanup/fix BlockDiagonal for generic matrices (i.e. complex)
- let all vmul! (and alike) methods fall back onto other vmul! methods. This performs a little bit worse, but makes things generic. Maybe @inline would help?
- remove some resulting redundancies
- generalize most methods to arbitrarily sized blocks (though typically still square)
finish implementing Complex StructArrays
- closes #78
- note that BLAS will use threads by default while our LoopVectorization methods do not. With multiple threads BLAS will outperform, with BLAS.set_num_threads(1) our methods win. I assume in the one simulation per core setting (i.e. on a supercomputer) the single threaded BLAS performance is relevant.

Lattice Iterator revamp/changes

The reference paper calculates the current current correlations with unsynchronized directions. However it's not just every bond up to the K-th farthest. It's every bond that has a hopping associated to it without reversed bonds (these are explicitly included in the formula pre Wicks) and without bonds with dir[bond][1] == 0. This gets rid of quite a few bonds. With out current iterators and NN and NNN bonds, for example, we'd include 9 directions (on-site, 4x NN, 4x NNN) but only really need 3 (+x NN, 2 +x NNN). And because it'S usnyhcronized the number of directions goes in squared, so we do 9²/3² = 9 times the work.

To fix this I've revamped the lattice iterator system here. The main goal was to allow passing directional indices directly, e.g. EachLocalQuadByDistance([2, 5, 9]). I also wanted to stop relying on only one of each lattice iterator type being created at the start of the simulation to memory usage down. So I made the following changes:

There is now a LatticeIteratorCache attached to DQMC. Perhaps later I will put this together with the lattice. The cache contains a dict that gets filled with maps such as dir_idx -> (src, trg) which are accessible via cache[Dir2SrcTrg()] and can be created via push!(cache, Dir2SrcTrg(), lattice). Every lattice iterator will create the maps it needs (no duplicated) and holds references to it for iteration. So now the difference between 1 and 1000 EachLocalQuadByDistance iterators is just 3000 integers and 3000 references.

The second change I made was that every iterator now has a template version and a runtime version. The template is light, used as a field in DQMCMeasurements and can be saved without extra care. The runtime version (same type name with _ in front) follows from the template and will construct the maps it needs during creation.

ffreyer commented 2 years ago

https://carstenbauer.github.io/MonteCarlo.jl/dev/examples/triangular_Hubbard/ clearly does not \Delta^\dagger \Delta + \Delta |delta^\dagger for its pairing correlation. It fits way worse than \Delta \Delta^\dagger. Needs to be adjusted if I go through with switching the default from pc_kernel to pc_combined_kernel. The reference for this PR clearly does use the combined kernel though...

ffreyer commented 2 years ago

I benchmarked the new model with StructArrays - thermalization is twice as fast (roughly 0.5s -> 0.25s in my test), but measurements are just as slow (2s -> 2s, occupation and pairing susceptibility).

I see two potential reasons for this:

The real and imaginary parts of the Greens function are not indexed in separate loops, which is not cache friendly.
Lattice iterators themselves are probably not cache friendly. They're essentially semi-random lists of indices. But given that they are static per simulation there is probably some optimization potential here as well...

ffreyer commented 2 years ago

Making use of StructArrays in measurements is tough atm. There is no left vs right multiplier like in mul!, matrix elements could technically be multiplied by i or conjugates could be taken. So instead I'm just converting complex StructArrays back to plain complex matrices. This isn't optimal, but still a lot better... (~2s -> ~1.33s)

carstenbauer / MonteCarlo.jl