Closed: lkdvos closed this 6 months ago
Might help for #111
Have you experimented with exci_transfers? I recall doing quite a few unnecessary computations there. There is also the way I construct the excitation environments, which goes row by row through the MPO. In practice this can often be parallelized, but it requires being smarter. (A similar thing can be done for the groundstate environments.)
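To make the "row by row" remark concrete, here is a minimal sketch of the kind of parallelization meant above. Everything here (`transfer_row`, `level_environments`, plain `Matrix` stand-ins for the actual tensors) is illustrative and not MPSKit's actual API: for a block-triangular MPO, the environment of a row depends only on the rows above it, so all rows within one dependency level are independent and can be dispatched to Julia threads instead of being processed strictly in sequence.

```julia
using LinearAlgebra

# Stand-in for applying one MPO row's transfer operator to an environment.
# In the real code this would be a tensor contraction, not a matrix product.
transfer_row(E, O) = O' * E * O

# Hypothetical sketch: compute the environment contributions of all rows in
# one dependency level in parallel, since they do not depend on each other.
function level_environments(E0::Matrix{Float64}, rowops::Vector{Matrix{Float64}})
    envs = Vector{Matrix{Float64}}(undef, length(rowops))
    Threads.@threads for i in eachindex(rowops)  # independent rows in parallel
        envs[i] = transfer_row(E0, rowops[i])
    end
    return envs
end
```

The "being smarter" part is exactly the scheduling: rows in different levels still have to be processed in dependency order, so only the rows within a level can be farmed out like this.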
I haven't been focusing much on parallelization over the different blocks at all, as I mostly wanted to move that into BlockTensorKit and just experiment with different backends. I must also say that our recent experiments with multithreading TensorKit have somewhat discouraged me from trying to multithread at that level: as soon as the load is not balanced, it is quite hard to outperform just sending all threads to BLAS. One of the things I've been meaning to try is actually the opposite direction: making the hamiltonian dense for applications of the transfers etc. (fully dense for effective hamiltonians, or row-wise/column-wise so the linear problems in the environments can still be solved). I have the feeling that, at least for semi-local hamiltonians, the gain from BLAS's level of optimisation is so large that the sparseness of the hamiltonian doesn't really buy that much.
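As a rough illustration of the "make it dense" idea (a sketch with made-up names, not actual MPSKit code): instead of looping over the nonzero blocks of a block-sparse operator and issuing many small matrix multiplications, one can copy the blocks into a single dense matrix up front and let one large BLAS gemm do all the work, trading wasted flops on zero blocks for BLAS-level throughput.

```julia
using LinearAlgebra

# `blocks[i, j]` holds the (i, j) block of a block-sparse operator, or
# `nothing` if that block vanishes; `d` is the (uniform) block size.

# Sparse application: many small matmuls, one per nonzero block.
function apply_blockwise(blocks, x::Vector{Float64}, d::Int)
    n = size(blocks, 1)
    y = zeros(n * d)
    for i in 1:n, j in 1:n
        B = blocks[i, j]
        B === nothing && continue
        @views y[(i-1)*d+1:i*d] .+= B * x[(j-1)*d+1:j*d]
    end
    return y
end

# Dense application: assemble once, then a single large BLAS gemm.
function apply_dense(blocks, x::Vector{Float64}, d::Int)
    n = size(blocks, 1)
    D = zeros(n * d, n * d)
    for i in 1:n, j in 1:n
        B = blocks[i, j]
        B === nothing && continue
        D[(i-1)*d+1:i*d, (j-1)*d+1:j*d] .= B
    end
    return D * x
end
```

Both give the same result; the dense path wins whenever the blocks are small enough, or the operator full enough, that one big gemm beats many tiny ones (the assembly can of course be cached across applications).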
I conceptually really dislike it, but I remember doing exactly that for finite DMRG (blocking the sparse MPO into a dense one), and the gain was quite ridiculous even for sparse and large hamiltonians. I also don't understand why MKL+Julia threading would be outperformed by pure MKL threading. Shouldn't there be some memory-transfer bottlenecks appearing when you only use MKL's threading?
I think this is actually exactly what is prohibiting MKL+Julia from being competitive. One of the optimizations that BLAS relies on quite heavily is making effective use of the CPU caches, which effectively lets it do less work: it needs to load entries from actual RAM less often. This is known to be quite a dominant factor for mul!, especially when the matrices aren't too large. In that sense, trying to do multiple matrix multiplications on different cores invalidates some of that work, at the very least for the L3 cache, which is typically shared.
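The trade-off being discussed can be sketched as the two threading strategies below. This is an illustration, not a benchmark or a claim about which wins: give BLAS all the threads and run the multiplications serially, or pin BLAS to one thread and parallelize over the multiplications with Julia threads, which is where the shared-L3 contention described above comes in.

```julia
using LinearAlgebra

# Strategy 1: serial loop, BLAS gets every thread for each individual mul!.
function serial_blas!(Cs, As, Bs)
    BLAS.set_num_threads(Sys.CPU_THREADS)
    for i in eachindex(Cs)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end

# Strategy 2: single-threaded BLAS, Julia threads parallelize over the muls.
# For small matrices the threads now compete for the shared L3 cache.
function threaded_julia!(Cs, As, Bs)
    BLAS.set_num_threads(1)
    Threads.@threads for i in eachindex(Cs)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end
```

Which one comes out ahead depends on matrix size, load balance, and cache pressure, which is exactly why neither is a universal answer here.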
I must admit that I am also not too happy about this, as it is really a "if you can't be smart, just be strong" kind of solution, but if it just works then it's hard to avoid :grin:
I will merge this for now; it seems like this is strictly an improvement, and we can make further improvements later.
It's definitely an improvement. After experimenting a bit, though, it's not at all clear that using MKL over OpenBLAS is an improvement. This might obviously be due to bad choices of num_threads.
Some attempts at speeding up the quasiparticle algorithms: