Closed: lkdvos closed this 6 months ago
Might help for #111
Have you experimented with exci_transfers? I recall doing quite a few unnecessary computations there. There is also the way I construct the excitation environments, which goes row by row through the MPO. In practice this can often be parallelized, but it requires being smarter. (A similar thing can be done for the groundstate environments.)
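To make the "row by row" remark concrete, here is a minimal sketch of the kind of parallelization meant above. Everything here (`transfer_row`, `level_environments`, plain `Matrix` stand-ins for the actual tensors) is illustrative and not MPSKit's actual API: for a block-triangular MPO, the environment of a row depends only on the rows above it, so all rows within one dependency level are independent and can be dispatched to Julia threads instead of being processed strictly in sequence.

```julia
using LinearAlgebra

# Stand-in for applying one MPO row's transfer operator to an environment.
# In the real code this would be a tensor contraction, not a matrix product.
transfer_row(E, O) = O' * E * O

# Hypothetical sketch: compute the environment contributions of all rows in
# one dependency level in parallel, since they do not depend on each other.
function level_environments(E0::Matrix{Float64}, rowops::Vector{Matrix{Float64}})
    envs = Vector{Matrix{Float64}}(undef, length(rowops))
    Threads.@threads for i in eachindex(rowops)  # independent rows in parallel
        envs[i] = transfer_row(E0, rowops[i])
    end
    return envs
end
```

The "being smarter" part is exactly the scheduling: rows in different levels still have to be processed in dependency order, so only the rows within a level can be farmed out like this.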
I haven't been focusing much on parallelization over the different blocks at all, as I mostly wanted to move that into BlockTensorKit and just experiment with different backends. I must also say that our recent experiments with multithreading TensorKit have somewhat discouraged me from trying to multithread at that level: as soon as the load is not balanced, it is quite hard to outperform just sending all threads to BLAS. One of the things I've been meaning to try is actually the opposite direction: making the hamiltonian dense for applications of the transfers etc. (fully dense for effective hamiltonians, or row-wise/column-wise so the linear problems in the environments can still be solved). I have the feeling that, at least for semi-local hamiltonians, the gain from BLAS's level of optimisation is so large that the sparseness of the hamiltonian doesn't really buy that much.
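As a rough illustration of the "make it dense" idea (a sketch with made-up names, not actual MPSKit code): instead of looping over the nonzero blocks of a block-sparse operator and issuing many small matrix multiplications, one can copy the blocks into a single dense matrix up front and let one large BLAS gemm do all the work, trading wasted flops on zero blocks for BLAS-level throughput.

```julia
using LinearAlgebra

# `blocks[i, j]` holds the (i, j) block of a block-sparse operator, or
# `nothing` if that block vanishes; `d` is the (uniform) block size.

# Sparse application: many small matmuls, one per nonzero block.
function apply_blockwise(blocks, x::Vector{Float64}, d::Int)
    n = size(blocks, 1)
    y = zeros(n * d)
    for i in 1:n, j in 1:n
        B = blocks[i, j]
        B === nothing && continue
        @views y[(i-1)*d+1:i*d] .+= B * x[(j-1)*d+1:j*d]
    end
    return y
end

# Dense application: assemble once, then a single large BLAS gemm.
function apply_dense(blocks, x::Vector{Float64}, d::Int)
    n = size(blocks, 1)
    D = zeros(n * d, n * d)
    for i in 1:n, j in 1:n
        B = blocks[i, j]
        B === nothing && continue
        D[(i-1)*d+1:i*d, (j-1)*d+1:j*d] .= B
    end
    return D * x
end
```

Both give the same result; the dense path wins whenever the blocks are small enough, or the operator full enough, that one big gemm beats many tiny ones (the assembly can of course be cached across applications).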
I conceptually really dislike it, but I remember doing exactly that for finite DMRG (blocking the sparse MPO into a dense one), and the gain was quite ridiculous even for sparse and large hamiltonians. I also don't understand why MKL+Julia threading would be outperformed by pure MKL threading. Shouldn't there be some memory-transfer bottlenecks appearing when you only use MKL's threading?
I think this is actually exactly what is prohibiting MKL+Julia from being competitive. One of the optimizations that BLAS relies on quite heavily is making effective use of the CPU caches, which effectively lets it do less work: it needs to load entries from actual RAM less often. This is known to be quite a dominant factor for mul!, especially when the matrices aren't too large. In that sense, trying to do multiple matrix multiplications on different cores invalidates some of that work, at the very least for the L3 cache, which is typically shared.
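The trade-off being discussed can be sketched as the two threading strategies below. This is an illustration, not a benchmark or a claim about which wins: give BLAS all the threads and run the multiplications serially, or pin BLAS to one thread and parallelize over the multiplications with Julia threads, which is where the shared-L3 contention described above comes in.

```julia
using LinearAlgebra

# Strategy 1: serial loop, BLAS gets every thread for each individual mul!.
function serial_blas!(Cs, As, Bs)
    BLAS.set_num_threads(Sys.CPU_THREADS)
    for i in eachindex(Cs)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end

# Strategy 2: single-threaded BLAS, Julia threads parallelize over the muls.
# For small matrices the threads now compete for the shared L3 cache.
function threaded_julia!(Cs, As, Bs)
    BLAS.set_num_threads(1)
    Threads.@threads for i in eachindex(Cs)
        mul!(Cs[i], As[i], Bs[i])
    end
    return Cs
end
```

Which one comes out ahead depends on matrix size, load balance, and cache pressure, which is exactly why neither is a universal answer here.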
I must admit that I am also not too happy about this, as it is really a "if you can't be smart, just be strong" kind of solution, but if it just works then it's hard to avoid :grin:
I will merge this for now; it seems like this is strictly an improvement, and we can make further improvements later.
It's definitely an improvement. After experimenting a bit, though, it's not at all clear that using MKL over OpenBLAS is an improvement. This might obviously be due to bad choices of num_threads.
Some attempts at speeding up the quasiparticle algorithms: