The multiplications and traces carried out in the final stage of the computation are highly inefficient.
This is because the final stage consists of many thousands of multiplications of very small matrices, which should ideally be reshuffled into a few multiplications of very large matrices. To run efficiently, the matrices involved should have side lengths of at least a few hundred elements.
Another option might be to implement this final multiplication by hand. The factors would then be stored in a custom array-of-structures-of-arrays (AoSoA) layout rather than as Eigen matrices, with the real and imaginary parts split into separate arrays, as is commonly done to achieve high vectorisation efficiency for complex multiplications.
Such a blocked custom data structure would allow explicit prefetches to be issued (load instructions for the next small matrix interleaved with the computation on the current one), and the computation could be further optimised to maximise cache reuse within a single thread. The latter would require pre-computing which factors are needed most frequently and sorting the loop in decreasing order of this frequency.
It is conceivable that this kind of optimisation could speed up this part of the computation by a factor of 10 to 20.
A further possible optimisation would be a different threading setup for this last part of the computation, such that frequently used data can be reused by threads that share an L1 cache.
The factor generation, while not ideal, is reasonably efficient, particularly in comparison with the final factor multiplication when there are many momentum combinations. This is because the side lengths are generally large enough that an isolated Eigen multiplication is not much slower than a more optimal hand-written kernel.