I ran into this issue when testing cp2k/cp2k#581, which is mainly based on DBCSR tensors. After contraction of large tensors, DBCSR multiplication of much smaller matrices takes much more time than with the reference method that does not use tensors. This must be connected to the memory pools, because the inefficiency disappears when I call `dbcsr_clear_mempools()` after all tensor contractions are done.
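For clarity, a minimal sketch of the call order that avoids the slowdown. Only `dbcsr_clear_mempools()` itself is taken from this report; the module name `dbcsr_api` and the wrapper subroutine are my assumptions, and the contraction/multiplication calls are left as comments because their real signatures take many arguments:

```fortran
SUBROUTINE contract_then_multiply()
   ! Assumed import location; dbcsr_clear_mempools() takes no arguments.
   USE dbcsr_api, ONLY: dbcsr_clear_mempools
   IMPLICIT NONE
   ! ... large tensor contractions (dbcsr_t_contract) go here ...
   ! Drop the pooled buffers sized for the large contractions:
   CALL dbcsr_clear_mempools()
   ! ... multiplications of the much smaller matrices (dbcsr_multiply) ...
END SUBROUTINE contract_then_multiply
```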
Here is some timing output:
Timings without calling `dbcsr_clear_mempools()`:

```
SUBROUTINE                    CALLS  ASD        SELF TIME          TOTAL TIME
                            MAXIMUM       AVERAGE   MAXIMUM   AVERAGE   MAXIMUM
multiply_cannon_multrec        3051 13.6   15.324    21.590    26.355    31.443
dbcsr_mm_multrec_finalize       280 14.9    0.003     0.004    11.031    15.310
dbcsr_mm_sched_finalize         280 15.9   10.981    15.263    10.981    15.263
```
Timings with calling `dbcsr_clear_mempools()`:

```
SUBROUTINE                    CALLS  ASD        SELF TIME          TOTAL TIME
                            MAXIMUM       AVERAGE   MAXIMUM   AVERAGE   MAXIMUM
multiply_cannon_multrec        3051 13.6   11.731    12.871    11.781    12.919
dbcsr_mm_multrec_finalize       280 14.9    0.002     0.003     0.049     0.125
dbcsr_mm_sched_finalize         280 15.9    0.001     0.101     0.001     0.101
```
I'm testing the code with pure MPI (no GPUs and 1 thread per MPI rank) on the mc partition of Piz Daint. I can share the CP2K input once cp2k/cp2k#581 is merged.
Another obvious issue with memory pools is that memory stays reserved for DBCSR matrix multiplication even though it may be needed for other operations. For instance, the memory bottleneck of the DBCSR tensor implementation is the copy/redistribution operations after matrix multiplication, but only when the memory pools are not cleared after matrix multiplication.
Probably because some operations use the full size of the memory buffer, whereas they should be using the size of the data? This might be a good test case to figure out.
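If that hypothesis is right, the cost pattern would look like the following minimal sketch (plain Fortran, no DBCSR; all names and sizes are illustrative): an operation that touches the full pooled buffer scales with the capacity left over from the largest contraction, while one that touches only the live data does not.

```fortran
PROGRAM pool_size_sketch
   IMPLICIT NONE
   REAL, ALLOCATABLE :: pool(:)
   INTEGER :: data_size

   ! A large tensor contraction grows the pooled buffer ...
   ALLOCATE (pool(10000000))
   ! ... but the subsequent small multiplication only needs this much:
   data_size = 1000

   ! Suspected pattern: zeroing/scanning the full buffer costs O(capacity),
   ! which would explain why dbcsr_mm_sched_finalize stays slow for small
   ! matrices until the pools are cleared.
   pool = 0.0

   ! Expected pattern: touching only the live data costs O(data), which is
   ! what clearing the pools effectively restores, since the next
   ! allocation is data-sized.
   pool(1:data_size) = 0.0
END PROGRAM pool_size_sketch
```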