I ran into this issue when testing cp2k/cp2k#581, which is mainly based on DBCSR tensors. After contraction of large tensors, DBCSR multiplication of much smaller matrices takes much more time than with the reference method that does not use tensors. This must be connected to the memory pools, because the inefficiency disappears when I call `dbcsr_clear_mempools()` after all tensor contractions are done.
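For clarity, a minimal sketch of the call order that avoids the slowdown. Only `dbcsr_clear_mempools()` itself is taken from this report; the module name `dbcsr_api` and the wrapper subroutine are my assumptions, and the contraction/multiplication calls are left as comments because their real signatures take many arguments:

```fortran
SUBROUTINE contract_then_multiply()
   ! Assumed import location; dbcsr_clear_mempools() takes no arguments.
   USE dbcsr_api, ONLY: dbcsr_clear_mempools
   IMPLICIT NONE
   ! ... large tensor contractions (dbcsr_t_contract) go here ...
   ! Drop the pooled buffers sized for the large contractions:
   CALL dbcsr_clear_mempools()
   ! ... multiplications of the much smaller matrices (dbcsr_multiply) ...
END SUBROUTINE contract_then_multiply
```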
Here is some timing output:
Timings without calling `dbcsr_clear_mempools()`:

```
SUBROUTINE                    CALLS  ASD        SELF TIME          TOTAL TIME
                            MAXIMUM       AVERAGE   MAXIMUM   AVERAGE   MAXIMUM
multiply_cannon_multrec        3051 13.6   15.324    21.590    26.355    31.443
dbcsr_mm_multrec_finalize       280 14.9    0.003     0.004    11.031    15.310
dbcsr_mm_sched_finalize         280 15.9   10.981    15.263    10.981    15.263
```
Timings with calling `dbcsr_clear_mempools()`:

```
SUBROUTINE                    CALLS  ASD        SELF TIME          TOTAL TIME
                            MAXIMUM       AVERAGE   MAXIMUM   AVERAGE   MAXIMUM
multiply_cannon_multrec        3051 13.6   11.731    12.871    11.781    12.919
dbcsr_mm_multrec_finalize       280 14.9    0.002     0.003     0.049     0.125
dbcsr_mm_sched_finalize         280 15.9    0.001     0.101     0.001     0.101
```
I'm testing the code with pure MPI (no GPUs and 1 thread per MPI rank) on the mc partition of Piz Daint. I can share the CP2K input once cp2k/cp2k#581 is merged.
Another obvious issue with memory pools is that memory stays reserved for DBCSR matrix multiplication even though it may be needed for other operations. For instance, the memory bottleneck of the DBCSR tensor implementation is the copy/redistribution operations after matrix multiplication, but only when the memory pools are not cleared after matrix multiplication.
Probably because some operations use the full size of the memory buffer, whereas they should be using the size of the data? This might be a good test case to figure out.
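If that hypothesis is right, the cost pattern would look like the following minimal sketch (plain Fortran, no DBCSR; all names and sizes are illustrative): an operation that touches the full pooled buffer scales with the capacity left over from the largest contraction, while one that touches only the live data does not.

```fortran
PROGRAM pool_size_sketch
   IMPLICIT NONE
   REAL, ALLOCATABLE :: pool(:)
   INTEGER :: data_size

   ! A large tensor contraction grows the pooled buffer ...
   ALLOCATE (pool(10000000))
   ! ... but the subsequent small multiplication only needs this much:
   data_size = 1000

   ! Suspected pattern: zeroing/scanning the full buffer costs O(capacity),
   ! which would explain why dbcsr_mm_sched_finalize stays slow for small
   ! matrices until the pools are cleared.
   pool = 0.0

   ! Expected pattern: touching only the live data costs O(data), which is
   ! what clearing the pools effectively restores, since the next
   ! allocation is data-sized.
   pool(1:data_size) = 0.0
END PROGRAM pool_size_sketch
```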