evaleev opened 8 years ago
Assuming there is a memory leak, the most likely place is in the parallel-reduce and/or the contract-reduce code. For single-threaded execution, all computation is done in place. For threaded execution, threads will occasionally allocate extra temporary memory for tile contractions. Under conditions where there are many more result tiles than threads, the temporary memory allocation will happen rarely. In contrast, as the ratio of result tiles to threads goes down, threads are more likely to allocate extra temporary memory. (For example, with 4096 result tiles and 12 threads the ratio is large and temporaries should be rare; with a single result tile it is the opposite extreme.)
This seems unlikely, since the memory is managed by a shared_ptr in the Tensor class, but there could be a corner case that was missed. It might be useful to manually instrument allocation/deallocation in the tensor class.
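A minimal sketch of such instrumentation, assuming nothing about the actual TiledArray internals (TracedStorage, live_tensors, and report_tensor_counts are hypothetical names): a live count that keeps growing across fences would indicate a genuine leak rather than allocator caching.

// Sketch (hypothetical names, not actual TiledArray code): count
// constructions/destructions of tile storage.
#include <atomic>
#include <cstddef>
#include <cstdio>

static std::atomic<long> live_tensors{0};   // currently alive
static std::atomic<long> total_tensors{0};  // ever constructed

struct TracedStorage {
  double* data;
  explicit TracedStorage(std::size_t n) : data(new double[n]) {
    ++live_tensors;
    ++total_tensors;
  }
  ~TracedStorage() {
    delete[] data;
    --live_tensors;
  }
};

// Call at the same checkpoints as the memory traces below.
inline void report_tensor_counts(const char* where) {
  std::printf("%s: live=%ld total=%ld\n", where,
              live_tensors.load(), total_tensors.load());
}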
Update: profiling with jemalloc suggests the "leakage" is likely due to internal caching by the malloc implementation. jemalloc keeps twice as much memory as glibc malloc. However, the high-water mark for the virtual memory is slightly lower with jemalloc than with glibc malloc.
start
VmPeak: 60224 kB VmSize: 60224 kB VmLck: 0 kB VmHWM: 21232 kB VmRSS: 21232 kB VmData: 17316 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 144 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1108800 kB VmSize: 1108800 kB VmLck: 0 kB VmHWM: 1069876 kB VmRSS: 1069876 kB VmData: 1065892 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 2192 kB VmSwap: 0 kB
c=a*b
VmPeak: 1647424 kB VmSize: 1647424 kB VmLck: 0 kB VmHWM: 1608940 kB VmRSS: 1608940 kB VmData: 1604516 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 3252 kB VmSwap: 0 kB
stop
VmPeak: 1647424 kB VmSize: 1647424 kB VmLck: 0 kB VmHWM: 1608948 kB VmRSS: 22240 kB VmData: 1604516 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 156 kB VmSwap: 0 kB
start
VmPeak: 172908 kB VmSize: 172908 kB VmLck: 0 kB VmHWM: 23368 kB VmRSS: 23368 kB VmData: 130000 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 188 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1225580 kB VmSize: 1225580 kB VmLck: 0 kB VmHWM: 1076112 kB VmRSS: 1076112 kB VmData: 1182672 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 2244 kB VmSwap: 0 kB
c=a*b
VmPeak: 1770348 kB VmSize: 1770348 kB VmLck: 0 kB VmHWM: 1621348 kB VmRSS: 1621348 kB VmData: 1727440 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 3316 kB VmSwap: 0 kB
stop
VmPeak: 1770348 kB VmSize: 1770348 kB VmLck: 0 kB VmHWM: 1621364 kB VmRSS: 36212 kB VmData: 1727440 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 220 kB VmSwap: 0 kB
start
VmPeak: 172908 kB VmSize: 172908 kB VmLck: 0 kB VmHWM: 23368 kB VmRSS: 23368 kB VmData: 130000 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 184 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1397612 kB VmSize: 1397612 kB VmLck: 0 kB VmHWM: 1248184 kB VmRSS: 1248184 kB VmData: 1354704 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 2576 kB VmSwap: 0 kB
c=a*b
VmPeak: 1989484 kB VmSize: 1989484 kB VmLck: 0 kB VmHWM: 1840616 kB VmRSS: 1840616 kB VmData: 1946576 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 3740 kB VmSwap: 0 kB
stop
VmPeak: 1989484 kB VmSize: 1989484 kB VmLck: 0 kB VmHWM: 1840620 kB VmRSS: 68088 kB VmData: 1946576 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10508 kB VmPTE: 3740 kB VmSwap: 0 kB
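To separate bytes the application still holds from bytes the allocator merely caches, jemalloc's mallctl interface can be queried directly. A sketch (assumes the binary is linked against jemalloc; report_jemalloc is a hypothetical helper): if stats.allocated returns to its baseline at "stop" while stats.resident stays high, the growth is caching, not a leak.

// Sketch: query jemalloc's own accounting (requires linking with jemalloc).
#include <jemalloc/jemalloc.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

void report_jemalloc(const char* where) {
  // jemalloc caches its statistics; bump the epoch to refresh them first.
  uint64_t epoch = 1;
  size_t esz = sizeof(epoch);
  mallctl("epoch", &epoch, &esz, &epoch, esz);

  size_t allocated = 0, resident = 0, sz = sizeof(size_t);
  mallctl("stats.allocated", &allocated, &sz, nullptr, 0);  // bytes the app holds
  mallctl("stats.resident", &resident, &sz, nullptr, 0);    // bytes jemalloc maps
  std::printf("%s: allocated=%zu resident=%zu\n", where, allocated, resident);
}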
The 1-tile case seems to be the easiest place to unravel this. There should be no significant memory allocation here (the lone result tile is computed in a single gemm task, and there should be very few futures created). But, weirdly, using 12 threads instead of 1 thread increases the memory use. This is most noticeable with glibc malloc: VmSize after the gemm is 2000816 kB with 12 threads but only 1636744 kB with 1 thread. In fact, even just after allocating a and b there is already a substantial difference: 1279148 kB vs 1100924 kB. This difference, ~180 MB, is much smaller than the tile size (~600 MB) but is substantial.
Similar issues show up with jemalloc. I will profile mallocs with jemalloc and see what this uncovers.
Both glibc and jemalloc preallocate and cache memory for performance reasons. 180 MB is not that much on a per-thread basis: only 15 MB per thread. This could be used for small-buffer allocation or some other type of optimization. Also, it is well known that glibc malloc is not very good at managing fragmentation, so heap-allocated memory tends to grow the longer the application runs. jemalloc, on the other hand, is better at managing fragmentation, but uses a lot of memory for bookkeeping.
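A quick way to test the caching/fragmentation hypothesis under glibc is to force it to return cached free memory at a checkpoint (a sketch; malloc_trim and malloc_stats are glibc-specific): if VmData drops sharply after the trim, the unreleased memory was cached free blocks, not a leak.

// Sketch: ask glibc to return cached free memory to the OS at a checkpoint.
#include <malloc.h>  // glibc-specific: malloc_stats, malloc_trim

void trim_and_report() {
  malloc_stats();  // dump per-arena usage to stderr
  malloc_trim(0);  // release as much cached memory as possible
  malloc_stats();  // compare: what remains is genuinely in use (or stuck)
}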
For TA specifically, with only one tile per array there is no possibility of memory allocation for tile data. The only memory allocation is for tasks, and there should not be too many of those.
To test my assertions above, try increasing the number of iterations. The memory should grow with the number of iterations due to fragmentation with glibc, and stay constant (after a certain threshold) with jemalloc; see the sketch after the edit below.
Edit: there is no possibility that TA is allocating memory for temporary storage. Only the generated result tile will cause a memory allocation.
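A sketch of the iteration experiment proposed above, assuming current TiledArray idioms (TA::TArrayD and the expression syntax used by ta_dense); the tiling and iteration count are arbitrary choices for illustration:

// Sketch: repeat the contraction and watch memory once per iteration.
#include <tiledarray.h>

int main(int argc, char** argv) {
  TA::World& world = TA::initialize(argc, argv);
  {
    // two blocks per dimension keeps the sketch short; ta_dense's
    // 8192/128 tiling would work the same way
    TA::TiledRange trange = {{0, 4096, 8192}, {0, 4096, 8192}};
    TA::TArrayD a(world, trange), b(world, trange), c;
    a.fill(1.0);
    b.fill(1.0);
    for (int iter = 0; iter < 100; ++iter) {
      c("i,j") = a("i,k") * b("k,j");
      world.gop.fence();
      // print VmData from /proc/self/status here once per iteration:
      // steady growth suggests fragmentation (glibc) or a real leak,
      // a plateau suggests allocator caching.
    }
  }
  TA::finalize();
  return 0;
}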
I have not looked at the details of this issue, but this might be related: https://github.com/ValeevGroup/tiledarray/commit/7429d81248e6f012d1d39b177a1ad2ebcc326be0
TA_SUMMA_DEPTH does not correctly set the depth in SUMMA.
Intro
Instrumented simple memory tracing in ta_dense. All tests were run on one 12-core Linux box; TBB is OFF, MKL is ON. I execute ./ta_dense 8192 128 1, using the default malloc (i.e. not tbbmalloc or tcmalloc).
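The tracing amounts to printing the Vm* fields of /proc/self/status at each checkpoint; a self-contained sketch of such a helper (print_vm_status is a hypothetical name, not the actual instrumentation):

// Sketch: print the Vm* lines of /proc/self/status at a labeled checkpoint.
#include <fstream>
#include <iostream>
#include <string>

void print_vm_status(const std::string& label) {
  std::ifstream status("/proc/self/status");
  std::cout << label << '\n';
  for (std::string line; std::getline(status, line);)
    if (line.compare(0, 2, "Vm") == 0)  // VmPeak, VmSize, VmRSS, ...
      std::cout << line << '\n';
}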
1 thread, SUMMA depth = 1
Near-perfect memory cleanup (VmData), reproducible traces; the high-water virtual memory mark (VmPeak) "exactly" equals the memory needed for the 3 arrays.
run 1
start
VmPeak: 51896 kB VmSize: 51896 kB VmLck: 0 kB VmHWM: 11564 kB VmRSS: 11564 kB VmData: 11400 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 132 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1139088 kB VmSize: 1139088 kB VmLck: 0 kB VmHWM: 1099004 kB VmRSS: 1099004 kB VmData: 1098592 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 2252 kB VmSwap: 0 kB
c=a*b
VmPeak: 1670940 kB VmSize: 1669944 kB VmLck: 0 kB VmHWM: 1631112 kB VmRSS: 1630132 kB VmData: 1629448 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 3296 kB VmSwap: 0 kB
stop
VmPeak: 1670940 kB VmSize: 60672 kB VmLck: 0 kB VmHWM: 1631112 kB VmRSS: 21012 kB VmData: 20176 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 156 kB VmSwap: 0 kB
run 2
start
VmPeak: 51896 kB VmSize: 51896 kB VmLck: 0 kB VmHWM: 11568 kB VmRSS: 11568 kB VmData: 11400 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 132 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1139088 kB VmSize: 1139088 kB VmLck: 0 kB VmHWM: 1099008 kB VmRSS: 1099008 kB VmData: 1098592 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 2256 kB VmSwap: 0 kB
c=a*b
VmPeak: 1670940 kB VmSize: 1669944 kB VmLck: 0 kB VmHWM: 1631116 kB VmRSS: 1630136 kB VmData: 1629448 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 3300 kB VmSwap: 0 kB
stop
VmPeak: 1670940 kB VmSize: 60672 kB VmLck: 0 kB VmHWM: 1631116 kB VmRSS: 21016 kB VmData: 20176 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 156 kB VmSwap: 0 kB
run 3
start
VmPeak: 51896 kB VmSize: 51896 kB VmLck: 0 kB VmHWM: 11564 kB VmRSS: 11564 kB VmData: 11400 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 132 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1139088 kB VmSize: 1139088 kB VmLck: 0 kB VmHWM: 1099004 kB VmRSS: 1099004 kB VmData: 1098592 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 2256 kB VmSwap: 0 kB
c=a*b
VmPeak: 1670940 kB VmSize: 1669944 kB VmLck: 0 kB VmHWM: 1631112 kB VmRSS: 1630132 kB VmData: 1629448 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 3300 kB VmSwap: 0 kB
stop
VmPeak: 1670940 kB VmSize: 60672 kB VmLck: 0 kB VmHWM: 1631112 kB VmRSS: 21012 kB VmData: 20176 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 160 kB VmSwap: 0 kB
12 threads, SUMMA depth = 1
A significant amount of memory (>800 MB) is not released (VmData at "stop"), and VmPeak >> the size of the 3 matrices. However, VmHWM (the high-water mark for resident memory) looks fine. So the extra virtual memory used was never mapped into physical memory???
run 1
start
VmPeak: 164584 kB VmSize: 164584 kB VmLck: 0 kB VmHWM: 11664 kB VmRSS: 11664 kB VmData: 124088 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 168 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1971092 kB VmSize: 1971092 kB VmLck: 0 kB VmHWM: 1100112 kB VmRSS: 1100112 kB VmData: 1930596 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 2340 kB VmSwap: 0 kB
c=a*b
VmPeak: 2039812 kB VmSize: 2038360 kB VmLck: 0 kB VmHWM: 1635968 kB VmRSS: 1632852 kB VmData: 1997864 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 3396 kB VmSwap: 0 kB
stop
VmPeak: 2039812 kB VmSize: 889636 kB VmLck: 0 kB VmHWM: 1635968 kB VmRSS: 21740 kB VmData: 849140 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 1148 kB VmSwap: 0 kB
run 2
start
VmPeak: 164584 kB VmSize: 164584 kB VmLck: 0 kB VmHWM: 11664 kB VmRSS: 11664 kB VmData: 124088 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 172 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1971224 kB VmSize: 1971224 kB VmLck: 0 kB VmHWM: 1100200 kB VmRSS: 1100200 kB VmData: 1930728 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 2344 kB VmSwap: 0 kB
c=a*b
VmPeak: 2033540 kB VmSize: 2033384 kB VmLck: 0 kB VmHWM: 1635772 kB VmRSS: 1632664 kB VmData: 1992888 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 3400 kB VmSwap: 0 kB
stop
VmPeak: 2033540 kB VmSize: 889420 kB VmLck: 0 kB VmHWM: 1635772 kB VmRSS: 23052 kB VmData: 848924 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 1164 kB VmSwap: 0 kB
run 3
start
VmPeak: 164584 kB VmSize: 164584 kB VmLck: 0 kB VmHWM: 11668 kB VmRSS: 11668 kB VmData: 124088 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 172 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1971224 kB VmSize: 1971224 kB VmLck: 0 kB VmHWM: 1100176 kB VmRSS: 1100176 kB VmData: 1930728 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 2340 kB VmSwap: 0 kB
c=a*b
VmPeak: 2018980 kB VmSize: 2018824 kB VmLck: 0 kB VmHWM: 1635192 kB VmRSS: 1632056 kB VmData: 1978328 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 3404 kB VmSwap: 0 kB
stop
VmPeak: 2018980 kB VmSize: 889732 kB VmLck: 0 kB VmHWM: 1635192 kB VmRSS: 21616 kB VmData: 849236 kB VmStk: 96 kB VmExe: 11284 kB VmLib: 10156 kB VmPTE: 1192 kB VmSwap: 0 kB
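One plausible explanation for virtual-but-not-resident growth (an assumption, not verified here): every extra thread reserves address space up front. glibc gives each thread its own malloc arena (64 MB of reserved heap on 64-bit), and each thread stack reserves ~8 MB by default; both count toward VmSize/VmData but are mostly never touched, so VmHWM stays flat. Consistent with this, the 12-thread runs already start with VmData ~124 MB versus ~11 MB for 1 thread. A glibc-specific sketch of a test that caps the arena count:

// Sketch: cap glibc's per-thread malloc arenas to test whether they account
// for the unreleased address space. Must run before threads start allocating.
// Equivalent from the shell: export MALLOC_ARENA_MAX=1.
#include <malloc.h>

void cap_malloc_arenas() {
  mallopt(M_ARENA_MAX, 1);  // force all threads to share one arena
}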
1 thread, default SUMMA depth
same as with SUMMA depth = 1: no leakage
2 threads, default SUMMA depth
some "leakage"
run 1
start
VmPeak: 62144 kB VmSize: 62144 kB VmLck: 0 kB VmHWM: 11568 kB VmRSS: 11568 kB VmData: 21644 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 136 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1214744 kB VmSize: 1214744 kB VmLck: 0 kB VmHWM: 1099644 kB VmRSS: 1099644 kB VmData: 1174244 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 2264 kB VmSwap: 0 kB
c=a*b
VmPeak: 1704760 kB VmSize: 1697052 kB VmLck: 0 kB VmHWM: 1632012 kB VmRSS: 1629996 kB VmData: 1656552 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 3312 kB VmSwap: 0 kB
stop
VmPeak: 1704760 kB VmSize: 135236 kB VmLck: 0 kB VmHWM: 1632012 kB VmRSS: 20952 kB VmData: 94736 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 288 kB VmSwap: 0 kB
4 threads, default SUMMA depth
more "leakage" than with 2 threads
run 1
start
VmPeak: 82632 kB VmSize: 82632 kB VmLck: 0 kB VmHWM: 11588 kB VmRSS: 11588 kB VmData: 42132 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 136 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1366040 kB VmSize: 1366040 kB VmLck: 0 kB VmHWM: 1100028 kB VmRSS: 1100028 kB VmData: 1325540 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 2276 kB VmSwap: 0 kB
c=a*b
VmPeak: 1715268 kB VmSize: 1715112 kB VmLck: 0 kB VmHWM: 1633676 kB VmRSS: 1631828 kB VmData: 1674612 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 3332 kB VmSwap: 0 kB
stop
VmPeak: 1715268 kB VmSize: 287060 kB VmLck: 0 kB VmHWM: 1633676 kB VmRSS: 22224 kB VmData: 246560 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 548 kB VmSwap: 0 kB
8 threads, default SUMMA depth
more "leakage" than with 4 threads
run 1
start
VmPeak: 123616 kB VmSize: 123616 kB VmLck: 0 kB VmHWM: 11624 kB VmRSS: 11624 kB VmData: 83116 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 160 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1668636 kB VmSize: 1668636 kB VmLck: 0 kB VmHWM: 1100088 kB VmRSS: 1100088 kB VmData: 1628136 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 2312 kB VmSwap: 0 kB
c=a*b
VmPeak: 2078540 kB VmSize: 1947640 kB VmLck: 0 kB VmHWM: 1634656 kB VmRSS: 1632172 kB VmData: 1907140 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 3364 kB VmSwap: 0 kB
stop
VmPeak: 2078540 kB VmSize: 586404 kB VmLck: 0 kB VmHWM: 1634656 kB VmRSS: 20184 kB VmData: 545904 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 1060 kB VmSwap: 0 kB
12 threads, default SUMMA depth
same as with 12 threads and SUMMA depth = 1: extensive leakage
run 1
start
VmPeak: 164588 kB VmSize: 164588 kB VmLck: 0 kB VmHWM: 11648 kB VmRSS: 11648 kB VmData: 124088 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 172 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1971096 kB VmSize: 1971096 kB VmLck: 0 kB VmHWM: 1100112 kB VmRSS: 1100112 kB VmData: 1930596 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 2344 kB VmSwap: 0 kB
c=a*b
VmPeak: 2031880 kB VmSize: 2031724 kB VmLck: 0 kB VmHWM: 1637608 kB VmRSS: 1633988 kB VmData: 1991224 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 3408 kB VmSwap: 0 kB
stop
VmPeak: 2031880 kB VmSize: 889612 kB VmLck: 0 kB VmHWM: 1637608 kB VmRSS: 22728 kB VmData: 849112 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 1180 kB VmSwap: 0 kB
1 thread, default SUMMA depth, tilesize=8192 (each matrix has only 1 tile):
same results as with tiled matrices
run 1
start
VmPeak: 51956 kB VmSize: 51956 kB VmLck: 0 kB VmHWM: 11596 kB VmRSS: 11596 kB VmData: 11456 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 136 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1100924 kB VmSize: 1100924 kB VmLck: 0 kB VmHWM: 1060636 kB VmRSS: 1060636 kB VmData: 1060424 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 2184 kB VmSwap: 0 kB
c=a*b
VmPeak: 1636744 kB VmSize: 1626180 kB VmLck: 0 kB VmHWM: 1596796 kB VmRSS: 1586320 kB VmData: 1585680 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 3220 kB VmSwap: 0 kB
stop
VmPeak: 1636744 kB VmSize: 53304 kB VmLck: 0 kB VmHWM: 1596796 kB VmRSS: 13448 kB VmData: 12804 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 148 kB VmSwap: 0 kB
12 threads, default SUMMA depth, tilesize=8192 (each matrix has only 1 tile)
Some "leakage" still ...
run 1
start
VmPeak: 164644 kB VmSize: 164644 kB VmLck: 0 kB VmHWM: 11696 kB VmRSS: 11696 kB VmData: 124144 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 172 kB VmSwap: 0 kB
allocated a and b
VmPeak: 1279148 kB VmSize: 1279148 kB VmLck: 0 kB VmHWM: 1060744 kB VmRSS: 1060744 kB VmData: 1238648 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 2236 kB VmSwap: 0 kB
c=a*b
VmPeak: 2011380 kB VmSize: 2000816 kB VmLck: 0 kB VmHWM: 1596932 kB VmRSS: 1586464 kB VmData: 1960316 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 3284 kB VmSwap: 0 kB
stop
VmPeak: 2011380 kB VmSize: 427940 kB VmLck: 0 kB VmHWM: 1596932 kB VmRSS: 13592 kB VmData: 387440 kB VmStk: 96 kB VmExe: 11288 kB VmLib: 10156 kB VmPTE: 204 kB VmSwap: 0 kB
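To attribute the remaining ~380 MB of VmData at "stop" to specific mappings (malloc arenas, thread stacks, etc.), one could scan /proc/self/smaps for large anonymous regions. A sketch (dump_large_mappings is a hypothetical helper; the 10 MB threshold is arbitrary):

// Sketch: list mappings larger than a threshold from /proc/self/smaps.
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

void dump_large_mappings(std::size_t min_kb = 10 * 1024) {
  std::ifstream smaps("/proc/self/smaps");
  std::string line, header;
  while (std::getline(smaps, line)) {
    // Header lines look like "7f12...-7f13... rw-p ..."; attribute lines
    // look like "Size:  65536 kB". Remember the most recent header.
    const auto dash = line.find('-');
    if (!line.empty() && std::isxdigit(static_cast<unsigned char>(line[0])) &&
        dash != std::string::npos && dash < line.find(' ')) {
      header = line;
    } else if (line.compare(0, 5, "Size:") == 0) {
      std::size_t kb = 0;
      std::istringstream(line.substr(5)) >> kb;
      if (kb >= min_kb) std::cout << header << "\n  " << line << '\n';
    }
  }
}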
Analysis
Possibilities: