Aquarius is a parallel quantum chemistry package built on the Cyclops Tensor Framework (CTF), which provides high-performance structured tensor operations. Aquarius is primarily focused on iterative methods such as CC, CI, and EOMCC.
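For context, a minimal sketch of the CTF index-string interface that Aquarius's coupled-cluster code is built on. This is hypothetical illustration code, not code from Aquarius; the tensor names, sizes, and the particular contraction are made up.

#include <ctf.hpp>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    {
        CTF::World dw(MPI_COMM_WORLD);

        // Hypothetical orbital-space sizes, purely for illustration.
        int nv = 40, no = 10;
        int lens[4] = {nv, nv, no, no};  // 4-index amplitude-like tensor T2(a,b,i,j)
        int syms[4] = {AS, NS, AS, NS};  // packed antisymmetric in (a,b) and in (i,j)

        CTF::Tensor<double> T2(4, lens, syms, dw);
        CTF::Tensor<double> W (4, lens, syms, dw);
        CTF::Matrix<double> F (nv, nv, NS, dw);

        T2.fill_random(-1.0, 1.0);
        F.fill_random(-1.0, 1.0);

        // One CC-style term in CTF's index notation; CTF chooses the data
        // distribution and the parallel contraction algorithm.
        W["abij"] += F["ac"] * T2["cbij"];
    }
    MPI_Finalize();
    return 0;
}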
Memory usage spikes during the calculation of aomoints, outside of the CTF write calls (which are now buffered based on available memory). From my email on 2.10.2015:
When running w15-cc-pVDZ on 384 cores with 1 process/core (which used to work in the past), my debugging output shows the following (explanation first, then output below).
Here tot_mem_used keeps track of how much memory is used to store all CTF tensors on processor 0. The code successfully executes the write on line 920 of aomoints.cxx but crashes immediately thereafter. Previously the crash was inside the write; with buffering, the write itself now succeeds. However, the memory in use is for some reason very high when the write occurs. It seems that Aquarius allocates something of size 1.5 GB (in the output below, the "used" figure jumps from about 5.29E+08 to 2.07E+09 bytes at the failing write), while the write itself is of size (max over all processes) 0.27 GB. My guess is that aomoints then runs out of memory right after the write, but I am not sure (getting a valgrind trace would take forever for this).
...
tot_mem_used = 4.61970E+08/5.20561E+08, proc_bytes_available() = 1.59376E+09
tot_mem_used = 4.68126E+08/5.26718E+08, proc_bytes_available() = 1.58760E+09
Performing write of 102600 (max 102600) elements (max mem 1.6E+06) in 1 parts 1.58550E+09 memory available, 5.28815E+08 used
max received elements is 270, mine are 270
Completed write of 102600 elements
Performing write of 102600 (max 102600) elements (max mem 1.6E+06) in 1 parts 1.58468E+09 memory available, 5.29636E+08 used
max received elements is 270, mine are 270
Completed write of 102600 elements
Performing write of 21600 (max 21600) elements (max mem 3.5E+05) in 1 parts 1.58543E+09 memory available, 5.28884E+08 used
max received elements is 60, mine are 60
Completed write of 21600 elements
Performing write of 21600 (max 21600) elements (max mem 3.5E+05) in 1 parts 1.58526E+09 memory available, 5.29057E+08 used
max received elements is 60, mine are 60
Completed write of 21600 elements
... // printfs from aomoints.cxx:920 here, most processes writing about 1.7M elements.
Performing write of 0 (max 17099072) elements (max mem 2.7E+08) in 4088 parts 4.87430E+07 memory available, 2.06557E+09 used
Completed write of 0 elements
// segfault here
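For reference, a minimal sketch of the kind of chunked write described above. This is not the actual Aquarius/CTF buffering code: the helper name, the bytes_available argument, and the one-quarter safety factor are assumptions for illustration; only the Tensor<double>::write(npair, idx, data) call and the requirement that all ranks make the same number of (collective) write calls are taken from CTF and from the log above.

#include <ctf.hpp>
#include <mpi.h>
#include <algorithm>
#include <cstdint>

// Hypothetical helper: write (index, value) pairs into a CTF tensor in chunks
// so that the buffers inside each write() stay within a fraction of free memory.
static void buffered_write(CTF::Tensor<double>& T,
                           int64_t npair,
                           const int64_t* idx,
                           const double* data,
                           int64_t bytes_available)
{
    // Rough per-pair cost of the write buffers: one global index plus one value.
    const int64_t bytes_per_pair = sizeof(int64_t) + sizeof(double);
    // Spend only a fraction (here 1/4, an arbitrary choice) of free memory per part.
    int64_t max_pairs = std::max<int64_t>(1, (bytes_available / 4) / bytes_per_pair);

    // write() is collective, so every rank must issue the same number of calls,
    // even ranks with nothing left to write (cf. "write of 0 ... in 4088 parts").
    int64_t nparts = std::max<int64_t>(1, (npair + max_pairs - 1) / max_pairs);
    MPI_Allreduce(MPI_IN_PLACE, &nparts, 1, MPI_INT64_T, MPI_MAX, MPI_COMM_WORLD);

    int64_t chunk = (npair + nparts - 1) / nparts;  // 0 if this rank has no pairs
    for (int64_t p = 0; p < nparts; p++)
    {
        int64_t begin = std::min(npair, p * chunk);
        int64_t end   = std::min(npair, (p + 1) * chunk);
        T.write(end - begin, idx + begin, data + begin);
    }
}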
aomoints is also clearly using more memory than before.
For instance, in 2014, for w20 cc-pVDZ on 256 nodes of Edison with 1024 processes and 6 threads/process:
Wed Mar 26 02:05:31 2014: Starting task: aomoints
Wed Mar 26 02:06:17 2014: Finished task: aomoints in 46.402 s
Wed Mar 26 02:06:17 2014: Task: aomoints achieved 5203.736 Gflops/sec
and now:
Sun Jan 10 20:25:49 2016: Starting task: aomoints
Sun Jan 10 20:26:46 2016: Finished task: aomoints in 56.492 s
Sun Jan 10 20:26:46 2016: Task: aomoints achieved 1.133 Gflops/sec