devinamatthews / aquarius

Aquarius is a parallel quantum chemistry package built on the Cyclops Tensor Framework which provides high-performance structured tensor operations. Aquarius is primarily focused on iterative methods such as CC, CI, and EOMCC.
BSD 3-Clause "New" or "Revised" License

1eints performance #9

Closed solomonik closed 8 years ago

solomonik commented 8 years ago

When running w20 cc-pVDZ ccsd on 1024 processes with 6 threads per process on Edison, 1eints takes 8.4 seconds; in March 2013 it took 0.36 seconds. Subsequently, for w25 cc-pVDZ ccsd on 4096 processes with 6 threads per process on Edison, 1eints takes 1836 seconds, which seems completely unreasonable.

solomonik commented 8 years ago

Even for w3 cc-pVDZ using 2048 processes, it takes >100 seconds.

jeffhammond commented 8 years ago

w3/ccpVDZ on 2048 processes is completely unreasonable. That runs trivially on my laptop with NWChem. Do we think 1eints is suffering because of large-scale parallelism or something else? It would be good to have a profile for a smaller number of processes.

Assuming this issue can be detected at 2 processes, you now have your best argument in favor of continuous integration ;-)

solomonik commented 8 years ago

For w3 cc-pVDZ on 64 processes 1eints takes 0.236 seconds on Edison, so there is almost certainly a latency/synchronization performance bug.

devinamatthews commented 8 years ago

This is definitely a bug. Sam and I have noticed this for a while and one of us will look into it.

On 1/11/16 7:48 AM, Edgar Solomonik wrote:

w3 cc-pVDZ on 64 processes takes 0.236 seconds on Edison, so there is almost certainly a latency/synchronization performance bug.


solomonik commented 8 years ago

This scaling bottleneck in 1eints actually seems to be due to overhead in CTF initialization (topology creation), which happens the first time AQ defines a CTF tensor, and that first definition occurs during 1eints. I wrote a bunch of profiling code to figure this out and will integrate it into Aquarius. You will see the timing if you build CTF with -DPROFILE, but it may make sense for AQ to time it natively to avoid confusion, since CTF start-up cost is nonzero at the moment. I will implement a more efficient algorithm for topology creation than the naive scheme currently used in CTF. I had not been profiling this part and missed it until now because initialization has most often not been included in timings.
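
As a rough illustration of timing start-up cost natively, the sketch below triggers a library's lazy one-time initialization explicitly and records it under its own label, so the cost is not charged to whichever phase happens to create the first object. This is a minimal sketch under stated assumptions, not Aquarius or CTF code: `PhaseTimer`, `LazyLibrary`, and `profile_one_electron_phase` are hypothetical names, and `LazyLibrary` is only a stand-in for a library with a first-use set-up cost.

```cpp
#include <chrono>
#include <functional>
#include <map>
#include <string>

// Hypothetical helper (not part of Aquarius or CTF): accumulates wall-clock
// time and call counts per named phase.
class PhaseTimer {
public:
    // Runs fn and adds its wall-clock duration to the named phase.
    void time(const std::string& phase, const std::function<void()>& fn) {
        const auto start = std::chrono::steady_clock::now();
        fn();
        const auto stop = std::chrono::steady_clock::now();
        seconds_[phase] += std::chrono::duration<double>(stop - start).count();
        calls_[phase] += 1;
    }

    // Total seconds recorded for a phase, or 0 if the phase was never timed.
    double seconds(const std::string& phase) const {
        const auto it = seconds_.find(phase);
        return it == seconds_.end() ? 0.0 : it->second;
    }

    // Number of times a phase was timed, or 0 if it was never timed.
    int calls(const std::string& phase) const {
        const auto it = calls_.find(phase);
        return it == calls_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, double> seconds_;
    std::map<std::string, int> calls_;
};

// Hypothetical stand-in for a library whose one-time set-up runs lazily,
// the first time an object is created.
struct LazyLibrary {
    bool initialized = false;
    int init_count = 0;
    int tensors = 0;

    // One-time set-up; later calls are no-ops.
    void initialize() {
        if (!initialized) {
            initialized = true;
            ++init_count;
        }
    }

    // Creating an object triggers the set-up if it has not run yet.
    int make_tensor() {
        initialize();
        return ++tensors;
    }
};

// Times the set-up under its own label before timing the phase that would
// otherwise absorb it.
PhaseTimer profile_one_electron_phase(LazyLibrary& lib) {
    PhaseTimer timer;
    timer.time("library_init", [&] { lib.initialize(); });
    timer.time("1eints", [&] {
        lib.make_tensor();
        lib.make_tensor();
    });
    return timer;
}
```

In an MPI run one would presumably also synchronize the ranks (for example with `MPI_Barrier`) before taking each timestamp, so that waiting on slower ranks is not mistaken for work in the timed phase.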