Closed: solomonik closed this issue 8 years ago
Even for w3 cc-pVDZ on 2048 processes, 1eints takes more than 100 seconds.
w3/ccpVDZ on 2048 processes is completely unreasonable. That runs trivially on my laptop with NWChem. Do we think 1eints is suffering because of large-scale parallelism or something else? It would be good to have a profile for a smaller number of processes.
Assuming this issue can be detected at 2 processes, you now have your best argument in favor of continuous integration ;-)
For w3 cc-pVDZ on 64 processes 1eints takes 0.236 seconds on Edison, so there is almost certainly a latency/synchronization performance bug.
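To put a rough number on that, here is a minimal sketch that fits a power law t ∝ p^k to the two w3 cc-pVDZ timings quoted in this thread. The helper name `scaling_exponent` is made up for illustration. It also assumes the two runs are comparable, although the machine for the 2048-process run is not stated, and it treats the ">100 seconds" figure as exactly 100, so the result is only a lower bound.

```python
import math

def scaling_exponent(p_small, t_small, p_large, t_large):
    """Return k such that t scales like p**k between two (processes, seconds) points."""
    return math.log(t_large / t_small) / math.log(p_large / p_small)

# Timings quoted in this thread for w3 cc-pVDZ:
#   64 processes   -> 0.236 s (Edison)
#   2048 processes -> more than 100 s (machine not stated; assumed comparable)
# Using 100.0 for the second point makes k a lower bound.
k = scaling_exponent(64, 0.236, 2048, 100.0)
print(f"1eints time grows at least like p^{k:.2f}")
```

An exponent well above 1 means the cost grows faster than the process count itself. A fixed amount of integral work would not behave that way, so the growth points to a per-process setup or synchronization cost.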
This is definitely a bug. Sam and I have noticed this for a while and one of us will look into it.
On 1/11/16 7:48 AM, Edgar Solomonik wrote:
> w3 cc-pVDZ on 64 processes takes 0.236 seconds on Edison, so there is almost certainly a latency/synchronization performance bug.
This scaling bottleneck in 1eints actually seems to be due to overhead in CTF initialization (topology creation), which happens the first time AQ defines a CTF tensor, and that first definition occurs during 1eints. I wrote a bunch of profiling code to figure this out and will integrate it into Aquarius. You will see the timing if you build CTF with -DPROFILE, but it may make sense for AQ to time it natively to avoid confusion, since CTF start-up cost is nonzero at the moment. I will implement a more efficient algorithm for topology creation than the naive scheme currently used in CTF. I had not been profiling this part and missed it until now because initialization has usually not been included in timings.
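For anyone following along, here is a minimal, language-agnostic sketch (written in Python, not actual Aquarius or CTF code) of why a one-time setup cost gets charged to whichever phase first touches the library, and how timing the first call separately exposes it. `LazyLibrary`, `make_tensor`, and `timed` are hypothetical names, and the sleep only simulates the topology setup.

```python
import time

class LazyLibrary:
    """Hypothetical stand-in for a library that does one-time setup on first use."""

    def __init__(self, setup_cost_s):
        self._setup_cost_s = setup_cost_s
        self._initialized = False

    def make_tensor(self, n):
        if not self._initialized:
            # Simulated one-time cost (stands in for topology creation).
            time.sleep(self._setup_cost_s)
            self._initialized = True
        return [0.0] * n

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

lib = LazyLibrary(setup_cost_s=0.05)
_, first = timed(lib.make_tensor, 1000)   # includes the one-time setup
_, second = timed(lib.make_tensor, 1000)  # steady-state cost only
print(f"first call: {first:.4f} s, second call: {second:.4f} s")
```

Reporting the first-call time as its own line item, rather than folding it into the enclosing phase, keeps a start-up cost from being mistaken for a slowdown in that phase.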
When running w20 cc-pVDZ ccsd on 1024 processes with 6 threads per process on Edison, 1eints takes 8.4 seconds; in March 2013, it took 0.36 seconds. For w25 cc-pVDZ ccsd on 4096 processes with 6 threads per process on Edison, 1eints takes 1836 seconds, which seems completely unreasonable.