Memory requirements of the sparse mode are generally smaller than those of the
FFT mode. However, in MPI mode the sparse mode requires storing a couple of
"full" (non-distributed) arrays with information for all dipoles, such as
position_full and arg_full. As a result, memory requirements are unfavorable
for large MPI runs in sparse mode.
This can be fixed by the following procedure. Each processor stores only its
part of position (p_i), or more generally dipole coordinates, and argvec (x_i),
i=1..n (n - number of processors). The processors communicate in pairs (similar
to the block-transpose in the FFT part) - n-1 communication cycles in total
(n cycles when n is odd, since each processor then sits out one cycle). A pair
of processors i and j interacts as follows:
1) exchange p_i and p_j (storing them in buffers)
2) knowing p_i and p_j, processor i computes y_j=A_ji(p_j,p_i).x_i and stores
it in a buffer; processor j performs the analogous operation
3) exchange y_j and y_i; the received vector is added to the result vector on
the corresponding processor.
At some point each processor should also compute A_ii(p_i,p_i).x_i and add it
to the result. If n is even, this can be done in a separate cycle (for all
processors simultaneously); otherwise (odd n), each processor does it during
the cycle in which it has no exchange partner.
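The scheme above can be sketched in a single process, replacing the actual MPI exchanges with direct access to the other "processor's" data. This is only an illustration of the pairing schedule and block arithmetic, not of the real ADDA code: the function names, the use of the circle method for scheduling, and the NumPy block representation are all assumptions for the sketch.

```python
import numpy as np

def round_robin_schedule(n):
    """Circle-method schedule: a list of cycles, each cycle a list of (i, j)
    pairs. For odd n a dummy slot is added; the processor paired with the
    dummy has no partner in that cycle (and would compute its diagonal
    block then, as described above)."""
    players = list(range(n))
    if n % 2 == 1:
        players.append(-1)  # dummy slot for odd n
    m = len(players)
    cycles = []
    for _ in range(m - 1):
        pairs = [(players[k], players[m - 1 - k]) for k in range(m // 2)]
        cycles.append(sorted((min(i, j), max(i, j))
                             for i, j in pairs if -1 not in (i, j)))
        # rotate all positions except the first (standard circle method)
        players = [players[0], players[-1]] + players[1:-1]
    return cycles

def distributed_matvec(A_blocks, x_parts):
    """Simulate the pairwise exchange: 'processor' i holds only x_i and
    accumulates its slice of y = A.x. A_blocks[i][j] plays the role of
    A_ij(p_i,p_j), computable once p_j has been received."""
    n = len(x_parts)
    # diagonal contribution A_ii.x_i (the "separate cycle" for even n)
    y = [A_blocks[i][i] @ x_parts[i] for i in range(n)]
    for cycle in round_robin_schedule(n):
        for i, j in cycle:
            # steps 2-3: i computes y_j = A_ji.x_i, j computes y_i = A_ij.x_j,
            # then the results are "exchanged" and accumulated
            y[j] = y[j] + A_blocks[j][i] @ x_parts[i]
            y[i] = y[i] + A_blocks[i][j] @ x_parts[j]
    return np.concatenate(y)
```

Each off-diagonal block pair (i,j) is visited exactly once per matvec, so the result matches the full product A.x while each simulated processor only ever needs its own x_i plus one buffered p_j at a time.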
Overall, the memory requirement of each processor will be proportional to the
number of dipoles on that processor (perfect scaling). The cost is increased
communication: two global cycles instead of one AllGather. In terms of
operation count, AllGather is similar to one global cycle, but in practice it
is probably significantly faster due to possible optimizations by the MPI
backend.
Original issue reported on code.google.com by yurkin on 5 Feb 2013 at 3:15