Memory requirements of the sparse mode are generally smaller than those of the
FFT mode. However, in MPI mode the sparse mode requires storing a couple of
"full" (non-distributed) arrays with information for all dipoles, such as
position_full and arg_full. As a result, memory requirements are unfavorable
for large MPI runs in sparse mode.
This can be fixed by the following procedure. Each processor stores only its
part of position (p_i), or more generally dipole coordinates, and argvec (x_i),
i=1..n (n - number of processors). The processors communicate in pairs (similar
to the block-transpose in the FFT part) - n-1 communication cycles in total
(n cycles when n is odd, since each processor then sits out one cycle). A pair
of processors i and j interacts as follows:
1) exchange p_i and p_j (storing them in buffers)
2) knowing p_i and p_j, processor i computes y_j=A_ji(p_j,p_i).x_i and stores
it in a buffer; processor j performs the analogous operation
3) exchange y_j and y_i; the received vector is added to the result vector on
the corresponding processor.
At some point each processor should also compute A_ii(p_i,p_i).x_i and add it
to the result. If n is even, this can be done in a separate cycle (for all
processors simultaneously); otherwise (odd n), each processor does it during
the cycle in which it has no exchange partner.
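The scheme above can be sketched in a single process, replacing the actual MPI exchanges with direct access to the other "processor's" data. This is only an illustration of the pairing schedule and block arithmetic, not of the real ADDA code: the function names, the use of the circle method for scheduling, and the NumPy block representation are all assumptions for the sketch.

```python
import numpy as np

def round_robin_schedule(n):
    """Circle-method schedule: a list of cycles, each cycle a list of (i, j)
    pairs. For odd n a dummy slot is added; the processor paired with the
    dummy has no partner in that cycle (and would compute its diagonal
    block then, as described above)."""
    players = list(range(n))
    if n % 2 == 1:
        players.append(-1)  # dummy slot for odd n
    m = len(players)
    cycles = []
    for _ in range(m - 1):
        pairs = [(players[k], players[m - 1 - k]) for k in range(m // 2)]
        cycles.append(sorted((min(i, j), max(i, j))
                             for i, j in pairs if -1 not in (i, j)))
        # rotate all positions except the first (standard circle method)
        players = [players[0], players[-1]] + players[1:-1]
    return cycles

def distributed_matvec(A_blocks, x_parts):
    """Simulate the pairwise exchange: 'processor' i holds only x_i and
    accumulates its slice of y = A.x. A_blocks[i][j] plays the role of
    A_ij(p_i,p_j), computable once p_j has been received."""
    n = len(x_parts)
    # diagonal contribution A_ii.x_i (the "separate cycle" for even n)
    y = [A_blocks[i][i] @ x_parts[i] for i in range(n)]
    for cycle in round_robin_schedule(n):
        for i, j in cycle:
            # steps 2-3: i computes y_j = A_ji.x_i, j computes y_i = A_ij.x_j,
            # then the results are "exchanged" and accumulated
            y[j] = y[j] + A_blocks[j][i] @ x_parts[i]
            y[i] = y[i] + A_blocks[i][j] @ x_parts[j]
    return np.concatenate(y)
```

Each off-diagonal block pair (i,j) is visited exactly once per matvec, so the result matches the full product A.x while each simulated processor only ever needs its own x_i plus one buffered p_j at a time.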
Overall, the memory requirement of each processor will be proportional to the
number of dipoles on that processor (perfect scaling). The cost is increased
communication: two global cycles instead of one AllGather. In terms of
operation count, AllGather is similar to one global cycle, but in practice it
is probably significantly faster due to possible optimizations by the MPI
backend.
Original issue reported on code.google.com by yurkin on 5 Feb 2013 at 3:15