Another option (which is probably better) is to do all communication in the first and last chunks and make the inner step purely computational.
Suppose we use nodal coupling. In that case each processor should end up with about the same number of nodes in both data partitionings. Hence the amount of communication from process A to process B may be arbitrarily sized, but the total amount of data each processor sends and receives should be roughly equal across all processors. This implies that we should do all the communication in the first and third chunks so that all three chunks are individually balanced.
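A minimal sketch of that structure, with purely hypothetical function names standing in for the actual chunks:

```cpp
// Hypothetical outline of the three chunks: the first and third chunks do
// their (individually load-balanced) communication plus local work, while
// the middle chunk is purely computational and involves no MPI traffic.
void
do_step()
{
  first_chunk();  // communication + computation
  second_chunk(); // computation only, no communication
  third_chunk();  // communication + computation
}
```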
While that's all true, it looks like this doesn't fully resolve the issue - maybe the older version is actually faster? I need to figure out a way to wait on a general set of vector updates.
I think the answer is to do the scatters in this way but to do an `MPI_Waitall()` on all of the scatters simultaneously: that way communication will have 100% finished before proceeding to the function that does the actual computations.
I thought about it a little more - the way to achieve this is probably to add
```cpp
class TransactionBase
{
public:
  virtual std::vector<MPI_Request>
  get_outstanding_requests() const;
};
```
and something similar for `Scatter` so that I can do an `MPI_Waitall()` on all communications simultaneously. This shouldn't be too bad - I just need to use deal.II's `Partitioner` class directly instead of going through `LA::d::V`.
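A rough sketch of how that interface could hang together, assuming a hypothetical `Scatter` that stores the requests created when it starts its nonblocking sends and receives (nothing here is existing deal.II code; only the standard MPI types and `MPI_Waitall()` itself are real):

```cpp
#include <mpi.h>

#include <memory>
#include <vector>

// Proposed base class: every communication object exposes its pending
// MPI requests so a caller can wait on all of them at once.
class TransactionBase
{
public:
  virtual ~TransactionBase() = default;

  virtual std::vector<MPI_Request>
  get_outstanding_requests() const = 0;
};

// Hypothetical Scatter: it would start nonblocking communication (e.g. via
// Utilities::MPI::Partitioner) and keep the resulting requests.
class Scatter : public TransactionBase
{
public:
  virtual std::vector<MPI_Request>
  get_outstanding_requests() const override
  {
    return requests;
  }

private:
  // filled by the (omitted) routine that posts the sends and receives
  std::vector<MPI_Request> requests;
};

// Wait on every outstanding request of every transaction with a single
// MPI_Waitall() so that all communication has finished afterwards.
void
wait_on_all(const std::vector<std::unique_ptr<TransactionBase>> &transactions)
{
  std::vector<MPI_Request> all_requests;
  for (const auto &transaction : transactions)
    {
      const std::vector<MPI_Request> requests =
        transaction->get_outstanding_requests();
      all_requests.insert(all_requests.end(), requests.begin(), requests.end());
    }

  if (all_requests.size() > 0)
    {
      const int ierr = MPI_Waitall(static_cast<int>(all_requests.size()),
                                   all_requests.data(),
                                   MPI_STATUSES_IGNORE);
      (void)ierr; // real code should check this, e.g. with AssertThrowMPI
    }
}
```

One subtlety with returning the requests by value: `MPI_Waitall()` on the copied handles still completes the underlying operations, but the handles stored inside each `Scatter` are not reset to `MPI_REQUEST_NULL`, so each object would still need to clear or otherwise invalidate its own copies afterwards.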
Doing
```cpp
// wait for both sends and receives to complete, even though only
// receives are really necessary. this gives (much) better performance
AssertDimension(ghost_targets().size() + import_targets().size(),
                requests.size());
if (requests.size() > 0)
  {
    int ierr =
      MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);
    AssertThrowMPI(ierr);

    // deliberately redundant second call: the first MPI_Waitall() should set
    // every completed request to MPI_REQUEST_NULL, so this one is a no-op
    ierr =
      MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);
    AssertThrowMPI(ierr);
  }
requests.resize(0);
```
in `dealii::Utilities::MPI::Partitioner` works, so I'm pretty sure this plan will work (the requests should be set to `MPI_REQUEST_NULL` and then ignored in the second call) - I might want to double-check with another MPI implementation (MPICH) to be sure.
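That behavior is what the MPI standard guarantees: `MPI_Waitall()` deallocates the completed requests and sets the handles to `MPI_REQUEST_NULL`, and null requests are simply ignored by a later wait. A minimal standalone check, independent of deal.II (a hypothetical test program, not part of the project):

```cpp
#include <mpi.h>

#include <cassert>

int
main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // each rank sends its rank to the next rank in a ring
  const int send_value = rank;
  int       recv_value = -1;
  const int dest       = (rank + 1) % size;
  const int source     = (rank + size - 1) % size;

  MPI_Request requests[2];
  MPI_Irecv(&recv_value, 1, MPI_INT, source, 0, MPI_COMM_WORLD, &requests[0]);
  MPI_Isend(&send_value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &requests[1]);

  // the first MPI_Waitall() completes both operations and sets the handles
  // to MPI_REQUEST_NULL ...
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
  assert(requests[0] == MPI_REQUEST_NULL);
  assert(requests[1] == MPI_REQUEST_NULL);

  // ... so a second MPI_Waitall() on the same array only sees null requests
  // and returns immediately
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

  assert(recv_value == source);

  MPI_Finalize();
  return 0;
}
```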
It might be better to do all communication outside of the functions which actually compute.