Another option (which is probably better) is to do all communication in the first and last chunks and make the inner step purely computational.
Suppose we use nodal coupling. In that case each processor should end up with about the same number of nodes in both data partitionings. Hence the amount of communication from process A to process B may be arbitrarily sized, but the total amount of data each processor sends and receives should be roughly equal across all processors. This implies that we should do all the communication in the first and third chunks so that all three chunks are individually balanced.
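A minimal sketch of that structure, with purely hypothetical function names standing in for the actual chunks:

```cpp
// Hypothetical outline of the three chunks: the first and third chunks do
// their (individually load-balanced) communication plus local work, while
// the middle chunk is purely computational and involves no MPI traffic.
void
do_step()
{
  first_chunk();  // communication + computation
  second_chunk(); // computation only, no communication
  third_chunk();  // communication + computation
}
```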
While that's all true, it looks like this doesn't fully resolve the issue - maybe the older version is actually faster? I need to figure out a way to wait on a general set of vector updates.
I think the answer is to do the scatters in this way but to do an `MPI_Waitall()` on all of the scatters simultaneously: that way communication will have 100% finished before proceeding to the function that does the actual computations.
I thought about it a little more - the way to achieve this is probably to add
```cpp
class TransactionBase
{
public:
  virtual std::vector<MPI_Request>
  get_outstanding_requests() const;
};
```
and something similar for `Scatter` so that I can do an `MPI_Waitall()` on all communications simultaneously. This shouldn't be too bad - I just need to use deal.II's `Partitioner` class directly instead of going through `LA::d::V`.
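A rough sketch of how that interface could hang together, assuming a hypothetical `Scatter` that stores the requests created when it starts its nonblocking sends and receives (nothing here is existing deal.II code; only the standard MPI types and `MPI_Waitall()` itself are real):

```cpp
#include <mpi.h>

#include <memory>
#include <vector>

// Proposed base class: every communication object exposes its pending
// MPI requests so a caller can wait on all of them at once.
class TransactionBase
{
public:
  virtual ~TransactionBase() = default;

  virtual std::vector<MPI_Request>
  get_outstanding_requests() const = 0;
};

// Hypothetical Scatter: it would start nonblocking communication (e.g. via
// Utilities::MPI::Partitioner) and keep the resulting requests.
class Scatter : public TransactionBase
{
public:
  virtual std::vector<MPI_Request>
  get_outstanding_requests() const override
  {
    return requests;
  }

private:
  // filled by the (omitted) routine that posts the sends and receives
  std::vector<MPI_Request> requests;
};

// Wait on every outstanding request of every transaction with a single
// MPI_Waitall() so that all communication has finished afterwards.
void
wait_on_all(const std::vector<std::unique_ptr<TransactionBase>> &transactions)
{
  std::vector<MPI_Request> all_requests;
  for (const auto &transaction : transactions)
    {
      const std::vector<MPI_Request> requests =
        transaction->get_outstanding_requests();
      all_requests.insert(all_requests.end(), requests.begin(), requests.end());
    }

  if (all_requests.size() > 0)
    {
      const int ierr = MPI_Waitall(static_cast<int>(all_requests.size()),
                                   all_requests.data(),
                                   MPI_STATUSES_IGNORE);
      (void)ierr; // real code should check this, e.g. with AssertThrowMPI
    }
}
```

One subtlety with returning the requests by value: `MPI_Waitall()` on the copied handles still completes the underlying operations, but the handles stored inside each `Scatter` are not reset to `MPI_REQUEST_NULL`, so each object would still need to clear or otherwise invalidate its own copies afterwards.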
Doing
```cpp
// wait for both sends and receives to complete, even though only
// receives are really necessary. this gives (much) better performance
AssertDimension(ghost_targets().size() + import_targets().size(),
                requests.size());
if (requests.size() > 0)
  {
    int ierr =
      MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);
    AssertThrowMPI(ierr);

    // deliberately redundant second call: the first MPI_Waitall() should set
    // every completed request to MPI_REQUEST_NULL, so this one is a no-op
    ierr =
      MPI_Waitall(requests.size(), requests.data(), MPI_STATUSES_IGNORE);
    AssertThrowMPI(ierr);
  }
requests.resize(0);
```
in `dealii::Utilities::MPI::Partitioner` works, so I'm pretty sure this plan will work (the requests should be set to `MPI_REQUEST_NULL` and then ignored in the second call) - I might want to double-check with another MPI implementation (MPICH) to be sure.
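That behavior is what the MPI standard guarantees: `MPI_Waitall()` deallocates the completed requests and sets the handles to `MPI_REQUEST_NULL`, and null requests are simply ignored by a later wait. A minimal standalone check, independent of deal.II (a hypothetical test program, not part of the project):

```cpp
#include <mpi.h>

#include <cassert>

int
main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // each rank sends its rank to the next rank in a ring
  const int send_value = rank;
  int       recv_value = -1;
  const int dest       = (rank + 1) % size;
  const int source     = (rank + size - 1) % size;

  MPI_Request requests[2];
  MPI_Irecv(&recv_value, 1, MPI_INT, source, 0, MPI_COMM_WORLD, &requests[0]);
  MPI_Isend(&send_value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD, &requests[1]);

  // the first MPI_Waitall() completes both operations and sets the handles
  // to MPI_REQUEST_NULL ...
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
  assert(requests[0] == MPI_REQUEST_NULL);
  assert(requests[1] == MPI_REQUEST_NULL);

  // ... so a second MPI_Waitall() on the same array only sees null requests
  // and returns immediately
  MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);

  assert(recv_value == source);

  MPI_Finalize();
  return 0;
}
```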
It might be better to do all communication outside of the functions which actually compute.