I guess this is r2g. This is everyone's "speak now or forever hold your peace" moment.
On the page https://nwchemex-project.github.io/ParallelZone/developer/design/runtime_view.html#proposed-apis the proposed API is templated on RAM. I didn't see that anywhere else. Is this information outdated?
I am a bit confused about the point-to-point communications in https://nwchemex-project.github.io/ParallelZone/developer/design/ram.html#proposed-apis. In the example code all ranks call `rt.at(1).ram().send(data)`. Only the sender, rank 0, has filled the optional `data` object, and rank 1 is identified in the call as the destination, so obviously this can work (all other ranks simply do nothing). What is not clear from the example is whether every rank has to call the send. I would think that in a proper parallel code only the pair of ranks that are communicating need to know that communication is happening. It should be OK for all other ranks to have no knowledge about that communication. So this should also work:
```cpp
auto rt = get_runtime();
std::optional<decltype(fill_in_data())> data;
std::optional<decltype(fill_in_data())> output; // sending information should not change the data type
if(me == rt.at(0)) {
    data.emplace(fill_in_data());
}
if(me == rt.at(0) || me == rt.at(1)) {
    output = rt.at(1).ram().send(data);
}
if(output.has_value()) {
    // This part is only run by rank 1
    // Do stuff with output
}
```
My concern is particularly related to algorithms that get data from multiple places at the same time. For example, consider a matrix-matrix multiplication `C = A * B`, where all three matrices are distributed. If a given rank computes part of `C`, then that rank needs to receive blocks of `A` and `B` from other ranks. But when the data is received you have to know whether it is a block of `A` or a block of `B`. With the API as currently described in the example I gave, that cannot be guaranteed. This would suggest that the `send` has to be a collective operation where the synchronization guarantees that you can match sender and receiver. My concern with this is that it could have serious performance implications. Am I missing something?
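To make the concern concrete, here is a rough MPI-level sketch (deliberately not the proposed PZ API) of the situation I have in mind. The rank numbers, tag value, and block size are made up for illustration; the point is that when both blocks arrive with the same tag from different senders, the receiver has no way to tell which buffer holds the `A` block and which holds the `B` block.

```cpp
#include <mpi.h>
#include <vector>

// Illustration only (run with at least 3 ranks): rank 1 owns a block of A,
// rank 2 owns a block of B, and rank 0 computes the corresponding block of C.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int me;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    constexpr int block_size = 100; // made-up block size
    constexpr int tag        = 0;   // the SAME tag for both messages

    if (me == 1 || me == 2) {
        std::vector<double> block(block_size, me); // rank 1: A block, rank 2: B block
        MPI_Send(block.data(), block_size, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD);
    } else if (me == 0) {
        std::vector<double> buf1(block_size), buf2(block_size);
        MPI_Status s1, s2;
        // The two messages come from different senders, so their arrival order is
        // not guaranteed. Without distinct tags (or inspecting s1.MPI_SOURCE),
        // rank 0 cannot know whether buf1 holds the A block or the B block.
        MPI_Recv(buf1.data(), block_size, MPI_DOUBLE, MPI_ANY_SOURCE, tag,
                 MPI_COMM_WORLD, &s1);
        MPI_Recv(buf2.data(), block_size, MPI_DOUBLE, MPI_ANY_SOURCE, tag,
                 MPI_COMM_WORLD, &s2);
    }

    MPI_Finalize();
    return 0;
}
```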
> On the page https://nwchemex-project.github.io/ParallelZone/developer/design/runtime_view.html#proposed-apis the proposed API is templated on RAM. I didn't see that anywhere else. Is this information outdated?
I removed the templating and updated the design to reflect what I was going for.
> ...It should be OK for all other ranks to have no knowledge about that communication. So this should also work...
That's the logic which needs to be under the hood to support the SIMD API, so I would assume what you have would work too; that said, point-to-point is not actually coded up, so I can't say "it does work". The goal is to have as SIMD-like of an API as possible, so I hesitate to require that what you have works, since that would mean we must also support a MIMD-like API.
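For contrast, here is a rough sketch of the SIMD-like pattern the RAM design page is going for, written in the same pseudocode style as your snippet (`get_runtime`, `fill_in_data`, and `me` are the same placeholders, and none of this is implemented yet): every rank executes the same `send` call, only rank 0 has populated the optional, and presumably only rank 1 gets an engaged optional back.

```cpp
// SIMD-like pattern: every rank executes every line (placeholders as above).
auto rt = get_runtime();
std::optional<decltype(fill_in_data())> data;
if (me == rt.at(0)) {
    data.emplace(fill_in_data()); // only rank 0 has data to send
}

// All ranks make the same call; rank 1 is the destination. Presumably only
// rank 1's returned optional is engaged and every other rank gets an empty one.
auto output = rt.at(1).ram().send(data);
if (output.has_value()) {
    // Only rank 1 reaches this branch.
}
```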
> ...This would suggest that the `send` has to be a collective operation where the synchronization guarantees that you can match sender and receiver. My concern with this is that it could have serious performance implications. Am I missing something?
At present these functions just wrap the underlying `MPI_Send` and `MPI_Recv` and would have all the same restrictions as the MPI calls. Are you looking for these calls to expose the tag fields? I updated the call to show a tag. Again, send/receive are not actually implemented in PZ, so I can't state with any certainty that these calls avoid the problems you mention, but presumably if they do not avoid those problems it will be because the underlying MPI calls themselves don't.
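As an illustration of what exposing the tag buys you in your matrix-multiplication example, here is a rough MPI-level sketch (again, not the PZ wrapper itself; the tag values, ranks, and block size are made up): distinct tags let the receiving rank match each incoming message to `A` or `B` regardless of arrival order or sender.

```cpp
#include <mpi.h>
#include <vector>

// Illustration only: distinct tags disambiguate the A block from the B block.
constexpr int TAG_A = 100; // made-up tag for blocks of A
constexpr int TAG_B = 200; // made-up tag for blocks of B

void exchange_blocks(int me, int block_size) {
    if (me == 1) {        // rank 1 sends its block of A to rank 0
        std::vector<double> a_block(block_size, 1.0);
        MPI_Send(a_block.data(), block_size, MPI_DOUBLE, 0, TAG_A, MPI_COMM_WORLD);
    } else if (me == 2) { // rank 2 sends its block of B to rank 0
        std::vector<double> b_block(block_size, 2.0);
        MPI_Send(b_block.data(), block_size, MPI_DOUBLE, 0, TAG_B, MPI_COMM_WORLD);
    } else if (me == 0) {
        std::vector<double> a(block_size), b(block_size);
        // The tag, not the arrival order, decides which buffer each message fills.
        MPI_Recv(a.data(), block_size, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_A,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(b.data(), block_size, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_B,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}
```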
The purpose of this PR is to establish the 1.0.0 version of ParallelZone. A 1.0.0 version is needed so that we can begin making 1.0.0 versions of the remainder of the stack. This PR also represents the last chance for the organization to weigh in on ParallelZone APIs before they go live.
TODOs: