Open zhongchen530 opened 5 days ago
What a great question @zhongchen530 - thank you.
I believe the SOS implementation is correct, but I understand the confusion. Here is how I think of the team split semantic in terms of your example above:
Yes, the split of SHMEM_TEAM_WORLD into new_team results in the following global PEs making up new_team:
new_team : {0,2,4}
However, the OpenSHMEM specification says the following:
PEs in a newly created team are consecutively numbered starting with PE number 0. PEs are ordered by their PE number in the parent team. Team-relative PE numbers can be used for point-to-point operations through team-based contexts...
So within new_team, the team-relative PE numbering is actually:
new_team : {0,1,2}
The specification example for shmem_team_create_ctx illustrates how this works, and to some extent how it can be useful.
As an example for demonstration, if you did the following:
shmem_team_create_ctx(new_team, 0, &new_ctx); // assume this is successful
if ( shmem_team_my_pe(new_team) == 1 ) {
    shmem_put(new_ctx, dest, source, nelems, 2); // This means global PE 2 puts to PE 4 on the world team!
}
So with respect to the world team indexing, PE 2 does a put to PE 4. But with respect to the team-relative indexing, PE 1 in new_team is doing a put to PE 2 in new_team.
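To make the two numbering schemes concrete, here is a minimal plain-C sketch of the mapping, with no OpenSHMEM calls. The strided_team struct and both helper functions are hypothetical names used only for illustration; in a real program, shmem_team_translate_pe is the routine that translates a PE number between teams.

```c
#include <assert.h>

/* Hypothetical model: a team described as a strided set of world PEs. */
typedef struct {
    int start;   /* world PE that is team-relative PE 0 */
    int stride;  /* distance, in world PEs, between consecutive members */
    int size;    /* number of PEs in the team */
} strided_team;

/* Map a team-relative PE number to its world PE number.
 * Returns -1 if team_pe is not a valid index in the team. */
int team_pe_to_world(strided_team t, int team_pe) {
    assert(t.stride > 0 && t.size > 0);
    if (team_pe < 0 || team_pe >= t.size) {
        return -1;
    }
    return t.start + team_pe * t.stride;
}

/* Map a world PE number to its team-relative number.
 * Returns -1 if the world PE is not a member of the team. */
int world_pe_to_team(strided_team t, int world_pe) {
    assert(t.stride > 0 && t.size > 0);
    int offset = world_pe - t.start;
    if (offset < 0 || offset % t.stride != 0) {
        return -1;
    }
    int team_pe = offset / t.stride;
    return (team_pe < t.size) ? team_pe : -1;
}
```

With new_team modeled as start 0, stride 2, size 3, team-relative PE 1 maps to world PE 2 and team-relative PE 2 maps to world PE 4, which matches the put described above.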
This means that in your example, the (start, stride, size) split of (0, 2, 2) will result in:
another_team = {0,4}
with respect to the world team, but actually within team-relative numbering it's still:
another_team = {0,1}
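The arithmetic behind that expectation can be sketched in plain C, assuming the (start, stride, size) arguments are interpreted in the parent team's own numbering. The team_desc struct and both functions are hypothetical and exist only for this illustration; this is not how any particular OpenSHMEM implementation is written.

```c
#include <assert.h>

/* Hypothetical description of a team in world-PE terms. */
typedef struct {
    int start;   /* world PE of team-relative PE 0 */
    int stride;  /* world-PE distance between consecutive members */
    int size;    /* number of members */
} team_desc;

/* Split a parent team, treating start and stride as team-relative
 * to the parent (the interpretation described above). */
team_desc split_relative_to_parent(team_desc parent,
                                   int start, int stride, int size) {
    assert(stride > 0 && size > 0);
    /* The last selected member must still be inside the parent. */
    assert(start >= 0 && start + (size - 1) * stride < parent.size);

    team_desc child;
    child.start  = parent.start + start * parent.stride;
    child.stride = parent.stride * stride;  /* strides compose by multiplying */
    child.size   = size;
    return child;
}

/* World PE number of the i-th member of a team. */
int member_world_pe(team_desc t, int i) {
    assert(i >= 0 && i < t.size);
    return t.start + i * t.stride;
}
```

Starting from a 6-PE world team (start 0, stride 1, size 6), a split of (0, 2, 3) gives world PEs {0,2,4}, and a further split of that team with (0, 2, 2) gives a world stride of 4, so the members are {0,4}.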
Does that help?
Please let me know if you know of any other OpenSHMEM implementations that do not work like this... I believe this is the correct behavior, but there may be room for improvement in how the specification explains this.
I'm adding some more eyes who might be interested and can ensure I'm not mistaken: @wrrobin @lstewart @wokuno
@zhongchen530 - Oops! I had to write all that to see that maybe SOS does not behave how I described, and you found an issue. Will investigate..
Yes, your explanation above is what I initially expected, but it doesn't behave that way. Instead, another_team is observed to be {0,2}.
The function SHMEM_TEAM_SPLIT_STRIDED takes a parent team plus start, stride, and size arguments to produce a new team. I would expect the stride to be relative to the parent team. However, what I observed is that the stride is treated as relative to SHMEM_TEAM_WORLD instead. For instance, if the number of PEs is 6, meaning SHMEM_TEAM_WORLD : {0,1,2,3,4,5}, then
shmem_team_split_strided(SHMEM_TEAM_WORLD, 0, 2, 3, NULL, 0, &new_team);
would result in
new_team : {0,2,4}
where numbering is relative to SHMEM_TEAM_WORLD.
If I again call
shmem_team_split_strided(new_team, 0, 2, 2, NULL, 0, &another_team);
this would result in
another_team : {0, 2}
where numbering is relative to SHMEM_TEAM_WORLD.
The stride is still relative to the SHMEM_TEAM_WORLD team, not the parent team new_team passed as an argument to the function. If the stride were relative to new_team, we would expect {0,4} instead.
Is this intended behavior or is it a bug?
All numbering used to denote a PE is relative to SHMEM_TEAM_WORLD.