Closed PhilMiller closed 5 years ago

Failed 10/40 runs at 16 ranks on kahuna. On a failure, I saw the following output:
This is a real bug. Looks like the value of the reduction is wrong sometimes (extracting this from the above output)!

```
vt/examples/../tutorial/tutorial_1h.h:63:
void vt::tutorial::ReduceResult::operator()(vt::tutorial::ReduceDataMsg*):
Assertion `num_nodes * 50 == msg->getConstVal()' failed.
```
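For context, every node in the tutorial presumably contributes the value 50 to a plus-reduction, so the root expects exactly `num_nodes * 50`. A standalone sketch of that invariant (plain C++, deliberately not using the vt API):

```c++
#include <cassert>
#include <numeric>
#include <vector>

int main() {
  int const num_nodes = 4;                    // hypothetical node count
  std::vector<int> contrib(num_nodes, 50);    // each node contributes 50
  int const sum = std::accumulate(contrib.begin(), contrib.end(), 0);
  assert(num_nodes * 50 == sum);              // the invariant tutorial_1h.h asserts
  return 0;
}
```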
Meriadeg: can you look into this issue and try to fix this bug?
@mperrinel Have you had a chance to look into this bug?
@PhilMiller @lifflander I could reproduce it on 16 ranks with oversubscription. I made several tests to figure out the root cause of the bug.

- Regarding the output, I first looked at the `test_collective_reduce` tests to check potential failure cases. In fact, there were no assertions on expected values in any test, only debug prints. So I updated the fixture by adding these missing assertions in the first place:
```c++
enum struct ReduceOP : int {
  Plus = 0,
  Max = 1,
  Min = 2
};

// Fixture functor checking the reduced value against the expected result
// (struct wrapper reconstructed from the garbled snippet). Expected values
// assume each node contributes its own rank: 0 + 1 + ... + (n-1) = n(n-1)/2.
template <ReduceOP oper>
struct Verify {
  template <typename SysMsg>
  void operator()(SysMsg* msg) {
    auto const value = msg->getConstVal();
    auto const n = vt::theContext()->getNumNodes();
    switch (oper) {
      case ReduceOP::Plus: EXPECT_EQ(value, n * (n - 1) / 2); break;
      case ReduceOP::Min:  EXPECT_EQ(value, 0);               break;
      case ReduceOP::Max:  EXPECT_EQ(value, n - 1);           break;
      default: vtAbort("Failure: should not be reached");     break;
    }
  }
};
```
I ran the tests several times and everything was fine, even on 24 ranks.
- I noticed that the example in `tutorial_1h.h` is pretty similar to `test_reduce_plus_default_op`, and I was wondering why it behaves differently. So I changed the default builtin plus handler to a custom one:
```c++
static inline void reducePlus(ReduceDataMsg* msg) {
  if (msg->isRoot()) {
    // at the root: verify the fully reduced value
    auto const n = ::vt::theContext()->getNumNodes();
    assert(n * 50 == msg->getConstVal());
  } else {
    // otherwise: fold the remaining contributions into the first message
    auto* fst_msg = msg;
    auto* cur_msg = msg->getNext<ReduceDataMsg>();
    while (cur_msg not_eq nullptr) {
      fst_msg->getVal() += cur_msg->getConstVal();
      cur_msg = cur_msg->getNext<ReduceDataMsg>();
    }
  }
}
```

It behaves exactly like the builtin handler. In fact, the failure occurs regardless of the reduction operation (plus, min, max, etc.).
- I then looked at the call sequence in `tutorial_main.h`. After several attempts, it turns out that `activeMessageGroupCollective` and `activeMessageReduce` interfere with each other, even when already separated by a barrier:

```c++
::vt::theCollective()->barrier();
activeMessageGroupCollective();
::vt::theCollective()->barrier();
activeMessageReduce();
```

In fact, `activeMessageGroupCollective` only enables even nodes. If I change the order of the calls, then everything is OK (40/40 successful runs on 24 ranks).
It turns out that there may be an issue in `theGroup()->newGroupCollective` in `tutorial_1f.h`. I suspect that it causes some ranks not to be engaged in the following reduction.
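For reference, the group creation in `tutorial_1f.h` follows roughly this pattern (a sketch; the exact `newGroupCollective` callback signature is an assumption):

```c++
// Sketch (assumed signature): each node declares whether it belongs to the
// new group; here only even-numbered nodes join, per the tutorial.
auto const this_node = ::vt::theContext()->getNode();
bool const is_even = this_node % 2 == 0;
::vt::theGroup()->newGroupCollective(
  is_even, [](::vt::GroupType group){
    // invoked once the group is set up; the tutorial reduces over it later
  }
);
```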
This makes me think that the bug does not concern the reduction at all, contrary to what the error output suggests. I will dig deeper and push soon.
@hobywan The barrier only ensures that all nodes execute to that point in the code (not that all work is executed), so it could very well be an interference problem.
@hobywan I might know the problem. Creating a new collective group uses the reduction infrastructure. I think that the system reduction for creating the new collective group might be interfering with the tutorial reduction.
@lifflander That sort of interference suggests an issue with the interface design of the reductions. The system should be working in a context entirely isolated from any user code, and user code should be able to use distinct contexts for anything that's not itself sequenced.
@PhilMiller I agree. First, we need to confirm that this is actually the issue and then solve the broader problem
@hobywan Please try the branch `feature-261-reduction-bug` to see if you can recreate the bug. If you can't, this confirms the problem. I don't necessarily want to merge this code, but rather to find a better fix for it.
I tried the fix in #261 and I could not reproduce the bug anymore (50/50 successful runs with 24 ranks). It seems that the bug is due to a lack of context isolation, which causes the collective group creation in `tutorial_1f.h` and the reduction in `tutorial_1e.h` to interfere with each other. I was about to dig into the collective group mechanism and to provide tests along the way.
@hobywan I really like the changes you made; looks great! Should we create a pull request for this? Do you have ideas on how to isolate the reductions in a more general way?
@lifflander Thanks 😊
I think it may be worthwhile to at least merge the test updates, as well as the quick fix in aec149b to make the tutorial work correctly. I do not know of a robust way to fix this isolation issue in a general way yet, but I will think about it when trying to fix the problem in #240.
I have taken your fixes in #240 into account to figure out whether both issues are related. I thought that a correct reset of the static data within `group_manager` would have an impact on it. I re-ran the tutorial without the explicit tag in the call to `reduce`, keeping the initial call order of `activeMessageGroupCollective` and `activeMessageReduce`. It still fails for 7/40 runs on 16 ranks though, which confirms that this isolation problem is a separate issue.

In the first place, I would keep the idea of managing a default-valued tag for each call to `reduce`. For that, a static counter initialized with a random seed, or a randomly generated tag, could be supplied within `theCollective`, instead of a single default tag. It would provide a different default tag for each call to the collective, even if invoked in parallel with the same data and node group.
The fix aec149b is still needed. The tag needs to be unique.
Any random solution would require that it be seeded uniformly across all the nodes. If we have two distinct reductions, the message handlers that start them could execute in any order, so any counter-based (or random) solution will not work correctly.
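To illustrate the ordering hazard with any node-local counter (a deliberately naive sketch, not vt code):

```c++
// Two concurrent reductions A and B each draw a tag from a node-local counter.
// If node 0 happens to start A first while node 1 starts B first:
//   node 0: A -> tag 0, B -> tag 1
//   node 1: B -> tag 0, A -> tag 1
// Node 0's contribution to A then matches node 1's contribution to B,
// silently corrupting both reductions.
static int next_tag = 0;
int tagForNextReduce() { return next_tag++; }
```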
Feature #290 is the longer-term solution to this problem. For now, I suggest we merge aec149b as the short-term fix.
@hobywan After thinking about this last night, I realized there is a fix that will be correct. The reduce API takes a tag and an epoch. We should use a special tag, but also use the group ID as the epoch. Therefore, if multiple groups are created concurrently, they will be correct no matter in what order they progress. It should be fairly easy to implement. What do you think about that?
I agree. The combination of a special tag and an epoch should prevent concurrency issues, and it should be quite straightforward to implement. Do you have specific requirements on how the tags should be generated though?
`theCollective->reduce(..)` already takes an epoch. Just pass `group_` (in that function), obtained from `getGroupID()`, as the epoch. For the tag, let's create a header file with "system" tags for reduce.
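A sketch of what that change might look like inside the group manager (the message type, handler, and system-tag names are hypothetical; the tag/epoch arguments are the point):

```c++
// Sketch: the internal reduction used while creating a collective group
// passes a reserved system tag plus the group ID as the reduce epoch, so
// concurrent group creations (and user reductions) can never be conflated.
auto const group_ = getGroupID();       // unique & consistent across nodes
theCollective()->reduce<GroupInfoMsg, groupReduceHandler>(
  root_node, msg,
  tags::group_create_reduce,            // hypothetical "system" tag
  group_                                // epoch := group ID
);
```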
Indeed, using the group ID as an epoch is straightforward. The tag could be generated from a hash involving this ID, or not. I can test that soon if that suits you.
Yes, sounds good
Maybe I'm missing something, but it sounds like you're describing using an arbitrary value from outside the termination detection machinery as the value of an 'epoch' field. If so, that strikes me as confusing and potentially dangerous for future changes. Could you maybe partition the bits in the `tag` field and use the ID in question as part of that, in the same way?
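For instance, the split could look like this (widths and types purely illustrative):

```c++
#include <cstdint>

// Illustrative partition of a 32-bit tag: the high bits carry an ID
// (e.g. a group ID) and the low bits carry the per-use tag value.
constexpr std::uint32_t kTagBits = 20;
constexpr std::uint32_t kTagMask = (1u << kTagBits) - 1u;

inline std::uint32_t makePartitionedTag(std::uint32_t id, std::uint32_t tag) {
  return (id << kTagBits) | (tag & kTagMask);
}
```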
@PhilMiller No, I'm suggesting the epoch be the group ID, which is guaranteed unique and consistent across nodes for a given group. Groups are created asynchronously, so their reductions will not interfere. The tag is a slightly broader problem, but we are going to create system tags for the group.
Ah, now I get what you are saying. Epochs are just identifiers used for different purposes right now. You can already pass an epoch to the reduce API and it has nothing to do with TD. Perhaps we should change the name.
Yeah, please call that phase, or step, or sequence, or anything other than epoch.
Merged, closing