DARMA-tasking / vt

DARMA/vt => Virtual Transport

tutorial test on 16 ranks fails intermittently #261

Closed: PhilMiller closed this issue 5 years ago

PhilMiller commented 5 years ago

Failed 10/40 runs at 16 ranks on kahuna. On a failure, I saw the following output:

180: Test command: /opt/spack/c7.6/stock-20190108/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/openmpi-3.1.3-acxkgcmzmuwfofeqduyiw6mlxc72jiih/bin/mpiexec "-n" "16" "./tutorial"
180: Test timeout computed to be: 60
180: vt: Runtime Initializing: mode: single-thread per rank
180: vt: Running on: 16 nodes
180: vt: Machine Hostname: kn20.eth
180: vt: Build SHA: bf31cdec7282fac8d1921ee86b09d10219f0e899
180: vt: Build Ref: refs/heads/develop
180: vt: Description: heads/develop-0-gbf31cdec72 
180: vt: Compile-time Features Enabled:
180: vt:    OpenMP Threading
180: vt:    Memory Pooling
180: vt:    C++ Trait Detector
180: vt: Runtime Configuration:
180: vt:    Default: Termination hang detection enabled by default, use --vt_no_detect_hang to disable
180: vt:    Option: flag --vt_hang_detect on: Detecting hang every 1024 tree traversals 
180: vt:    Default: SIGINT signal handling enabled by default, use --vt_no_SIGINT to disable
180: vt:    Default: SIGSEGV signal handling enabled by default, use --vt_no_SIGSEGV to disable
180: vt:    Default: Color output enabled by default, use --vt_no_color to disable
180: vt:    Default: Stack dumps enabled by default, use --vt_no_stack to disable
180: this_node=0, num_ndoes=16
180: msgHandlerA: triggered on node=1
180: msgHandlerX: triggered on node=2
180: msgHandlerX: triggered on node=6
180: msgHandlerX: triggered on node=14
180: msgSerialA: triggered on node=1
180: msgHandlerX: triggered on node=13
180: msgHandlerX: triggered on node=5
180: msgHandlerX: triggered on node=11
180: msgHandlerX: triggered on node=12
180: msgHandlerX: triggered on node=1
180: msgHandlerB: triggered on node=0
180: msgHandlerX: triggered on node=3
180: msgHandlerX: triggered on node=4
180: msgHandlerX: triggered on node=7
180: msgHandlerX: triggered on node=9
180: msgHandlerX: triggered on node=8
180: msgHandlerX: triggered on node=10
180: msgHandlerX: triggered on node=15
180: getCallbackHandler: triggered on node=1
180: getCallbackHandler: triggered on node=1
180: getCallbackHandler: triggered on node=1
180: 1: triggering active message callback bcast
180: 3: triggering active message callback bcast
180: 4: triggering active message callback bcast
180: 9: triggering active message callback bcast
180: 7: triggering active message callback bcast
180: 8: triggering active message callback bcast
180: 10: triggering active message callback bcast
180: getCallbackHandler: triggered on node=1
180: 15: triggering active message callback bcast
180: 0: triggering void function callback
180: 0: triggering active message callback
180: 0: triggering active message callback bcast
180: 0: triggering context callback
180: 2: triggering active message callback bcast
180: Group is created: id=2
180: 14: triggering active message callback bcast
180: 13: triggering active message callback bcast
180: 11: triggering active message callback bcast
180: 6: triggering active message callback bcast
180: msgHandlerGroupA: triggered on node=1
180: msgHandlerGroupA: triggered on node=2
180: 5: triggering active message callback bcast
180: 12: triggering active message callback bcast
180: msgHandlerGroupA: triggered on node=14
180: msgHandlerGroupA: triggered on node=9
180: msgHandlerGroupA: triggered on node=7
180: msgHandlerGroupA: triggered on node=13
180: msgHandlerGroupA: triggered on node=11
180: msgHandlerGroupA: triggered on node=8
180: msgHandlerGroupA: triggered on node=12
180: msgHandlerGroupA: triggered on node=6
180: msgHandlerGroupA: triggered on node=15
180: msgHandlerGroupA: triggered on node=10
180: msgHandlerGroupA: triggered on node=3
180: msgHandlerGroupA: triggered on node=4
180: msgHandlerGroupA: triggered on node=5
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: msgHandlerGroupB: triggered on node=0
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: msgHandlerGroupB: triggered on node=2
180: MyCol::msgHandler index=5, node=2
180: 1: recurHandler: ttl=7, triggered
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: Group is created: id=fffc00000007ffff
180: msgHandlerGroupB: triggered on node=6
180: msgHandlerGroupB: triggered on node=14
180: msgHandlerGroupB: triggered on node=12
180: tutorial: /home/pbmille/repos/vt-auto-build/vt/vt/examples/../tutorial/tutorial_1h.h:63: void vt::tutorial::ReduceResult::operator()(vt::tutorial::ReduceDataMsg*): Assertion `num_nodes * 50 == msg->getConstVal()' failed.
180: msgHandlerGroupB: triggered on node=8
180: reduction value=144
180: msgHandlerGroupB: triggered on node=10
180: [kn20:39172] *** Process received signal ***
180: [kn20:39172] Signal: Aborted (6)
180: [kn20:39172] Signal code:  (-6)
180: msgHandlerGroupB: triggered on node=4
180: [kn20:39172] [ 0] /usr/lib64/libpthread.so.0(+0xf5d0)[0x7fbebf5235d0]
180: [kn20:39172] [ 1] /usr/lib64/libc.so.6(gsignal+0x37)[0x7fbebe6b1207]
180: [kn20:39172] [ 2] /usr/lib64/libc.so.6(abort+0x148)[0x7fbebe6b28f8]
180: [kn20:39172] [ 3] /usr/lib64/libc.so.6(+0x2f026)[0x7fbebe6aa026]
180: [kn20:39172] [ 4] /usr/lib64/libc.so.6(+0x2f0d2)[0x7fbebe6aa0d2]
180: [kn20:39172] [ 5] ./tutorial[0x43709c]
180: [kn20:39172] [ 6] ./tutorial[0x440677]
180: [kn20:39172] [ 7] ./tutorial[0x891081]
180: [kn20:39172] [ 8] ./tutorial[0x89112b]
180: [kn20:39172] [ 9] ./tutorial[0x88a7b4]
180: [kn20:39172] [10] ./tutorial[0x87ede6]
180: [kn20:39172] [11] ./tutorial[0x891f24]
180: [kn20:39172] [12] ./tutorial[0x49267f]
180: [kn20:39172] [13] ./tutorial[0x769c78]
180: [kn20:39172] [14] ./tutorial[0x767941]
180: [kn20:39172] [15] ./tutorial[0x76e9bf]
180: [kn20:39172] [16] ./tutorial[0x76f2f6]
180: [kn20:39172] [17] ./tutorial[0x8068ed]
180: [kn20:39172] [18] ./tutorial[0x8069f0]
180: [kn20:39172] [19] ./tutorial[0x8072af]
180: [kn20:39172] [20] ./tutorial[0x412395]
180: [kn20:39172] [21] /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fbebe69d3d5]
180: [kn20:39172] [22] ./tutorial[0x40f37a]
180: [kn20:39172] *** End of error message ***
180: --------------------------------------------------------------------------
180: Primary job  terminated normally, but 1 process returned
180: a non-zero exit code. Per user-direction, the job has been aborted.
180: --------------------------------------------------------------------------
180: --------------------------------------------------------------------------
180: mpiexec noticed that process rank 0 with PID 39172 on node kn20 exited on signal 6 (Aborted).
180: --------------------------------------------------------------------------
1/1 Test #180: tutorial_16 ......................***Failed    1.89 sec
lifflander commented 5 years ago

This is a real bug. It looks like the value of the reduction is sometimes wrong: with 16 nodes the assertion expects 16 * 50 = 800, but the log above shows reduction value=144. Extracting this from the above output:

```
vt/examples/../tutorial/tutorial_1h.h:63:
void vt::tutorial::ReduceResult::operator()(vt::tutorial::ReduceDataMsg*):
Assertion `num_nodes * 50 == msg->getConstVal()' failed.
```
lifflander commented 5 years ago

Meriadeg: can you look into this issue and try to fix this bug?

lifflander commented 5 years ago

@mperrinel Have you had a chance to look into this bug?

hobywan commented 5 years ago

@PhilMiller @lifflander I could reproduce it on 16 ranks with oversubscription. I made several tests to figure out the root cause of the bug. Regarding the output:

- The reduction unit tests check the result with a functor like this one:

  ```c++
  template <ReduceOP oper>
  struct Verify {
    void operator()(SysMsg* msg) {
      auto const value = msg->getConstVal();
      auto const n = vt::theContext()->getNumNodes();

      switch (oper) {
        case ReduceOP::Plus: EXPECT_EQ(value, n * (n - 1) / 2); break;
        case ReduceOP::Min:  EXPECT_EQ(value, 0);               break;
        case ReduceOP::Max:  EXPECT_EQ(value, n - 1);           break;
        default: vtAbort("Failure: should not be reached");     break;
      }
    }
  };
  ```

  I ran these tests several times and everything is fine, even on 24 ranks.

- I noticed that the example in `tutorial_1h.h` is pretty similar to `test_reduce_plus_default_op`, so I was wondering why it behaves differently. I changed the default builtin plus handler to a custom one:
```c++
static inline void reducePlus(ReduceDataMsg* msg) {
  if (msg->isRoot()) {
    auto const n = ::vt::theContext()->getNumNodes();
    assert(n * 50 == msg->getConstVal());
  } else {
    auto* fst_msg = msg;
    auto* cur_msg = msg->getNext<ReduceDataMsg>();
    while (cur_msg not_eq nullptr) {
      fst_msg->getVal() += cur_msg->getConstVal();
      cur_msg = cur_msg->getNext<ReduceDataMsg>();
    }
  }
}

```

It behaves exactly like the builtin handler; the failure occurs regardless of the reduction operation (plus, min, max, etc.).

lifflander commented 5 years ago

@hobywan The barrier only ensures that all nodes execute to that point in the code (not that all work is executed), so it could very well be an interference problem.

lifflander commented 5 years ago

@hobywan I might know the problem. Creating a new collective group uses the reduction infrastructure. I think that the system reduction for creating the new collective group might be interfering with the tutorial reduction.
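
To make the suspected interference concrete, here is a toy model (plain C++, not vt code) of what happens when two reductions end up sharing the same default (tag, epoch) key:

```c++
// Illustrative only, not vt code: two reductions that share the same default
// (tag, epoch) key fold their contributions into one bucket, producing a
// corrupted combined value.
#include <cstdio>
#include <map>
#include <utility>

using Key = std::pair<int /*tag*/, int /*epoch*/>;
std::map<Key, int> buckets;  // partial sums, keyed by (tag, epoch)

void contribute(Key key, int value) { buckets[key] += value; }

int main() {
  Key const default_key{0, 0};   // both reductions fall back to the same key
  contribute(default_key, 50);   // the tutorial reduction's contribution
  contribute(default_key, 1);    // the group-creation system reduction's contribution
  // The tutorial expects its bucket to hold 50, but it holds 51: the system
  // contribution was folded in, which is the same kind of corruption the
  // tutorial assertion catches (800 expected on 16 nodes, 144 observed).
  std::printf("combined=%d\n", buckets[default_key]);
  return 0;
}
```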

PhilMiller commented 5 years ago

@lifflander That sort of interference suggests an issue with the interface design of the reductions. The system should be working in an entirely isolated context from any user code, and user code should be able to use distinct contexts for anything that is not itself sequenced.

lifflander commented 5 years ago

@PhilMiller I agree. First, we need to confirm that this is actually the issue and then solve the broader problem

lifflander commented 5 years ago

@hobywan Please try the branch feature-261-reduction-bug to see if you can recreate the bug. If you can't, this confirms the problem. I don't necessarily want to merge this code; I'd rather find a better fix for this.

hobywan commented 5 years ago

I tried the fix in #261 and I could not reproduce the bug anymore (50/50 successful runs with 24 ranks). It seems that the bug is due to a lack of context isolation, which causes the collective group creation in tutorial_1f.h and the reduction in tutorial_1e.h to interfere with each other. I was planning to dig into the collective group mechanism and provide tests for it anyway.

lifflander commented 5 years ago

@hobywan I really like the changes you made; looks great! Should we create a pull request for this? Do you have ideas on how to isolate the reductions in a more general way?

hobywan commented 5 years ago

@lifflander Thanks 😊 I think it may be worthwhile to merge at least the test updates, and the quick fix in aec149b to make the tutorial work correctly. I do not yet know a robust way to fix this isolation issue in general, but I will think about it while trying to fix the problem in #240.

hobywan commented 5 years ago

I took your fixes in #240 into account to figure out whether both issues are related. I thought that a correct reset of the static data within group_manager would have an impact on it. I re-ran the tutorial without the explicit tag in the call to reduce, keeping the initial call order of activeMessageGroupCollective and activeMessageReduce. It still fails for 7/40 runs on 16 ranks though, which confirms that this isolation problem is a separate issue.

As a first approach, I would keep the idea of managing a default-valued tag for each call to reduce. For that, a static counter initialized with a random seed, or a randomly generated tag, could be supplied within theCollective instead of a single default tag. It would provide a different default tag for each call to the collective, even when invoked in parallel with the same data and node group.

lifflander commented 5 years ago

The fix aec149b is still needed. The tag needs to be unique.

Any random solution would require that it be seeded identically across all the nodes. If we have two distinct reductions, the message handlers that start them could execute in any order on different nodes, so any counter-based (or random) solution will not work correctly.
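
For example, if node 0's handlers start reduction A before B while node 1's handlers start B before A, a per-node counter labels the same reduction with different tags on the two nodes. A toy sketch (not vt code):

```c++
// Illustrative only, not vt code: a per-node counter assigns tags in the order
// reductions are started on that node, so two nodes that start the same two
// reductions in opposite orders label them inconsistently.
#include <cstdio>
#include <string>
#include <vector>

void assignTags(std::vector<std::string> const& start_order) {
  int next_tag = 0;  // the hypothetical per-node static counter
  for (auto const& name : start_order) {
    std::printf("  reduction %s -> tag %d\n", name.c_str(), next_tag++);
  }
}

int main() {
  std::printf("node 0 (handler order A, B):\n");
  assignTags({"A", "B"});  // A=0, B=1
  std::printf("node 1 (handler order B, A):\n");
  assignTags({"B", "A"});  // B=0, A=1 -- node 1 disagrees with node 0 about A
  return 0;
}
```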

Feature #290 is the longer-term solution to this problem. For now, I suggest we merge aec149b as the short-term fix.

lifflander commented 5 years ago

@hobywan After thinking about this last night, I realized there is a fix that will be correct. The reduce API takes a tag and an epoch. We should use a special tag, but also use the group ID as the epoch. Therefore, if multiple groups are created concurrently, they will be correct no matter what order they progress in. It should be fairly easy to implement. What do you think about that?

hobywan commented 5 years ago

I agree. The combination of a special tag and an epoch should prevent concurrency issues, and it should be quite straightforward to implement. Do you have specific requirements on how the tags should be generated, though?

lifflander commented 5 years ago

theCollective->reduce(..) already takes an epoch. Just pass group_ (obtained from getGroupID()) as the epoch in that function. For the tag, let's create a header file with "system" tags for reduce.
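
As an illustration only (the tag value is hypothetical and this is not the actual vt call), here is the earlier toy model with the proposed keying applied:

```c++
// Illustrative only, not the actual vt call: the system reduction uses a
// reserved tag plus the group ID as the epoch, so it lands in its own bucket;
// the tag value and names here are hypothetical.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>

using Key = std::pair<int /*tag*/, std::uint64_t /*epoch*/>;
std::map<Key, int> buckets;

constexpr int user_default_tag = 0;
constexpr int group_system_tag = 0x7000;  // hypothetical reserved system tag

void contribute(Key key, int value) { buckets[key] += value; }

int main() {
  std::uint64_t const group_id = 0xfffc00000007ffffULL;  // group ID from the log above
  contribute({user_default_tag, 0}, 50);        // tutorial reduction
  contribute({group_system_tag, group_id}, 1);  // group-creation reduction
  // The tutorial bucket now holds exactly 50; the system contribution is
  // isolated under its own (tag, epoch) key derived from the group ID.
  std::printf("tutorial=%d system=%d\n",
              buckets[{user_default_tag, 0}],
              buckets[{group_system_tag, group_id}]);
  return 0;
}
```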

hobywan commented 5 years ago

Indeed, using the group ID as an epoch is straightforward. The tag could be generated from a hash involving this ID, or not. I can test that soon if that works for you.

lifflander commented 5 years ago

Yes, sounds good

PhilMiller commented 5 years ago

Maybe I'm missing something, but it sounds like you're describing using an arbitrary value from outside the termination detection machinery as the value of an 'epoch' field. If so, that strikes me as confusing and potentially dangerous for future changes.

Could you maybe partition the bits in the tag field and use the ID in question as part of that, in the same way?

lifflander commented 5 years ago

@PhilMiller No, I'm suggesting the epoch be the group ID, which is guaranteed unique and consistent across nodes for a given group. Groups are created asynchronously, so this means their reductions will not interfere. The tag is a slightly broader problem, but we are going to create system tags for the group.


lifflander commented 5 years ago

Ah, now I get what you are saying. Epochs are just identifiers used for different purposes right now. You can already pass an epoch to the reduce API and it has nothing to do with TD. Perhaps we should change the name.


PhilMiller commented 5 years ago

> Ah, now I get what you are saying. Epochs are just identifiers used for different purposes right now. You can already pass an epoch to the reduce API and it has nothing to do with TD. Perhaps we should change the name.

Yeah, please call that phase, or step, or sequence, or anything other than epoch.

lifflander commented 5 years ago

Merged, closing