chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

spike: diagnose and fix comm=ofi testing failures #11942

Closed: gbtitus closed this issue 5 years ago

gbtitus commented 5 years ago

A preliminary full testing run with comm=ofi and the makeshift mpirun4ofi launcher on our chapcs cluster covered the first 292 of our test suites before the slurm job allocation it was running in timed out. It produced 2444 passes and 26 failures. Separately, a run of the multilocale suite produced 144 passes and 15 failures.

A run of the multilocale suite with comm=ofi on a Cray XC system produces 15-20 failures depending on circumstances (provider gni vs. sockets, which tasking layer, etc.).

In this spike I will diagnose and fix as many of these failures as I can, and also extend the full testing coverage to more suites.

gbtitus commented 5 years ago

I completed a full chapcs testing run on 2018-01-07. There were 193 failures. They broke down as follows:

bradcray commented 5 years ago

A few thoughts:

gbtitus commented 5 years ago
  • For the timeouts, do you know whether they are because we're running slower and truly timing out, rather than deadlocking, say?

I don't know this for sure, but since they ran to completion on a Cray XC with comm=ofi and the sockets provider I think they probably just reflect much slower communication using sockets on vanilla Linux over chapcs's IB than using sockets on CLE over XC's Aries. (In other words, I don't believe XC's advantage here is limited to just having better network hardware.) But I do need to do some runs to confirm this.

  • The "array out of bounds" and "attempt to dereference nil" errors seem surprising to me given that those are at the Chapel level well above the comm layer. Unless the comm layer was overwriting arbitrary memory? (can this configuration be run with valgrind?) That said, I'd personally probably chase after the divide by zero issue in RA before these ones because that one seems weirder.

I think all these are symptomatic of one or more communication bugs. I.e., we're picking up trash that leads to mis-indexing and nil derefs and div-by-zero, because somewhere we're sending or receiving the wrong bits or the wrong number of them. It's plausible there's just one fundamental problem, with a lot of symptoms.

gbtitus commented 5 years ago
  • 22 comm diagnostics mismatches. Most of these are due to comm=ofi as yet lacking executeOnFast support.

My diagnosis here was incorrect. Comm=ofi indeed doesn't fully support executeOnFast, in the sense that instead of calling the on-stmt body function directly from the AM handler, it runs it in a task as if it were a regular executeOn. But it nevertheless reports executeOnFast in the comm diags as if it were supported, so the output should match.

What's really happening in all of the comm=ofi comm diags mismatches is that the existing .good files expect what comm=gasnet produces, because that's the only comm config we run these in. Comm=gasnet doesn't (yet) support network atomics, so the .good files reflect some executeOnFast ops which implement remote AMOs. But comm=ofi supports network atomics directly, so it doesn't need to use executeOnFast for those, and they aren't in its comm diags report.

As supporting information, I've confirmed that comm=ugni (which also supports network atomics) produces the same output as comm=ofi does for all these failures, with one exception which appears to be just an output ordering difference.
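
To make the mismatch concrete, here's a minimal sketch (not one of the failing tests, just the pattern they share): a remote atomic update performed inside a comm diagnostics window, run with two or more locales. The comments note how the counts differ between configurations with and without network atomics.

```chapel
use CommDiagnostics;

var x: atomic int;                  // allocated on Locale 0

startCommDiagnostics();
on Locales[numLocales-1] do
  x.add(1);                         // remote atomic add targeting Locale 0
stopCommDiagnostics();

// Print per-locale counts (gets, puts, executeOn, executeOnFast, AMOs, ...).
// Under comm=gasnet the remote add is implemented with an executeOnFast back
// to Locale 0; under comm=ofi or comm=ugni it is a network AMO, so the
// executeOnFast count differs and a single shared .good file can't match both.
for (loc, d) in zip(Locales, getCommDiagnostics()) do
  writeln(loc, ": ", d);
```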

After discussion with @ronawho, the plan to deal with these is to introduce another sub_test .good file naming flavor, which will allow using <test>.na-${CHPL_NETWORK_ATOMICS}.good to specify network atomics-specific .good files. Then I'll rename the existing .good files for these failing tests to be .na-none.good so they'll match the comm=gasnet output, and add new plain .good files which match the comm=ofi and comm=ugni output.
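
With that scheme, a test in this category (call it foo.chpl, a hypothetical name) would carry two expected-output files, along these lines:

```
foo.chpl
foo.good          # expected output with network atomics (comm=ofi, comm=ugni)
foo.na-none.good  # expected output without network atomics (comm=gasnet, CHPL_NETWORK_ATOMICS=none)
```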

@bradcray, if you have a moment I'd appreciate a yay/nay from you on this plan.

bradcray commented 5 years ago

This sounds OK to me.