I completed a full chapcs testing run on 2018-01-07. There were 193 failures. They broke down as follows:

- 22 comm diagnostics mismatches. Most of these are due to comm=ofi as yet lacking `executeOnFast` support.
- Mismatches against `.good` files. A good portion of these are for tests where we have `.good` and `.comm-gasnet.good` files, but the difference in test behavior is actually due to comm==none vs. comm!=none. These can be resolved just by renaming `.good` to `.comm-none.good` and `.comm-gasnet.good` to `.good` (see the sketch after this list). As a beneficial side effect, many or most of such tests will now pass with comm=ugni, where they didn't before.
- Tests that ordered `stdout` and `stderr` differently than the `.good` file expected.
- Failures involving `make`.
- Timeouts, in `lulesh` and `test3DLulesh`.
- Runtime errors, including "array out of bounds" errors, "attempt to dereference nil" errors, and a divide-by-zero in RA.
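As a concrete illustration of the renaming, here is a minimal sketch for a single test directory. The directory name is hypothetical, and doing this in Chapel via the `FileSystem` module is just for illustration; in practice it would probably be a quick shell one-liner run over the affected suites.

```chapel
// Sketch only: swap the sense of the .good files so the comm!=none
// output becomes the default.  The directory name is hypothetical.
use FileSystem;

config const dir = "test/someSuite";   // hypothetical test directory

for f in listdir(dir, dirs=false) {
  if f.endsWith(".comm-gasnet.good") {
    const base = f.replace(".comm-gasnet.good", "");
    // First, preserve the old default as the comm=none flavor ...
    if try! exists(dir + "/" + base + ".good") then
      try! rename(dir + "/" + base + ".good",
                  dir + "/" + base + ".comm-none.good");
    // ... then promote the comm=gasnet flavor to be the default.
    try! rename(dir + "/" + f, dir + "/" + base + ".good");
  }
}
```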
A few thoughts:
- For the failures related to comm==gasnet vs. comm!=none, I think switching the sense of things is clearly the right thing to do.
- For the timeouts, do you know whether they are because we're running slower and truly timing out rather than deadlocking, say?
- The "array out of bounds" and "attempt to dereference nil" errors seem surprising to me given that those are at the Chapel level, well above the comm layer. Unless the comm layer was overwriting arbitrary memory? (Can this configuration be run with valgrind?) That said, I'd personally probably chase after the divide-by-zero issue in RA before these, because that one seems weirder.
- For the timeouts, do you know whether they are because we're running slower and truly timing out rather than deadlocking, say?
I don't know this for sure, but since they ran to completion on a Cray XC with comm=ofi and the sockets provider I think they probably just reflect much slower communication using sockets on vanilla Linux over chapcs's IB than using sockets on CLE over XC's Aries. (In other words, I don't believe XC's advantage here is limited to just having better network hardware.) But I do need to do some runs to confirm this.
- The "array out of bounds" and "attempt to dereference nil" errors seem surprising to me given that those are at the Chapel level well above the comm layer. Unless the comm layer was overwriting arbitrary memory? (can this configuration be run with valgrind?) That said, I'd personally probably chase after the divide by zero issue in RA before these ones because that one seems weirder.
I think all these are symptomatic of one or more communication bugs. I.e., we're picking up trash that leads to mis-indexing and nil derefs and div-by-zero, because somewhere we're sending or receiving the wrong bits or the wrong number of them. It's plausible there's just one fundamental problem, with a lot of symptoms.
- 22 comm diagnostics mismatches. Most of these are due to comm=ofi as yet lacking `executeOnFast` support.
My diagnosis here was incorrect. Comm=ofi indeed doesn't fully support `executeOnFast`, in the sense that instead of calling the on-stmt body function directly from the AM handler it runs it in a task, as if it were a regular `executeOn`. But it nevertheless reports `executeOnFast` in the comm diags as if it were supported, so the output should match.
What's really happening in all of the comm=ofi comm diags mismatches is that the existing `.good` files expect what comm=gasnet produces, because that's the only comm config we run these in. Comm=gasnet doesn't (yet) support network atomics, so the `.good` files reflect some `executeOnFast` ops which implement remote AMOs. But comm=ofi supports network atomics directly, so it doesn't need to use `executeOnFast` for those, and they aren't in its comm diags report.
As supporting information, I've confirmed that comm=ugni (which also supports network atomics) produces the same output as comm=ofi does for all these failures, with one exception which appears to be just an output ordering difference.
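For reference, here is a minimal Chapel sketch of the pattern behind these mismatches: a remote atomic update that comm=gasnet implements via an `executeOnFast` AM, but that comm=ugni and comm=ofi implement as a network AMO, so the two configurations report different comm diagnostics counts. The specifics are illustrative, not taken from any one failing test.

```chapel
use CommDiagnostics;

var x: atomic int;   // lives on locale 0

startCommDiagnostics();
on Locales[numLocales-1] {
  // With network atomics (comm=ugni/ofi) this is a NIC-level AMO;
  // without them (comm=gasnet) it becomes an executeOnFast AM back
  // to locale 0, and shows up in the diagnostics that way.
  x.add(1);
}
stopCommDiagnostics();

for (loc, d) in zip(Locales, getCommDiagnostics()) do
  writeln(loc, ": ", d);
```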
After discussion with @ronawho, the plan to deal with these is to introduce another sub_test `.good`-file naming flavor, which will allow using `<test>.na-${CHPL_NETWORK_ATOMICS}.good` to specify network-atomics-specific `.good` files. Then I'll rename the existing `.good` files for these failing tests to be `.na-none.good`, so they'll match the comm=gasnet output, and add new plain `.good` files which match the comm=ofi and comm=ugni output.
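To make the intended lookup order concrete, here is a rough sketch of the selection the new naming flavor implies. It's expressed in Chapel purely for illustration (sub_test itself isn't written in Chapel), and the helper name is hypothetical.

```chapel
use FileSystem;

// Hypothetical helper: prefer a network-atomics-specific .good file,
// falling back to the plain one.  'na' would come from
// $CHPL_NETWORK_ATOMICS ("none", "ugni", "ofi", ...).
proc pickGoodFile(test: string, na: string): string {
  const naSpecific = test + ".na-" + na + ".good";
  if try! exists(naSpecific) then
    return naSpecific;
  return test + ".good";
}

writeln(pickGoodFile("myTest", "none"));  // myTest.na-none.good, if present
```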
@bradcray, if you have a moment I'd appreciate a yay/nay from you on this plan.
This sounds OK to me.
A preliminary full testing run with comm=ofi and the makeshift `mpirun4ofi` launcher on our chapcs cluster covered the first 292 of our test suites before the slurm job allocation it was running in timed out. It produced 2444 passes and 26 failures. Separately, a run of the `multilocale` suite produced 144 passes and 15 failures.

A run of the `multilocale` suite with comm=ofi on a Cray XC system produces 15-20 failures depending on circumstances (provider gni vs. sockets, which tasking layer, etc.).

In this spike I will diagnose and fix as many of these failures as I can, and also extend the full testing coverage to more suites.