chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

spike: diagnose and fix comm=ofi testing failures #11942

Closed: gbtitus closed this issue 5 years ago

gbtitus commented 5 years ago

A preliminary full testing run with comm=ofi and the makeshift mpirun4ofi launcher on our chapcs cluster covered the first 292 of our test suites before the slurm job allocation it was running in timed out. It produced 2444 passes and 26 failures. Separately, a run of the multilocale suite produced 144 passes and 15 failures.

A run of the multilocale suite with comm=ofi on a Cray XC system produces 15-20 failures depending on circumstances (provider gni vs. sockets, which tasking layer, etc.).

In this spike I will diagnose and fix as many of these failures as I can, and also extend the full testing coverage to more suites.

gbtitus commented 5 years ago

I completed a full chapcs testing run on 2018-01-07. There were 193 failures. They broke down as follows:

bradcray commented 5 years ago

A few thoughts:

gbtitus commented 5 years ago
  • For the timeouts, do you know whether they are because we're running slower and truly timing out, rather than deadlocking, say?

I don't know this for sure, but since they ran to completion on a Cray XC with comm=ofi and the sockets provider I think they probably just reflect much slower communication using sockets on vanilla Linux over chapcs's IB than using sockets on CLE over XC's Aries. (In other words, I don't believe XC's advantage here is limited to just having better network hardware.) But I do need to do some runs to confirm this.

  • The "array out of bounds" and "attempt to dereference nil" errors seem surprising to me given that those are at the Chapel level well above the comm layer. Unless the comm layer was overwriting arbitrary memory? (can this configuration be run with valgrind?) That said, I'd personally probably chase after the divide by zero issue in RA before these ones because that one seems weirder.

I think all these are symptomatic of one or more communication bugs. I.e., we're picking up trash that leads to mis-indexing and nil derefs and div-by-zero, because somewhere we're sending or receiving the wrong bits or the wrong number of them. It's plausible there's just one fundamental problem, with a lot of symptoms.

gbtitus commented 5 years ago
  • 22 comm diagnostics mismatches. Most of these are due to comm=ofi as yet lacking executeOnFast support.

My diagnosis here was incorrect. Comm=ofi indeed doesn't fully support executeOnFast, in the sense that instead of calling the on-stmt body function directly from the AM handler, it runs it in a task as if it were a regular executeOn. But it nevertheless reports executeOnFast in the comm diags as if it were supported, so the output should match.

What's really happening in all of the comm=ofi comm diags mismatches is that the existing .good files expect what comm=gasnet produces, because that's the only comm config we run these in. Comm=gasnet doesn't (yet) support network atomics, so the .good files reflect some executeOnFast ops which implement remote AMOs. But comm=ofi supports network atomics directly, so it doesn't need to use executeOnFast for those, and they aren't in its comm diags report.

As supporting information, I've confirmed that comm=ugni (which also supports network atomics) produces the same output as comm=ofi does for all these failures, with one exception which appears to be just an output ordering difference.
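
To make the mismatch concrete, here's a minimal sketch (not one of the failing tests, just the pattern they share): a remote atomic update performed inside a comm diagnostics window, run with two or more locales. The comments note how the counts differ between configurations with and without network atomics.

```chapel
use CommDiagnostics;

var x: atomic int;                  // allocated on Locale 0

startCommDiagnostics();
on Locales[numLocales-1] do
  x.add(1);                         // remote atomic add targeting Locale 0
stopCommDiagnostics();

// Print per-locale counts (gets, puts, executeOn, executeOnFast, AMOs, ...).
// Under comm=gasnet the remote add is implemented with an executeOnFast back
// to Locale 0; under comm=ofi or comm=ugni it is a network AMO, so the
// executeOnFast count differs and a single shared .good file can't match both.
for (loc, d) in zip(Locales, getCommDiagnostics()) do
  writeln(loc, ": ", d);
```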

After discussion with @ronawho, the plan to deal with these is to introduce another sub_test .good file naming flavor, which will allow using <test>.na-${CHPL_NETWORK_ATOMICS}.good to specify network atomics-specific .good files. Then I'll rename the existing .good files for these failing tests to be .na-none.good so they'll match the comm=gasnet output, and add new plain .good files which match the comm=ofi and comm=ugni output.
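
With that scheme, a test in this category (call it foo.chpl, a hypothetical name) would carry two expected-output files, along these lines:

```
foo.chpl
foo.good          # expected output with network atomics (comm=ofi, comm=ugni)
foo.na-none.good  # expected output without network atomics (comm=gasnet, CHPL_NETWORK_ATOMICS=none)
```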

@bradcray, if you have a moment I'd appreciate a yay/nay from you on this plan.

bradcray commented 5 years ago

This sounds OK to me.