chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Investigate performance regressions from Meltdown/Spectre patches #8969

Closed ronawho closed 6 years ago

ronawho commented 6 years ago

We saw some non-trivial performance regressions for 16-node-xc around the time patches for Meltdown and Spectre were applied: https://chapel-lang.org/perf/16-node-xc/?startdate=2018/02/21&enddate=2018/03/12&graphs=npbepperfmopssized,hpccraatomicsperfgupsn233,hpccrarmoperfgupsn233,hpccglobalstreamperfgbsn5723827200,emptyremotetaskspawntime

The two most concerning regressions were for RA atomic/rmo and stream-global, particularly because the reference versions weren't impacted. Our numbers look fine in release-over-release timings, but we are now performing worse than the reference C+MPI+OpenMP versions, whereas before we were on par.

When investigating we should probably start with stream-global since it's the simplest benchmark. I don't think we're doing many (any?) kernel calls for stream-global's timed section, so I was surprised to see the performance regression (though I don't fully understand the implications of the Meltdown/Spectre patches yet.)
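For reference, here's a minimal sketch of what stream-global's timed kernel amounts to (illustrative names and sizes, not the benchmark's actual source): a promoted triad over a Block-distributed domain, so the timed work should be one task fan-out across locales followed by purely local vector updates.

```chapel
// Minimal sketch of a stream-global-style triad (illustrative, not the
// benchmark's actual source).
use BlockDist;

config const m = 1000000;    // problem size (illustrative default)
config const alpha = 3.0;    // triad scalar

proc main() {
  // Block-distribute the problem across all locales.
  const ProblemSpace = {1..m} dmapped Block(boundingBox={1..m});
  var A, B, C: [ProblemSpace] real;

  B = 1.0;
  C = 2.0;

  // The timed kernel: a promoted triad. Each locale updates only its own
  // chunk, so beyond the initial task fan-out there should be essentially
  // no communication or kernel involvement in the timed region.
  A = B + alpha * C;

  writeln("A[1] = ", A[1]);
}
```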

Since stream-ep wasn't really impacted, my guess is that our coforall+on idiom was hurt, but the coforall+on microbenchmark wasn't hurt under ugni. The microbenchmark probably ends up using small AMs, so it's possible that large AMs (which will be used for stream-global) were hurt. It's probably worth beefing up that benchmark to include a case that will trigger large AMs.
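For context, here is a minimal sketch of the coforall+on spawn-time microbenchmark idiom (illustrative names, not the actual test in the repo). The empty-body version exercises the small-AM path; presumably a variant that ships a sizable by-value argument bundle with each task is what would be needed to trigger large AMs.

```chapel
// Minimal sketch of a coforall+on spawn-time microbenchmark (illustrative,
// not the actual test in the repo).
use Time;

config const numTrials = 1000;

proc main() {
  var t: Timer;
  t.start();
  for 1..numTrials {
    // One remote task per locale; the on-body is empty, so this measures
    // pure remote task-spawn / active-message overhead (the small-AM case).
    coforall loc in Locales do
      on loc {
        // A "large AM" variant would capture or pass a sizable by-value
        // argument bundle here (an assumption about how to trigger large AMs).
      }
  }
  t.stop();
  writeln("avg coforall+on time (s): ", t.elapsed() / numTrials);
}
```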

When investigating this we should also try running on newer hardware (Skylake or newer) to see if the performance impact is smaller there.

ronawho commented 6 years ago

It looks like interference/contention from the progress thread is much more significant with the patches. Maybe context switching is slower now, and so the pthread/worker hosting the last task (that shares a core with the progress thread) is slowed down even further?

Fortunately we had already been looking into progress thread interference: https://github.com/chapel-lang/chapel/pull/8562

Running with CHPL_RT_COMM_UGNI_BLOCKING_CQ=y, I see even more significant improvements than before, and more importantly the regressions noted above are resolved.

Here's a quick summary of the differences:

| Benchmark | Perf gain |
| --- | --- |
| CoMD elegant | 10% improvement |
| CoMD llnl | 30% improvement |
| NPB EP | 90% improvement |
| RA-rmo | 20% improvement |
| PRK stencil block | 70% improvement |
| PRK stencil stencil | 70% improvement |
| stream-ep | 2% improvement |
| stream-global/prom | 10% improvement |
| single large reduction | 20% improvement |

| Benchmark | Perf loss |
| --- | --- |
| coforall+on microbench | 90% regression |
| coforall+on net AMO | 15% regression |
| many small reductions | 25% regression |
| lulesh | 10% regression |
| hpl | 5% regression |

Note that there are some notable regressions for the coforall+on and multi-trial reduction microbenchmarks, and minor ones for lulesh/hpl. The coforall+on and multi-trial reduction microbenchmarks do a lot of coforall+ons with very little work per trial. So it seems like coforall+on has gotten a little slower, but for benchmarks doing any real amount of work the decreased contention from the progress thread far outweighs the slight increase in spawning times.
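A rough sketch of the kind of pattern those multi-trial microbenchmarks exercise (illustrative, not the actual tests): each trial is a distributed reduction over a small array, so every trial pays the full coforall+on fan-out cost while doing very little per-locale work.

```chapel
// Rough sketch of a "many small reductions" pattern (illustrative).
use BlockDist, Time;

config const numTrials = 10000;
config const n = 1000;          // small per-trial problem size

proc main() {
  const D = {1..n} dmapped Block(boundingBox={1..n});
  var A: [D] int = 1;

  var t: Timer;
  var total = 0;
  t.start();
  for 1..numTrials do
    // Each trial launches a distributed reduction: a coforall+on across
    // locales with only a tiny amount of local work per locale.
    total += (+ reduce A);
  t.stop();

  writeln("checksum = ", total,
          ", avg reduction time (s) = ", t.elapsed() / numTrials);
}
```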

bradcray commented 6 years ago

I've said this to Elliot offline, but I think we should consider flipping the default for CHPL_RT_COMM_UGNI_BLOCKING_CQ.

ronawho commented 6 years ago

The investigation part of this is done. I've opened https://github.com/chapel-lang/chapel/issues/9067 to track the core problem (increased context switch cost increases progress thread interference).

https://github.com/chapel-lang/chapel/pull/9068 resolves this for ugni, but not for gasnet. ugni is the more important config, but at some point down the road we'll want to address this for gasnet as well.