chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

Investigate performance regressions from Meltdown/Spectre patches #8969

Closed ronawho closed 6 years ago

ronawho commented 6 years ago

We saw some non-trivial performance regressions for 16-node-xc around the time patches for Meltdown and Spectre were applied: https://chapel-lang.org/perf/16-node-xc/?startdate=2018/02/21&enddate=2018/03/12&graphs=npbepperfmopssized,hpccraatomicsperfgupsn233,hpccrarmoperfgupsn233,hpccglobalstreamperfgbsn5723827200,emptyremotetaskspawntime

The two most concerning regressions were for RA atomic/rmo and stream-global, particularly because the reference versions weren't impacted. Our numbers look fine in release-over-release timings, but we are now performing worse than the reference C+MPI+OpenMP versions, whereas before we were on par.

When investigating we should probably start with stream-global since it's the simplest benchmark. I don't think we're doing many (any?) kernel calls for stream-global's timed section, so I was surprised to see the performance regression (though I don't fully understand the implications of the Meltdown/Spectre patches yet.)
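For reference, here's a minimal sketch of what stream-global's timed kernel amounts to (illustrative names and sizes, not the benchmark's actual source): a promoted triad over a Block-distributed domain, so the timed work should be one task fan-out across locales followed by purely local vector updates.

```chapel
// Minimal sketch of a stream-global-style triad (illustrative, not the
// benchmark's actual source).
use BlockDist;

config const m = 1000000;    // problem size (illustrative default)
config const alpha = 3.0;    // triad scalar

proc main() {
  // Block-distribute the problem across all locales.
  const ProblemSpace = {1..m} dmapped Block(boundingBox={1..m});
  var A, B, C: [ProblemSpace] real;

  B = 1.0;
  C = 2.0;

  // The timed kernel: a promoted triad. Each locale updates only its own
  // chunk, so beyond the initial task fan-out there should be essentially
  // no communication or kernel involvement in the timed region.
  A = B + alpha * C;

  writeln("A[1] = ", A[1]);
}
```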

Since stream-ep wasn't really impacted, my guess is that our coforall+on idiom was hurt, but the coforall+on microbenchmark wasn't hurt under ugni. The microbenchmark probably ends up using small AMs, so it's possible that large AMs (which will be used for stream-global) were hurt. It's probably worth beefing up that benchmark to include a case that will trigger large AMs.
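For context, here is a minimal sketch of the coforall+on spawn-time microbenchmark idiom (illustrative names, not the actual test in the repo). The empty-body version exercises the small-AM path; presumably a variant that ships a sizable by-value argument bundle with each task is what would be needed to trigger large AMs.

```chapel
// Minimal sketch of a coforall+on spawn-time microbenchmark (illustrative,
// not the actual test in the repo).
use Time;

config const numTrials = 1000;

proc main() {
  var t: Timer;
  t.start();
  for 1..numTrials {
    // One remote task per locale; the on-body is empty, so this measures
    // pure remote task-spawn / active-message overhead (the small-AM case).
    coforall loc in Locales do
      on loc {
        // A "large AM" variant would capture or pass a sizable by-value
        // argument bundle here (an assumption about how to trigger large AMs).
      }
  }
  t.stop();
  writeln("avg coforall+on time (s): ", t.elapsed() / numTrials);
}
```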

When investigating this we should also try running on newer hardware (Skylake or newer) to see if the performance impact is smaller there.

ronawho commented 6 years ago

It looks like interference/contention from the progress thread is much more significant with the patches. Maybe context switching is slower now, and so the pthread/worker hosting the last task (that shares a core with the progress thread) is slowed down even further?

Fortunately we had already been looking into progress thread interference: https://github.com/chapel-lang/chapel/pull/8562

Running with CHPL_RT_COMM_UGNI_BLOCKING_CQ=y, I see even more significant improvements than before, and more importantly the regressions noted above are resolved.

Here's a quick summary of the differences:

| Benchmark | Perf gain |
| --- | --- |
| CoMD elegant | 10% improvement |
| CoMD llnl | 30% improvement |
| NPB EP | 90% improvement |
| RA-rmo | 20% improvement |
| PRK stencil block | 70% improvement |
| PRK stencil stencil | 70% improvement |
| stream-ep | 2% improvement |
| stream-global/prom | 10% improvement |
| single large reduction | 20% improvement |

| Benchmark | Perf loss |
| --- | --- |
| coforall+on microbench | 90% regression |
| coforall+on net AMO | 15% regression |
| many small reductions | 25% regression |
| lulesh | 10% regression |
| hpl | 5% regression |

Note that there are some notable regressions for the coforall+on and multi-trial reduction microbenchmarks, and minor ones for lulesh/hpl. The coforall+on and multi-trial reduction microbenchmarks do a lot of coforall+ons with very little work per trial. So it seems like coforall+on has gotten a little slower, but for benchmarks doing any real amount of work the decreased contention from the progress thread far outweighs the slight increase in spawning times.
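A rough sketch of the kind of pattern those multi-trial microbenchmarks exercise (illustrative, not the actual tests): each trial is a distributed reduction over a small array, so every trial pays the full coforall+on fan-out cost while doing very little per-locale work.

```chapel
// Rough sketch of a "many small reductions" pattern (illustrative).
use BlockDist, Time;

config const numTrials = 10000;
config const n = 1000;          // small per-trial problem size

proc main() {
  const D = {1..n} dmapped Block(boundingBox={1..n});
  var A: [D] int = 1;

  var t: Timer;
  var total = 0;
  t.start();
  for 1..numTrials do
    // Each trial launches a distributed reduction: a coforall+on across
    // locales with only a tiny amount of local work per locale.
    total += (+ reduce A);
  t.stop();

  writeln("checksum = ", total,
          ", avg reduction time (s) = ", t.elapsed() / numTrials);
}
```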

bradcray commented 6 years ago

I've said this to Elliot offline, but I think we should consider flipping the default for CHPL_RT_COMM_UGNI_BLOCKING_CQ.

ronawho commented 6 years ago

The investigation part of this is done. I've opened https://github.com/chapel-lang/chapel/issues/9067 to track the core problem (increased context switch cost increases progress thread interference).

https://github.com/chapel-lang/chapel/pull/9068 resolves this for ugni, but not for gasnet. ugni is the more important config, but at some point down the road we'll want to address this for gasnet as well.