TODO: look at total memory used and comm counts before/after that PR
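As a rough sketch (not something from the original investigation), the comm counts could be gathered by wrapping the kernel with the standard CommDiagnostics module; the Block-distributed setup below is just illustrative:

```chapel
// Illustrative sketch: count remote GETs/PUTs issued by the promoted
// stream kernel on each locale using the CommDiagnostics module.
use BlockDist, CommDiagnostics;

config const n = 16 * 1024 * 1024;
const Dom = {1..n} dmapped Block(boundingBox={1..n});
var A, B, C: [Dom] real;
B = 1.0;
C = 2.0;

startCommDiagnostics();
A = B + 3.0 * C;            // the promoted stream-triad kernel
stopCommDiagnostics();

// One entry per locale; GETs on non-zero locales are what we're after.
for (loc, d) in zip(Locales, getCommDiagnostics()) do
  writeln(loc, ": ", d);
```

Running something like this before and after the PR and diffing the per-locale counts would show whether the comm counts actually changed.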
I didn't see any obviously bad differences in the generated code before and after this change, but there's a lot of it to look at.
If we look into this again, I think we should just incrementally revert changes in this PR to try to identify the root cause. I don't understand the PR all that well, so that's probably something best left to @mppf, but we can figure that out if/when we look at this again.
It looks like there have been a few more (unannotated) regressions since the one in January 2018: https://chapel-lang.org/perf/16-node-xc/?startdate=2018/01/01&enddate=2019/03/21&configs=gnuugniqthreads,gnugasnetmpi&graphs=hpccpromotedstreamperfgbsn5723827200
Good timing -- new release-over-release graphs agree with you:
I think this effectively comes down to "surprisingly, promoted stream has communication in the kernel" -- I've opened https://github.com/chapel-lang/chapel/issues/12761 to track this
And for more context, it looks like #8073 introduced 4 GETs per task on non-zero locales, which I think is the cause of this regression.
What are they trying to "GET"?
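One way to find out (just a sketch, not from this thread) would be to turn on verbose comm diagnostics around the kernel so each remote operation gets reported as it is issued:

```chapel
// Illustrative sketch: report each remote operation issued by the
// promoted kernel via verbose comm diagnostics.
use BlockDist, CommDiagnostics;

config const n = 1000000;
const Dom = {1..n} dmapped Block(boundingBox={1..n});
var A, B, C: [Dom] real;
B = 1.0;
C = 2.0;

startVerboseComm();         // log every remote op from here on
A = B + 3.0 * C;            // promoted kernel under investigation
stopVerboseComm();
```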
It looks like we still have this issue & that enabling --cache-remote by default significantly reduced the impact. See e.g. https://chapel-lang.org/perf/16-node-xc/?startdate=2016/12/31&enddate=2021/10/10&configs=gnugasnetmpi&graphs=hpccpromotedstreamperfgbsn5723827200 .
I think that investigating #12761 to see if we can eliminate the communication entirely is the best next step here.
There was a non-trivial performance regression for stream-promoted under gn-mpi as a result of reworking iterator memory management (https://github.com/chapel-lang/chapel/pull/8073)
We missed this during our normal performance triage (we don't triage gn-mpi regularly), though it is visible: https://chapel-lang.org/perf/16-node-xc/?startdate=2018/01/06&enddate=2018/01/25&configs=gnuugniqthreads,gnugasnetmpi&graphs=hpccpromotedstreamperfgbsn5723827200
Since stream is such a simple yet fundamental/core benchmark, it would be good to understand the cause of the performance loss and address it.
It is worth noting that comm=ugni wasn't impacted, which is curious and may help in tracking down the root cause: it could be extra communication that ugni is fast enough to hide, or extra allocations that are cheaper under ugni because jemalloc won't return registered-heap memory to the system, so repeated allocations can be faster. It's also worth noting that stream-global wasn't impacted, so this is likely specific to the code path that builds promotion wrappers.
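For reference, here is a minimal sketch of the two forms being compared (assuming the standard Block distribution); only the first is lowered through a compiler-generated promotion wrapper:

```chapel
// Illustrative sketch contrasting the two kernels discussed above.
use BlockDist;

config const n = 1000000;
config const alpha = 3.0;
const Dom = {1..n} dmapped Block(boundingBox={1..n});
var A, B, C: [Dom] real;
B = 1.0;
C = 2.0;

// stream-promoted: whole-array expression, lowered via a promotion wrapper
A = B + alpha * C;

// stream-global style: explicit forall over the distributed domain,
// which reportedly did not regress
forall i in Dom do
  A[i] = B[i] + alpha * C[i];
```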
Note that we discovered this after the fact in the release-over-release timings: