TODO: look at total memory used and comm counts before/after that PR
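As a rough sketch (not something from the original investigation), the comm counts could be gathered by wrapping the kernel with the standard CommDiagnostics module; the Block-distributed setup below is just illustrative:

```chapel
// Illustrative sketch: count remote GETs/PUTs issued by the promoted
// stream kernel on each locale using the CommDiagnostics module.
use BlockDist, CommDiagnostics;

config const n = 16 * 1024 * 1024;
const Dom = {1..n} dmapped Block(boundingBox={1..n});
var A, B, C: [Dom] real;
B = 1.0;
C = 2.0;

startCommDiagnostics();
A = B + 3.0 * C;            // the promoted stream-triad kernel
stopCommDiagnostics();

// One entry per locale; GETs on non-zero locales are what we're after.
for (loc, d) in zip(Locales, getCommDiagnostics()) do
  writeln(loc, ": ", d);
```

Running something like this before and after the PR and diffing the per-locale counts would show whether the comm counts actually changed.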
I didn't see any obviously bad differences in the generated code before and after this change, but there's a lot of it to look at.
If we look into this again, I think we should just incrementally revert changes in this PR to try to identify the root cause. I don't understand the PR all that well, so that's probably something best left to @mppf, but we can figure that out if/when we look at this again.
It looks like there have been a few more (unannotated) regressions since the one in January 2018: https://chapel-lang.org/perf/16-node-xc/?startdate=2018/01/01&enddate=2019/03/21&configs=gnuugniqthreads,gnugasnetmpi&graphs=hpccpromotedstreamperfgbsn5723827200
Good timing -- new release-over-release graphs agree with you:
I think this effectively comes down to "surprisingly, promoted stream has communication in the kernel" -- I've opened https://github.com/chapel-lang/chapel/issues/12761 to track this
And for more context, it looks like #8073 introduced 4 GETs per task on non-zero locales, which I think is the cause of this regression.
What are they trying to "GET"?
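One way to find out (just a sketch, not from this thread) would be to turn on verbose comm diagnostics around the kernel so each remote operation gets reported as it is issued:

```chapel
// Illustrative sketch: report each remote operation issued by the
// promoted kernel via verbose comm diagnostics.
use BlockDist, CommDiagnostics;

config const n = 1000000;
const Dom = {1..n} dmapped Block(boundingBox={1..n});
var A, B, C: [Dom] real;
B = 1.0;
C = 2.0;

startVerboseComm();         // log every remote op from here on
A = B + 3.0 * C;            // promoted kernel under investigation
stopVerboseComm();
```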
It looks like we still have this issue & that enabling --cache-remote by default significantly reduced the impact. See e.g. https://chapel-lang.org/perf/16-node-xc/?startdate=2016/12/31&enddate=2021/10/10&configs=gnugasnetmpi&graphs=hpccpromotedstreamperfgbsn5723827200 .
I think that investigating #12761 to see if we can eliminate the communication entirely is the best next step here.
There was a non-trivial performance regression for stream-promoted under gn-mpi as a result of reworking iterator memory management (https://github.com/chapel-lang/chapel/pull/8073)
We missed this during our normal performance triage (we don't triage gn-mpi regularly), though it is visible: https://chapel-lang.org/perf/16-node-xc/?startdate=2018/01/06&enddate=2018/01/25&configs=gnuugniqthreads,gnugasnetmpi&graphs=hpccpromotedstreamperfgbsn5723827200
Since stream is such a simple yet fundamental/core benchmark, it would be good to understand the cause of the performance loss and address it.
It is worth noting that comm=ugni wasn't impacted, which is curious and may help in tracking down the root cause: it could be extra communication that ugni is fast enough to hide, or extra allocations that are cheaper under ugni because jemalloc won't return registered-heap memory to the system, so repeated allocations can be faster. It's also worth noting that stream-global wasn't impacted, so this is likely specific to the code path that builds promotion wrappers.
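For reference, here is a minimal sketch of the two forms being compared (assuming the standard Block distribution); only the first is lowered through a compiler-generated promotion wrapper:

```chapel
// Illustrative sketch contrasting the two kernels discussed above.
use BlockDist;

config const n = 1000000;
config const alpha = 3.0;
const Dom = {1..n} dmapped Block(boundingBox={1..n});
var A, B, C: [Dom] real;
B = 1.0;
C = 2.0;

// stream-promoted: whole-array expression, lowered via a promotion wrapper
A = B + alpha * C;

// stream-global style: explicit forall over the distributed domain,
// which reportedly did not regress
forall i in Dom do
  A[i] = B[i] + alpha * C[i];
```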
Note that we discovered this after the fact in the release-over-release timings: