sslattery closed this pull request 5 years ago
Merging #124 into master will increase coverage by 0.1%. The diff coverage is 100%.
```diff
@@            Coverage Diff            @@
##           master     #124     +/-  ##
========================================
+ Coverage    57.5%    57.6%    +0.1%
========================================
  Files          38       38
  Lines        2273     2282       +9
========================================
+ Hits         1307     1316       +9
  Misses        966      966
```
| Flag | Coverage Δ | |
|---|---|---|
| #clang | 67.8% <100%> (+0.1%) | :arrow_up: |
| #doxygen | 18.3% <ø> (ø) | :arrow_up: |
| #gcc | 97.5% <100%> (ø) | :arrow_up: |
| Impacted Files | Coverage Δ | |
|---|---|---|
| core/src/Cabana_CommunicationPlan.hpp | 94.2% <100%> (+0.1%) | :arrow_up: |
| core/src/Cabana_Halo.hpp | 95.4% <100%> (ø) | :arrow_up: |
| core/src/Cabana_LinkedCellList.hpp | 89.1% <100%> (+0.2%) | :arrow_up: |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 70179e8...b49a406. Read the comment docs.
I didn't profile, but in the past codes I have profiled, I have never had a case where the duplication from `ScatterView` didn't perform better than using atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.
Ok, here are some speedups on Summit for communication plan construction, where I expected them. The speedup listed is the fractional improvement over the original, so 0.53 means a 53% speedup and -0.013 would mean a 1.3% slowdown:
**cuda_host_distributor_fast_create** (Version 0 vs. Version 1, 64 nodes)

| num_rank | Version 0 ave | num_rank | Version 1 ave | Speed Up |
|---|---|---|---|---|
| 384 | 90471.3 | 384 | 52303.6 | 0.42187633 |
| 384 | 89138.9 | 384 | 51588.1 | 0.421261649 |
| 384 | 88945.9 | 384 | 51603.2 | 0.419836103 |
| 384 | 88367.8 | 384 | 51286.1 | 0.419629096 |
| 384 | 84571.4 | 384 | 49329.4 | 0.416712979 |
| 384 | 80224.6 | 384 | 47086 | 0.413072798 |
| 384 | 52433.2 | 384 | 35578.3 | 0.321454727 |

**cuda_host_distributor_slow_create** (Version 0 vs. Version 1, 64 nodes)

| num_rank | Version 0 ave | num_rank | Version 1 ave | Speed Up |
|---|---|---|---|---|
| 384 | 94856 | 384 | 58933.8 | 0.378702454 |
| 384 | 94638.9 | 384 | 59018.6 | 0.376381171 |
| 384 | 94584.7 | 384 | 58586.8 | 0.380589038 |
| 384 | 93999.9 | 384 | 58849.6 | 0.37393976 |
| 384 | 91489.8 | 384 | 57068.2 | 0.376234291 |
| 384 | 90421.3 | 384 | 53945.7 | 0.403396102 |
| 384 | 71974.5 | 384 | 49011 | 0.319050497 |
So, pretty good speedups here. I also saw the same speedups for the halo construction and no slowdowns elsewhere.
> I didn't profile but for past codes I have profiled I have never had a case where the duplication from `ScatterView` didn't perform better than using atomics on a CPU. On a GPU it defaults to atomics which is our preferred behavior.
FWIW, I've seen cases where our hand-written implementation is definitely faster than `ScatterView`, but I agree: I fully expect `ScatterView` to be faster than atomics on a CPU for any sane use case.
Yeah, so after investigating, the duplication from `ScatterView` was actually slower in the halo scatter: in that case there are often so few collisions that the extra duplication became overhead, so I removed it.
Also, on our MD neighbor list benchmark with parameters for water molecules, I see a small speedup of a few percent, which demonstrates the changes to the `LinkedCellList`. You can use these settings for water: `./NeighborListMDPerfTest 3.0 1000000 1.0`. So, a small improvement, but something.
Yes - if we find places where, for example, the Halo scatter operation has tons of collisions and could benefit from duplication on the CPU, then we could add `ScatterView` back in and develop a collision metric based on the communication plan (since we know all of the collisions once the plan is generated) to pick the implementation at run time.
@rfbird when you said that the handwritten one is faster than `ScatterView`, is it for CPU or GPU, or both?
> @rfbird when you said that the handwritten one is faster than `ScatterView`, is it for CPU or GPU, or both?
Only CPU (there is basically no overhead to `ScatterView` on GPU).
Could the overhead of `ScatterView` be from creating/destroying temporary arrays on the fly?
Yes, the overhead would be from that as well as the recombination at the end.
> Could the overhead of `ScatterView` be from creating/destroying temporary arrays on the fly?
I think it's nothing to be overly concerned about right now. Two things:

1. My comparison wasn't apples to apples, so the data is only so-so.
2. Kokkos actually has a performance test for this, which means they're aware of and happy with the status of the implementation: https://github.com/kokkos/kokkos/blob/master/containers/performance_tests/TestScatterView.hpp
I've tested ScatterView in ExaMiniMD, see https://github.com/ECP-copa/ExaMiniMD/pull/21. There was little difference between creating temporary ScatterViews vs using a persistent allocation. We also compared performance to a hand-coded version, and for 1D arrays it was close. There may be more optimizations needed for 2D arrays though.
FYI, see Christian's comment at the end of https://github.com/kokkos/kokkos/issues/1390.
Good point. Overall I'm happy to rely on this (as I also do for a LANL production app); hopefully it will only get better! If we see it really causing an issue I'll do something similar to Christian's comments in kokkos/kokkos#1390 and file a PR.
Agreed - and we have our own performance tests to track these changes (although not automated yet). I'm very happy with a 2x speedup for the communication stuff.
There were several instances in the code where atomics were used. Almost all of them have been replaced with `Kokkos::ScatterView` where possible to enhance performance on non-GPU architectures. There are still several instances of `Kokkos::atomic_fetch_add` in the code, but these are needed to implement the current algorithms and therefore cannot be changed without some algorithmic research.

Closes #82