sslattery closed this pull request 5 years ago
Merging #124 into master will increase coverage by 0.1%. The diff coverage is 100%.
```diff
@@            Coverage Diff            @@
##           master     #124     +/-  ##
========================================
+ Coverage    57.5%    57.6%    +0.1%
========================================
  Files          38       38
  Lines        2273     2282       +9
========================================
+ Hits         1307     1316       +9
  Misses        966      966
```
| Flag | Coverage Δ | |
|---|---|---|
| #clang | 67.8% <100%> (+0.1%) | :arrow_up: |
| #doxygen | 18.3% <ø> (ø) | :arrow_up: |
| #gcc | 97.5% <100%> (ø) | :arrow_up: |
| Impacted Files | Coverage Δ | |
|---|---|---|
| core/src/Cabana_CommunicationPlan.hpp | 94.2% <100%> (+0.1%) | :arrow_up: |
| core/src/Cabana_Halo.hpp | 95.4% <100%> (ø) | :arrow_up: |
| core/src/Cabana_LinkedCellList.hpp | 89.1% <100%> (+0.2%) | :arrow_up: |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 70179e8...b49a406. Read the comment docs.
I didn't profile, but in the past codes I have profiled, I have never had a case where the duplication from `ScatterView` didn't perform better than using atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.
Ok, here are some speedups on Summit for communication plan construction, where I expected them. The speedup listed is the fractional improvement over the original, so 0.53 means a 53% speedup and -0.013 would mean a 1.3% slowdown:
**cuda_host_distributor_fast_create** (Version 0 vs. Version 1, 64 nodes)

| num_rank | Version 0 ave | num_rank | Version 1 ave | Speed Up |
|---|---|---|---|---|
| 384 | 90471.3 | 384 | 52303.6 | 0.42187633 |
| 384 | 89138.9 | 384 | 51588.1 | 0.421261649 |
| 384 | 88945.9 | 384 | 51603.2 | 0.419836103 |
| 384 | 88367.8 | 384 | 51286.1 | 0.419629096 |
| 384 | 84571.4 | 384 | 49329.4 | 0.416712979 |
| 384 | 80224.6 | 384 | 47086 | 0.413072798 |
| 384 | 52433.2 | 384 | 35578.3 | 0.321454727 |

**cuda_host_distributor_slow_create** (Version 0 vs. Version 1, 64 nodes)

| num_rank | Version 0 ave | num_rank | Version 1 ave | Speed Up |
|---|---|---|---|---|
| 384 | 94856 | 384 | 58933.8 | 0.378702454 |
| 384 | 94638.9 | 384 | 59018.6 | 0.376381171 |
| 384 | 94584.7 | 384 | 58586.8 | 0.380589038 |
| 384 | 93999.9 | 384 | 58849.6 | 0.37393976 |
| 384 | 91489.8 | 384 | 57068.2 | 0.376234291 |
| 384 | 90421.3 | 384 | 53945.7 | 0.403396102 |
| 384 | 71974.5 | 384 | 49011 | 0.319050497 |
So, pretty good speedups here. I also saw the same speedups for the halo construction and no slowdowns elsewhere.
> I didn't profile but for past codes I have profiled I have never had a case where the duplication from `ScatterView` didn't perform better than using atomics on a CPU. On a GPU it defaults to atomics which is our preferred behavior.
FWIW, I've seen cases where our hand-written implementation is definitely faster than `ScatterView`, but I agree: I fully expect `ScatterView` to be faster than atomics on a CPU for any sane use case.
Yeah, so after investigating, the duplication from `ScatterView` was actually slower in the halo scatter: in that case there are often so few collisions that the extra duplication became overhead, so I removed it.
Also, on our MD neighbor list benchmark with parameters for water molecules, I see a small speedup of a few percent, which demonstrates the changes to the `LinkedCellList`. You can use these settings for water: `./NeighborListMDPerfTest 3.0 1000000 1.0`. So, a small improvement, but something.
Yes - if we find places where, for example, the Halo scatter operation has tons of collisions and could benefit from duplication on the CPU, then we could add `ScatterView` back in and develop a collision metric based on the communication plan (since we know all of the collisions once the plan is generated) to pick the implementation at run time.
@rfbird when you said that the handwritten one is faster than `ScatterView`, is it for CPU or GPU, or both?
> @rfbird when you said that the handwritten one is faster than `ScatterView`, is it for CPU or GPU, or both?
Only CPU (there is basically no overhead to `ScatterView` on GPU).
Could the overhead of `ScatterView` be from creating/destroying temporary arrays on the fly?
Yes, the overhead would be from that as well as the recombination at the end.
> Could the overhead of `ScatterView` be from creating/destroying temporary arrays on the fly?
I think it's nothing to be overly concerned about right now. Two things:

1. My comparison wasn't apples to apples, so the data is only so-so.
2. Kokkos actually has a performance test for this, which means they're aware of and happy with the status of the implementation: https://github.com/kokkos/kokkos/blob/master/containers/performance_tests/TestScatterView.hpp
I've tested ScatterView in ExaMiniMD, see https://github.com/ECP-copa/ExaMiniMD/pull/21. There was little difference between creating temporary ScatterViews vs using a persistent allocation. We also compared performance to a hand-coded version, and for 1D arrays it was close. There may be more optimizations needed for 2D arrays though.
FYI, see Christian's comment at the end of https://github.com/kokkos/kokkos/issues/1390.
Good point. Overall I'm happy to rely on this (as I also do for a LANL production app); hopefully it will only get better! If we see it really causing an issue I'll do something similar to Christian's comments in kokkos/kokkos#1390 and file a PR.
Agreed - and we have our own performance tests to track these changes (although not automated yet). I'm very happy with a 2x speedup for the communication stuff.
There were several instances in the code where atomics were used. Almost all of them have been replaced with `Kokkos::ScatterView` where possible to enhance performance on non-GPU architectures. There are still several instances of `Kokkos::atomic_fetch_add` in the code, but these are needed to implement the current algorithms and therefore cannot be changed without some algorithmic research.

Closes #82