ECP-copa / Cabana

Performance-portable library for particle-based simulations

User ScatterView #124

Closed sslattery closed 5 years ago

sslattery commented 5 years ago

There were several places in the code where atomics were used. Nearly all of them have been replaced with Kokkos::ScatterView where possible to improve performance on non-GPU architectures. A few instances of Kokkos::atomic_fetch_add remain, but these are needed to implement the current algorithms and therefore cannot be changed without some algorithmic research.

Closes #82
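
For reference, here is a minimal sketch of the atomics-to-ScatterView pattern described above. It is illustrative only, not Cabana code, and assumes a reasonably recent Kokkos where create_scatter_view is available; older versions construct the ScatterView directly from the View.

```cpp
// Minimal sketch of the atomics-to-ScatterView pattern (illustrative only).
#include <Kokkos_Core.hpp>
#include <Kokkos_ScatterView.hpp>

void countPerBin( Kokkos::View<int*> counts, Kokkos::View<int*> bin_ids )
{
    // Before: one atomic update per particle.
    // Kokkos::parallel_for( bin_ids.extent( 0 ), KOKKOS_LAMBDA( const int p ) {
    //     Kokkos::atomic_fetch_add( &counts( bin_ids( p ) ), 1 );
    // } );

    // After: scatter through a ScatterView and combine the contributions once.
    // On host backends this duplicates the target per thread; on GPU backends
    // it falls back to atomic updates.
    auto counts_sv = Kokkos::Experimental::create_scatter_view( counts );
    Kokkos::parallel_for( bin_ids.extent( 0 ), KOKKOS_LAMBDA( const int p ) {
        auto access = counts_sv.access();
        access( bin_ids( p ) ) += 1;
    } );
    Kokkos::Experimental::contribute( counts, counts_sv );
}
```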

codecov-io commented 5 years ago

Codecov Report

Merging #124 into master will increase coverage by 0.1%. The diff coverage is 100%.


@@           Coverage Diff            @@
##           master    #124     +/-   ##
========================================
+ Coverage    57.5%   57.6%   +0.1%     
========================================
  Files          38      38             
  Lines        2273    2282      +9     
========================================
+ Hits         1307    1316      +9     
  Misses        966     966
| Flag | Coverage Δ |
| --- | --- |
| #clang | 67.8% <100%> (+0.1%) ↑ |
| #doxygen | 18.3% <ø> (ø) ↑ |
| #gcc | 97.5% <100%> (ø) ↑ |

| Impacted Files | Coverage Δ |
| --- | --- |
| core/src/Cabana_CommunicationPlan.hpp | 94.2% <100%> (+0.1%) ↑ |
| core/src/Cabana_Halo.hpp | 95.4% <100%> (ø) ↑ |
| core/src/Cabana_LinkedCellList.hpp | 89.1% <100%> (+0.2%) ↑ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 70179e8...b49a406.

sslattery commented 5 years ago

I didn't profile this, but in past codes I have profiled, I've never had a case where the duplication from ScatterView didn't perform better than atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.
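
For context on those defaults, here is a hedged sketch of how the duplication and contribution strategies map onto ScatterView's optional template parameters. The names below follow Kokkos::Experimental; exact spellings and parameter forms vary across Kokkos versions, so treat this as an illustration rather than the API used in this PR.

```cpp
// Illustrative type aliases only, not code from this PR. ScatterView accepts
// optional Op / Duplication / Contribution parameters after the data type,
// layout, and execution space.
#include <Kokkos_ScatterView.hpp>

namespace KE = Kokkos::Experimental;

// Host default: per-thread duplicated copies, combined later without atomics.
using HostSum = KE::ScatterView<double*, Kokkos::LayoutRight,
                                Kokkos::DefaultHostExecutionSpace,
                                KE::ScatterSum, KE::ScatterDuplicated,
                                KE::ScatterNonAtomic>;

// GPU default: a single copy updated with atomics (no duplication overhead).
using DeviceSum = KE::ScatterView<double*, Kokkos::LayoutLeft,
                                  Kokkos::DefaultExecutionSpace,
                                  KE::ScatterSum, KE::ScatterNonDuplicated,
                                  KE::ScatterAtomic>;
```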

sslattery commented 5 years ago

OK, here are some speedups on Summit for communication plan construction, where I expected them. The speedup listed is the fraction of the original time that was eliminated, i.e. (old - new) / old, so 0.53 means a 53% speedup and -0.013 would mean a 1.3% slowdown:

cuda_host_distributor_fast_create (64 nodes, num_rank = 384 for both versions)

| Version 0 ave | Version 1 ave | Speed Up |
| --- | --- | --- |
| 90471.3 | 52303.6 | 0.42187633 |
| 89138.9 | 51588.1 | 0.421261649 |
| 88945.9 | 51603.2 | 0.419836103 |
| 88367.8 | 51286.1 | 0.419629096 |
| 84571.4 | 49329.4 | 0.416712979 |
| 80224.6 | 47086 | 0.413072798 |
| 52433.2 | 35578.3 | 0.321454727 |

cuda_host_distributor_slow_create (64 nodes, num_rank = 384 for both versions)

| Version 0 ave | Version 1 ave | Speed Up |
| --- | --- | --- |
| 94856 | 58933.8 | 0.378702454 |
| 94638.9 | 59018.6 | 0.376381171 |
| 94584.7 | 58586.8 | 0.380589038 |
| 93999.9 | 58849.6 | 0.37393976 |
| 91489.8 | 57068.2 | 0.376234291 |
| 90421.3 | 53945.7 | 0.403396102 |
| 71974.5 | 49011 | 0.319050497 |

So, pretty good speedups here. I also saw the same speedups for the halo construction and no slowdowns elsewhere.

rfbird commented 5 years ago

> I didn't profile this, but in past codes I have profiled, I've never had a case where the duplication from ScatterView didn't perform better than atomics on a CPU. On a GPU it defaults to atomics, which is our preferred behavior.

FWIW, I've seen cases where our hand-written implementation is definitely faster than ScatterView, but I agree: I fully expect ScatterView to be faster than atomics on CPU for any sane use case.

sslattery commented 5 years ago

Yeah, so after investigating, the duplication from ScatterView was actually slower in the halo scatter: in that case there are often so few collisions that the extra duplication became an overhead, so I removed it there.

sslattery commented 5 years ago

Also, on our MD neighbor list benchmark with parameters for water molecules, I see a small speedup of a few percent, which demonstrates the effect of the changes to the LinkedCellList. You can use these settings for water:

./NeighborListMDPerfTest 3.0 1000000 1.0

So, small improvement but something.

sslattery commented 5 years ago

Yes - if we find places where, for example, the Halo scatter operation has tons of collisions and could use duplication on the CPU, then we could add ScatterView back in and develop a collision metric based on the communication plan (we know all the collisions once the plan is generated) to pick the implementation at run time.
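
A purely hypothetical sketch of that run-time selection follows. collisionFraction, scatterWithDuplication, and scatterWithAtomics are made-up helper names, not Cabana API, and the 10% threshold is arbitrary.

```cpp
// Hypothetical sketch only: pick the scatter implementation at run time from a
// collision metric the communication plan could provide once it is built.
template <class HaloType, class AoSoAType>
void scatterAdaptive( const HaloType& halo, AoSoAType& aosoa )
{
    // Fraction of imported elements that collide on the same local element
    // (made-up helper; the plan knows every collision after construction).
    const double collision_fraction = collisionFraction( halo );

    if ( collision_fraction > 0.1 )
    {
        // Many collisions: ScatterView duplication is worth its memory and
        // recombination cost on CPU backends.
        scatterWithDuplication( halo, aosoa );
    }
    else
    {
        // Few collisions: plain atomic updates are cheaper.
        scatterWithAtomics( halo, aosoa );
    }
}
```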

guangyechen commented 5 years ago

@rfbird when you said the hand-written one was faster than scatter_view, was that for CPU or GPU, or both?

rfbird commented 5 years ago

> @rfbird when you said the hand-written one was faster than scatter_view, was that for CPU or GPU, or both?

Only CPU (there is basically no overhead to ScatterView on GPU).

guangyechen commented 5 years ago

Might the overhead of scatter_view come from creating/destroying temporary arrays on the fly?

sslattery commented 5 years ago

Yes, the overhead would come from that as well as from the recombination at the end.

rfbird commented 5 years ago

> Might the overhead of scatter_view come from creating/destroying temporary arrays on the fly?

I think it's nothing to be overly concerned about right now. Two things:

1) My comparison wasn't apples to apples, so the data is only so-so.
2) Kokkos actually has a performance test for this, which means they're aware of it and happy with the status of the implementation: https://github.com/kokkos/kokkos/blob/master/containers/performance_tests/TestScatterView.hpp

stanmoore1 commented 5 years ago

I've tested ScatterView in ExaMiniMD, see https://github.com/ECP-copa/ExaMiniMD/pull/21. There was little difference between creating temporary ScatterViews vs using a persistent allocation. We also compared performance to a hand-coded version, and for 1D arrays it was close. There may be more optimizations needed for 2D arrays though.
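
For anyone curious what the "persistent allocation" variant looks like, here is a minimal sketch with illustrative names (not ExaMiniMD or Cabana code): keep one ScatterView alive across steps, reset() it instead of reallocating, and contribute() into the destination view each step.

```cpp
// Minimal sketch of reusing a persistent ScatterView across timesteps
// (illustrative only). reset() zeroes the duplicated copies in place;
// contribute() combines them back into the destination view.
#include <Kokkos_Core.hpp>
#include <Kokkos_ScatterView.hpp>

struct ForceScatter
{
    Kokkos::View<double*> f;                          // per-particle result
    Kokkos::Experimental::ScatterView<double*> f_sv;  // allocated once, reused

    ForceScatter( const int n ) : f( "f", n ), f_sv( f ) {}

    void step( Kokkos::View<int*> pair_i, Kokkos::View<int*> pair_j )
    {
        f_sv.reset();    // zero the duplicates instead of reallocating them
        auto sv = f_sv;  // local copy so the lambda captures it by value
        Kokkos::parallel_for( pair_i.extent( 0 ), KOKKOS_LAMBDA( const int p ) {
            auto access = sv.access();
            access( pair_i( p ) ) += 1.0;  // stand-in for a real pairwise force
            access( pair_j( p ) ) -= 1.0;
        } );
        Kokkos::Experimental::contribute( f, sv );
    }
};
```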

stanmoore1 commented 5 years ago

FYI, see Christian's comment at the end of https://github.com/kokkos/kokkos/issues/1390.

rfbird commented 5 years ago

> FYI, see Christian's comment at the end of kokkos/kokkos#1390.

Good point. Overall I'm happy to rely on this (as I also do for a LANL production app); hopefully it will only get better! If we see it really causing an issue I'll do something similar to Christian's comments in #1390 and file a PR.

sslattery commented 5 years ago

Agreed - and we have our own performance tests to track these changes (although not automated yet). I'm very happy with a 2x speedup for the communication stuff.