NaluCFD / Nalu

Nalu: a generalized, unstructured, massively parallel low-Mach flow code designed to support a variety of open applications of interest, built on the Sierra Toolkit and the Trilinos Tpetra solver stack. The code base is released under the open-source 3-clause BSD license. See LICENSE for more information.
https://github.com/NaluCFD/Nalu

ECP 5: Deploy production sliding mesh capability with linear solver benchmarking #5

Closed: spdomin closed this issue 7 years ago

spdomin commented 7 years ago

Activities:

  1. Improve baseline sliding mesh capability at curved surfaces.
  2. Evaluate ATDM-based parallel search methods.
  3. Establish matrix set-up cost timings.
  4. Evaluate possible lagging of matrix update.
  5. Evaluate reduction of matrix system by omitting moving block column entries in favor of multiple matrix assembly/solve iterations.

spdomin commented 7 years ago

@srajama1 I am working on obtaining a patch for the new ATDM-based search. Once I have that, you can help out on establishing search efficiencies.

srajama1 commented 7 years ago

Thanks, I was talking to Nate about it. Let me know how I can help.

spdomin commented 7 years ago

@mbarone81 let's add your overset work to this as well with the hope that this milestone will define the path forward for blade motion.

spdomin commented 7 years ago

@alanw0, take a look at commit dbd1b958a52f82b0d3209ccb4b4d7c621016e62d for a new test to start profiling NonConformalManager ghosting costs. This should replace the effort on edgeContact3D.

alanw0 commented 7 years ago

ok got it, I'll take a look at the dgNonConformalEdgeCylinder test.

spdomin commented 7 years ago

@srajama1, could you please keep track of the ATDM-based search and test it once it is confirmed that point/box has been deployed? I need to start working on the Kokkos algorithm structure task. Thanks.

spdomin commented 7 years ago

@NaluCFD/sliding I have the higher-order DG scheme working. It also naturally allows for the P=1/P=2 interface. I will perform some more P=2 sliding mesh sims and commit soon.

spdomin commented 7 years ago

Hex8/Hex27 or Hex27/Hex27 is now completed: commit d2adbe82d786c3ac8fa9221730788923a6a184f9

spdomin commented 7 years ago

@NaluCFD/sliding, here is a sample timing for a 150 million element 1024 job (run 100 steps with two Picard loops).

32 nodes (36 cores per node):

*******************************************************
Simulation Shall Complete: time/timestep: 0.0102493/100
*******************************************************
-------------------------------- 
Begin Timer Overview for Realm: realm_1
-------------------------------- 
Timing for Eq: myLowMach
             init --    avg: 0.000184042    min: 4.22001e-05    max: 0.00332427
         assemble --    avg: 0  min: 0  max: 0
    load_complete --    avg: 0  min: 0  max: 0
            solve --    avg: 0  min: 0  max: 0
    precond setup --    avg: 0  min: 0  max: 0
             misc --    avg: 18.7294    min: 15.4274    max: 19.9338
Timing for Eq: MomentumEQS
             init --    avg: 431.72     min: 428.433    max: 451.232
         assemble --    avg: 482.962    min: 465.899    max: 576.557
    load_complete --    avg: 123.853    min: 28.7052    max: 133.94
            solve --    avg: 583.758    min: 583.597    max: 589.672
    precond setup --    avg: 0.0177482  min: 0.011241   max: 0.0544529
             misc --    avg: 67.2906    min: 66.379     max: 68.3661
linear iterations --    avg: 11.79  min: 7  max: 34
Timing for Eq: ContinuityEQS
             init --    avg: 201.045    min: 183.683    max: 204.34
         assemble --    avg: 151.122    min: 142.574    max: 177.107
    load_complete --    avg: 30.9594    min: 4.86764    max: 33.8108
            solve --    avg: 3142.5     min: 3142.45    max: 3155.93
    precond setup --    avg: 22.3307    min: 22.329     max: 22.3356
             misc --    avg: 97.5644    min: 83.7688    max: 98.3748
linear iterations --    avg: 38.11  min: 27     max: 50
Timing for Eq: myZ
             init --    avg: 190.661    min: 190.292    max: 191.43
         assemble --    avg: 179.576    min: 160.99     max: 204.665
    load_complete --    avg: 30.0869    min: 4.99243    max: 32.84
            solve --    avg: 58.9669    min: 58.8958    max: 68.3101
    precond setup --    avg: 0.00417599     min: 0.00237584     max: 0.0218868
             misc --    avg: 18.8389    min: 18.4308    max: 19.505
linear iterations --    avg: 8.28   min: 6  max: 10
Timing for IO: 
   io create mesh --    avg: 0.363296   min: 0.191619   max: 0.527161
 io output fields --    avg: 57.5503    min: 56.8373    max: 58.4367
 io populate mesh --    avg: 4.6819     min: 4.6608     max: 4.70148
 io populate fd   --    avg: 0.256733   min: 0.0831389  max: 0.430451
Timing for connectivity/finalize lysys: 
         eqs init --    avg: 823.427    min: 820.799    max: 827.33
Timing for property evaluation:         
            props --    avg: 0.0918778  min: 0.0545573  max: 0.310776
Timing for Contact: 
       contact bc --    avg: 15.1264    min: 14.6959    max: 18.4114

Timing for Simulation: nprocs= 1152
           main() --    avg: 5880.26    min: 5840.17    max: 5887.04
Memory Overview: 
nalu memory: total (over all cores) current/high-water mark=       513.083 G      536.876 G
nalu memory:   min (over all cores) current/high-water mark=       256.641 M      266.148 M
nalu memory:   max (over all cores) current/high-water mark=       1.89328 G      2.04586 G
Min High-water memory usage 266.1 MB
Avg High-water memory usage 477.2 MB
Max High-water memory usage 2095.0 MB

Min Available memory per processor 1789.2 MB
Avg Available memory per processor 1789.2 MB
Max Available memory per processor 1789.2 MB

Min No-output time 5787.6 sec
Avg No-output time 5829.7 sec
Max No-output time 5833.2 sec

STKPERF: Total Time: 5841.7

STKPERF: Current memory: 357113856 (340.6 M)
STKPERF: Memory high water: 374874112 (357.5 M)

64 nodes (36 cores per node):

*******************************************************
Simulation Shall Complete: time/timestep: 0.0102493/100
*******************************************************
-------------------------------- 
Begin Timer Overview for Realm: realm_1
-------------------------------- 
Timing for Eq: myLowMach
             init --    avg: 9.29431e-05    min: 3.31402e-05    max: 0.000857592
         assemble --    avg: 0  min: 0  max: 0
    load_complete --    avg: 0  min: 0  max: 0
            solve --    avg: 0  min: 0  max: 0
    precond setup --    avg: 0  min: 0  max: 0
             misc --    avg: 10.332     min: 7.72162    max: 11.2043
Timing for Eq: MomentumEQS
             init --    avg: 239.982    min: 237.399    max: 253.78
         assemble --    avg: 240.033    min: 231.406    max: 314.129
    load_complete --    avg: 97.0818    min: 21.3585    max: 102.162
            solve --    avg: 330.231    min: 330.093    max: 330.599
    precond setup --    avg: 0.00849527     min: 0.00510311     max: 0.0406508
             misc --    avg: 34.181     min: 33.6794    max: 34.9966
linear iterations --    avg: 12.285     min: 7  max: 34
Timing for Eq: ContinuityEQS
             init --    avg: 119.214    min: 106.893    max: 121.829
         assemble --    avg: 72.407     min: 70.6553    max: 93.4898
    load_complete --    avg: 24.731     min: 3.5701     max: 26.1621
            solve --    avg: 1910.76    min: 1910.73    max: 1911.08
    precond setup --    avg: 12.9936    min: 12.9926    max: 12.9988
             misc --    avg: 44.6586    min: 44.1545    max: 45.364
linear iterations --    avg: 42.01  min: 32     max: 50
Timing for Eq: myZ
             init --    avg: 108.232    min: 107.934    max: 108.653
         assemble --    avg: 81.3621    min: 79.6523    max: 101.46
    load_complete --    avg: 23.6941    min: 3.58118    max: 25.093
            solve --    avg: 35.8702    min: 35.8191    max: 36.088
    precond setup --    avg: 0.00200497     min: 0.00114703     max: 0.0113389
             misc --    avg: 9.74541    min: 9.52759    max: 10.3505
linear iterations --    avg: 9.445  min: 6  max: 10
Timing for IO: 
   io create mesh --    avg: 0.748922   min: 0.388414   max: 0.995598
 io output fields --    avg: 26.2067    min: 25.7432    max: 26.8175
 io populate mesh --    avg: 4.66858    min: 4.6314     max: 4.70515
 io populate fd   --    avg: 0.406266   min: 0.152544   max: 0.774392
Timing for connectivity/finalize lysys: 
         eqs init --    avg: 467.428    min: 465.682    max: 469.464
Timing for property evaluation:         
            props --    avg: 0.0555738  min: 0.0349991  max: 0.15305
Timing for Contact: 
       contact bc --    avg: 11.7654    min: 11.5483    max: 14.2255

Timing for Simulation: nprocs= 2304
           main() --    avg: 3422.38    min: 3401.18    max: 3425.24
Memory Overview: 
nalu memory: total (over all cores) current/high-water mark=       645.294 G      674.343 G
nalu memory:   min (over all cores) current/high-water mark=       185.172 M      193.027 M
nalu memory:   max (over all cores) current/high-water mark=       1.07852 G      1.14824 G
Min High-water memory usage 193.0 MB
Avg High-water memory usage 299.7 MB
Max High-water memory usage 1175.8 MB

Min Available memory per processor 1789.2 MB
Avg Available memory per processor 1789.2 MB
Max Available memory per processor 1789.2 MB

Min No-output time 3396.1 sec
Avg No-output time 3398.5 sec
Max No-output time 3401.0 sec

STKPERF: Total Time: 3420.3
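
The two logs above can be compared as a strong-scaling pair, since the problem is the same and the process count doubles from 1152 to 2304. The following is a minimal Python sketch of that comparison, not part of Nalu. The function name `strong_scaling` and the choice of timers are illustrative assumptions; the timings are the `avg` values copied from the logs.

```python
def strong_scaling(t_base, t_scaled, p_base, p_scaled):
    """Return (speedup, parallel efficiency) between two runs of one problem."""
    speedup = t_base / t_scaled
    efficiency = speedup / (p_scaled / p_base)
    return speedup, efficiency


# avg timings in seconds: (32-node run, 64-node run), copied from the logs
timers = {
    "main()": (5880.26, 3422.38),
    "MomentumEQS assemble": (482.962, 240.033),
    "ContinuityEQS solve": (3142.5, 1910.76),
    "eqs init": (823.427, 467.428),
}

for name, (t32, t64) in timers.items():
    s, e = strong_scaling(t32, t64, p_base=1152, p_scaled=2304)
    print(f"{name:22s} speedup {s:.2f}  efficiency {e:.0%}")
```

With these inputs, overall `main()` speeds up by roughly 1.72x for a 2x increase in processes. Momentum assembly scales almost ideally, while the continuity solve scales less well; note that its average linear iteration count also rises from 38.11 to 42.01 between the two runs.
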
alanw0 commented 7 years ago

It's interesting to look at the details of the timings, particularly the difference between min and max on a given line, which indicates imbalance. It's hard to say, though, whether it's an imbalance of elements, of work (e.g. localized work like search/contact), or of ownership of shared nodes. The last would affect linear-solver work, since the number of owned nodes tends to correspond to the number of matrix rows per proc.

In these timings the assemble looks pretty well balanced, which may indicate the elements are well balanced. The solve time looks balanced, but that could be because it includes sync points (like dots/norms) which force the overall solve time to appear balanced. The load-complete time is distinctly imbalanced, which may be the most direct symptom of an imbalance among shared nodes causing uneven numbers of matrix rows per proc.
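
One way to put a number on the min/max gap described here is a relative spread, (max - min) / avg, computed per timer line. The sketch below is a hypothetical post-processing helper, not a Nalu utility. The regular expression assumes the `name -- avg: ... min: ... max: ...` layout seen in the timer overview, and the names `parse_timer` and `spread` are made up for illustration.

```python
import re

# Matches timer lines such as:
#   "    load_complete --    avg: 123.853    min: 28.7052    max: 133.94"
TIMER_RE = re.compile(
    r"^\s*(?P<name>.+?)\s+--\s+avg:\s*(?P<avg>\S+)"
    r"\s+min:\s*(?P<min>\S+)\s+max:\s*(?P<max>\S+)"
)


def parse_timer(line):
    """Return (name, avg, min, max), or None if the line is not a timer line."""
    m = TIMER_RE.match(line)
    if m is None:
        return None
    return m["name"], float(m["avg"]), float(m["min"]), float(m["max"])


def spread(avg, tmin, tmax):
    """(max - min) / avg; 0 means every rank reported the same time."""
    return (tmax - tmin) / avg if avg > 0 else 0.0


# MomentumEQS lines copied from the 32-node log
sample = """\
         assemble --    avg: 482.962    min: 465.899    max: 576.557
    load_complete --    avg: 123.853    min: 28.7052    max: 133.94
            solve --    avg: 583.758    min: 583.597    max: 589.672
"""

for line in sample.splitlines():
    parsed = parse_timer(line)
    if parsed is not None:
        name, avg, tmin, tmax = parsed
        print(f"{name:15s} spread {spread(avg, tmin, tmax):.3f}")
```

On these three lines the spread is roughly 0.85 for load_complete against about 0.01 for solve, which matches the reading above. The metric alone cannot tell element imbalance from row imbalance; it only ranks which timers deserve a closer look.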

spdomin commented 7 years ago

Exactly. This is a hybrid mesh. In general, for these types of meshes we find almost perfect elemental balances while the node balance is generally poor. Aero found this as well and changed the manner by which node ownership is processed (round robin rather than lowest rank). We probably can consider something similar to make sure that the rows are well balanced.
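
To illustrate why the ownership rule matters for row balance, here is a toy Python sketch comparing a lowest-rank rule with a round-robin rule. It is not the Aero or STK implementation. The round-robin rule used here (pick among the sharing ranks by global node id modulo the number of sharers) is an assumption chosen only because every rank can evaluate it identically without communication, and the mesh layout is invented.

```python
from collections import Counter


def owner_lowest_rank(node_id, sharing_ranks):
    """Owner is the smallest rank that shares the node."""
    return min(sharing_ranks)


def owner_round_robin(node_id, sharing_ranks):
    """Assumed round-robin rule: cycle through the sharers by global node id."""
    ranks = sorted(sharing_ranks)
    return ranks[node_id % len(ranks)]


def owned_counts(shared_nodes, policy, nranks):
    """Count how many shared nodes each rank ends up owning under a policy."""
    counts = Counter()
    for node_id, ranks in shared_nodes.items():
        counts[policy(node_id, ranks)] += 1
    return [counts[r] for r in range(nranks)]


# Toy layout: rank 0 shares 100 nodes with each of ranks 1, 2 and 3.
shared_nodes = {}
next_id = 0
for other in (1, 2, 3):
    for _ in range(100):
        shared_nodes[next_id] = (0, other)
        next_id += 1

print("lowest rank:", owned_counts(shared_nodes, owner_lowest_rank, 4))
print("round robin:", owned_counts(shared_nodes, owner_round_robin, 4))
```

In this toy layout the lowest-rank rule hands all 300 shared nodes to rank 0, while the round-robin rule leaves rank 0 with 150 and the others with 50 each. Real partitions also contain interior nodes, so the effect on total rows per proc would be smaller than the toy suggests.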

spdomin commented 7 years ago

Latest push by @alanw0 provides the following differences:

First, the quantity of ghosting has gone down:

Old:

NonConformal alg will ghost a number of entities: 5285506

New:

NonConformal alg will ghost a new number of entities: 1242 and remove 12461 entities from ghosting.

Timing also improved (see push):

https://github.com/NaluCFD/Nalu/commit/4cca5bae07624abdaf356063dd07d301473eecd8

spdomin commented 7 years ago

Transition to Jira.