Exawind / nalu-wind

Solver for wind farm simulations targeting exascale computational platforms
https://exawind.github.io/nalu-wind/

Factor out matrix graph construction from LinearSystem class #456

Closed: jhux2 closed this issue 2 months ago

jhux2 commented 4 years ago

The various linear system construction phases each create a matrix graph. This is redundant work, as almost all phases use the same connectivity graph. One exception is the "monolithic" momentum linear system.

This issue will track progress in factoring out the graph construction so that it is done as few times as possible per time step.
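
For illustration, a minimal sketch of the intended factoring, assuming a Tpetra backend (the layout and names below are illustrative, not the actual nalu-wind LinearSystem code): build the connectivity graph once, fill-complete it, and let each equation system's matrix share that static graph.

```cpp
// Illustrative only: one shared Tpetra::CrsGraph, reused by several matrices.
#include <Tpetra_Core.hpp>
#include <Tpetra_CrsGraph.hpp>
#include <Tpetra_CrsMatrix.hpp>
#include <Teuchos_RCP.hpp>
#include <vector>

int main(int argc, char* argv[]) {
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    using map_type    = Tpetra::Map<>;
    using graph_type  = Tpetra::CrsGraph<>;
    using matrix_type = Tpetra::CrsMatrix<double>;
    using LO          = map_type::local_ordinal_type;
    using GO          = map_type::global_ordinal_type;

    auto comm = Tpetra::getDefaultComm();
    const Tpetra::global_size_t numGlobalRows = 100;
    auto rowMap = Teuchos::rcp(new map_type(numGlobalRows, 0, comm));

    // Build the connectivity graph once.  A simple 1-D stencil stands in for
    // the real edge/element connectivity that nalu-wind would supply.
    auto graph = Teuchos::rcp(new graph_type(rowMap, 3));
    for (LO lr = 0; lr < static_cast<LO>(rowMap->getLocalNumElements()); ++lr) {
      const GO row = rowMap->getGlobalElement(lr);
      std::vector<GO> cols{row};
      if (row > 0) cols.push_back(row - 1);
      if (row + 1 < static_cast<GO>(numGlobalRows)) cols.push_back(row + 1);
      graph->insertGlobalIndices(row, static_cast<LO>(cols.size()), cols.data());
    }
    graph->fillComplete();

    // Each equation system's matrix reuses the same static graph, so the
    // graph-construction work is not repeated per system or per time step.
    auto momentumMatrix   = Teuchos::rcp(new matrix_type(graph));
    auto continuityMatrix = Teuchos::rcp(new matrix_type(graph));
    (void)momentumMatrix; (void)continuityMatrix;
  }
  return 0;
}
```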

jhux2 commented 4 years ago

@alanw0 Unit tests are passing with your fix, thank you.

jhux2 commented 4 years ago

I have been looking at the ablUnstableEdge runs more closely. On the CrsGraph refactor branch, Nalu-Wind takes only three nonlinear iterations in the first time step, whereas Nalu-Wind master takes four.

Within the first three nonlinear iterations, the linear solver histories are identical between the two code branches. I've also increased the linear solver verbosity so that I can see global matrix statistics. The CrsGraph refactor matrices have more entries, which is expected. MueLu also does the right thing by dropping out the zero entries when coarsening.

All of this leads me to believe the linear systems are being assembled correctly and consistently. But obviously there's a difference.

@rcknaus @sayerhs Any suggestions?

ablUnstableEdge-gold-2timestep-verbose.log.txt
ablUnstableEdge-crsgraph-2timestep-verbose.log.txt

rcknaus commented 4 years ago

@jhux2 It looks like a problem with the "EquationSystem::system_is_converged" check to me. Continuity and enthalpy meet the convergence test after 3 iterations, but momentum doesn't, so the first step should take 4 iterations. I looked at the code briefly but didn't see why this would have changed; I'll take another look, since it seems related to that check.
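
For context, the early exit from the outer nonlinear loop requires every active equation system to report convergence, so momentum failing its check should force a fourth iteration. A simplified sketch of that idea, with stand-in types rather than the actual nalu-wind implementation:

```cpp
#include <iostream>
#include <vector>

// Simplified stand-in for an equation system.  The real
// EquationSystem::system_is_converged compares a scaled nonlinear residual
// against a user-specified tolerance; the member names and defaults here are
// assumptions for illustration only.
struct EqSystemSketch {
  double scaledNonlinearResidual = 1.0;
  double convergenceTolerance = 1.0e-8;
  bool system_is_converged() const {
    return scaledNonlinearResidual < convergenceTolerance;
  }
};

// The outer nonlinear loop should stop early only if every active system
// passes its check, so momentum failing after three iterations should force
// a fourth iteration even though continuity and enthalpy already pass.
bool all_systems_converged(const std::vector<EqSystemSketch>& systems) {
  for (const auto& eqs : systems) {
    if (!eqs.system_is_converged()) return false;
  }
  return true;
}

int main() {
  std::vector<EqSystemSketch> systems(3);
  systems[0].scaledNonlinearResidual = 1.0e-10;  // continuity: converged
  systems[1].scaledNonlinearResidual = 1.0e-9;   // enthalpy:   converged
  systems[2].scaledNonlinearResidual = 1.0e-3;   // momentum:   not converged
  std::cout << std::boolalpha << all_systems_converged(systems) << "\n";  // false
  return 0;
}
```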

jhux2 commented 4 years ago

@rcknaus Thanks for the hint. The nonlinear convergence tolerance wasn't being communicated properly. With that and a couple of other fixes, the set of failing tests has been narrowed to these:

6: FAIL: airfoilRANSEdgeTrilinos.................     4.4961s 1.4686e-01 2.2813e-07
22: FAIL: ductWedge...............................     0.6109s 5.2951e-11 4.2101e-07
23: FAIL: edgeHybridFluids........................    34.2733s 3.1472e-14 3.5561e-11
24: FAIL: edgePipeCHT.............................     4.9603s 1.8287e-12 2.3133e-12
25: FAIL: ekmanSpiral.............................    11.2711s 9.8148e-07 8.9743e-05
43: FAIL: karmanVortex............................     0.4997s 2.5751e-08 9.4357e-07
50: FAIL: nonIsoEdgeOpenJet.......................    12.9589s 5.6479e-11 1.5776e-07
53: FAIL: nonIsoNonUniformEdgeOpenJet.............    34.3867s 4.0505e-08 2.2234e-06

rcknaus commented 4 years ago

Those diffs look typical of order-of-operations changes to me.
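
Why a graph refactor changes digits at all: reordering the assembly reorders floating-point sums, and floating-point addition is not associative, so residual norms can drift at round-off level. A tiny standalone illustration, unrelated to the nalu-wind code:

```cpp
#include <cstdio>

// Floating-point addition is not associative: the same terms summed in a
// different order give slightly different results, which then show up as
// small diffs in residual norms and regression-test tolerances.
int main() {
  const double a = 1.0e16, b = -1.0e16, c = 1.0;
  std::printf("(a + b) + c = %.17g\n", (a + b) + c);  // prints 1
  std::printf("a + (b + c) = %.17g\n", a + (b + c));  // prints 0
  return 0;
}
```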

jhux2 commented 4 years ago

@rcknaus Thanks. I will wait to hear from @sayerhs before I merge.

rcknaus commented 4 years ago

I noticed a performance regression related to this in some of the "edge-based" tests when running on my laptop. It looks like it's present on the dashboard too:

https://my.cdash.org/testDetails.php?test=56232170&build=1802978 [display graphs -> test time]

It looks like roughly a 50% increase in test time for some of the simpler edge-based tests.

jhux2 commented 4 years ago

@rcknaus Yes, I see it in my local builds, as well. It's not algorithmic -- before/after the PR merge, all linear iterations are identical for ablNeutralEdge. The performance difference is evident even in the matrix-vector product:

Before CrsGraph PR

Timer Name                      MinOverProcs     MeanOverProcs     MaxOverProcs
Belos: Operation Op*x           2.7279 (1243)    2.8273 (1243)     2.9788 (1243)

After CrsGraph PR

Timer Name                      MinOverProcs     MeanOverProcs     MaxOverProcs
Belos: Operation Op*x           7.6261 (1243)    8.1160 (1243)     8.3396 (1243)

I also see a difference in the ablNeutralEdgeSegregated.

Before CrsGraph PR

Timer Name                      MinOverProcs     MeanOverProcs     MaxOverProcs
Belos: Operation Op*x           1.4314 (1638)    1.4610 (1638)     1.4943 (1638)

After CrsGraph PR

Timer Name                      MinOverProcs     MeanOverProcs     MaxOverProcs
Belos: Operation Op*x           3.7999 (1638)    3.9999 (1638)     4.1238 (1638)

sayerhs commented 4 years ago

@jhux2 Interesting. So the larger graph with the extra zeros is having a performance impact? Shall we revert this? It will have a big impact on production runs.

I wonder if we should consider only doing the edge graph, and not the full element graph, for the volume terms, and whether that would restore the performance somewhat. We will still need to do the full face-elem graph, but hopefully that's a much smaller overhead.

jhux2 commented 4 years ago

Yes, I agree this should be reverted until the performance issues can be resolved.

> I wonder if we should consider only doing the edge graph, and not the full element graph, for the volume terms, and whether that would restore the performance somewhat. We will still need to do the full face-elem graph, but hopefully that's a much smaller overhead.

So this would require separate code paths depending on the discretization, as well as maintaining an additional graph for the element-based case?

sayerhs commented 4 years ago

@jhux2 If realm.realmUsesEdges_ == true, then we activate buildEdgeToNodeGraph and not buildElemToNodeGraph. I will need to check, but I am fairly certain there are no use cases where the user sets use_edges: yes in the input file and still ends up needing an element-based graph.

rcknaus commented 4 years ago

For the graph, we just mix and match on the boundaries for the most part, so I think only the buildFaceElem connections should be needed for use_edges: yes.

jhux2 commented 4 years ago

> For the graph, we just mix and match on the boundaries for the most part, so I think only the buildFaceElem connections should be needed for use_edges: yes.

I guess this is what you were saying in this comment.
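
Taken together, the proposal in the last few comments amounts to something like the sketch below. The class and method bodies are placeholders; only buildEdgeToNodeGraph, buildElemToNodeGraph, the face-elem connections, and the use_edges: yes behavior come from the thread.

```cpp
#include <iostream>

// Sketch only: placeholder class with stubbed bodies.  The method names
// buildEdgeToNodeGraph / buildElemToNodeGraph and the use_edges: yes behavior
// are taken from the discussion above; everything else is an assumption.
class GraphBuildSketch {
public:
  explicit GraphBuildSketch(bool realmUsesEdges) : usesEdges_(realmUsesEdges) {}

  void buildConnectivityGraph() {
    if (usesEdges_) {
      buildEdgeToNodeGraph();  // use_edges: yes -> edge connectivity suffices for the volume
    } else {
      buildElemToNodeGraph();  // element-based path needs the full element stencils
    }
    buildFaceElemGraph();      // boundaries mix edge and element data either way
  }

private:
  void buildEdgeToNodeGraph() { std::cout << "edge-to-node graph\n"; }
  void buildElemToNodeGraph() { std::cout << "elem-to-node graph\n"; }
  void buildFaceElemGraph()   { std::cout << "face-elem graph\n"; }

  bool usesEdges_;
};

int main() {
  GraphBuildSketch edgeScheme(/*realmUsesEdges=*/true);
  edgeScheme.buildConnectivityGraph();
  return 0;
}
```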