jeff-cohere opened this issue 4 years ago
@jeff-cohere It would be interesting to see how the initialization time is scaling with the number of grid cells.
@knepley, @jedbrown, any ideas on how to decrease the time spent in DMCreateMat? The relevant code is at https://github.com/TDycores-Project/TDycore/blob/3bb9c8552ac3d8503d31cde13e26208ca08bd0dd/src/tdycore.c#L1423
It should not take this much time. I will look at it.
Thanks,
Matt
Here are some numbers for runs at different resolutions:
| Nx | Ny | Nz | TDyDriverInitializeTDy (s) | DMCreateMat (s) | TDyMPFAOInitialize (s) |
|----|----|----|----------------------------|-----------------|------------------------|
| 20 | 20 | 10 | 2.9 | 1.9 | 0.82 |
| 40 | 20 | 10 | 5.8 | 3.7 | 1.7 |
| 40 | 40 | 10 | 12 | 7.5 | 3.3 |
| 80 | 80 | 10 | 49 | 32 | 13 |
| 160 | 160 | 10 | 204 | 132 | 57 |
Looks like the initialization cost scales linearly with the problem size at these resolutions.
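As a rough check: going from 20x20x10 (4,000 cells) to 160x160x10 (256,000 cells) is a 64x increase in cell count, while TDyDriverInitializeTDy grows from 2.9 s to 204 s, roughly 70x, so the cost is growing close to linearly in the number of cells.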
It is great that the initialization cost is scaling linearly, though DMCreateMatrix is about 65% of the cost.
@jeff-cohere Any information regarding how the initialization scales in parallel, or did I miss something?
I haven't done parallel profiling yet. I think I ran into a problem running this demo in parallel at first. I'll try again and see how it works, and then I'll do a little "scaling study."
That said, it seems to me that we might be able to improve these numbers even in the serial case.
(deleted crash to reduce clutter in this issue--see #93 for details)
I will debug the error in parallel.
@jeff-cohere I just ran this on my laptop. The numbers do not agree with the above. You can see that preallocation takes 6s, and the major expense in SetUp is ComputeAinvB for some reason. Also, there are some bogus events, so I think something is slightly wrong with the profiling setup.
```
************************************************************************************************************************
*** WIDEN YOUR WINDOW TO 120 CHARACTERS. Use 'enscript -r -fCourier9' to print this document ***
************************************************************************************************************************
---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
./richards_driver on a arch-master-opt named MacBook-Pro.fios-router.home with 1 processor, by knepley Fri Oct 9 14:00:40 2020
Using Petsc Development GIT revision: v3.13.4-671-g69d777391a GIT Date: 2020-08-09 14:52:40 -0400
Max Max/Min Avg Total
Time (sec): 4.508e+02 1.000 4.508e+02
Objects: 1.510e+02 1.000 1.510e+02
Flop: 1.558e+10 1.000 1.558e+10 1.558e+10
Flop/sec: 3.456e+07 1.000 3.456e+07 3.456e+07
Memory: 7.979e+08 1.000 7.979e+08 7.979e+08
MPI Messages: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Message Lengths: 0.000e+00 0.000 0.000e+00 0.000e+00
MPI Reductions: 0.000e+00 0.000
Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
e.g., VecAXPY() for real vectors of length N --> 2N flop
and VecAXPY() for complex vectors of length N --> 8N flop
Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- -- Message Lengths -- -- Reductions --
Avg %Total Avg %Total Count %Total Avg %Total Count %Total
0: Main Stage: 1.2042e+02 26.7% 1.5562e+10 99.9% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
1: TDycore Setup: 3.3033e+02 73.3% 1.4400e+07 0.1% 0.000e+00 0.0% 0.000e+00 0.0% 0.000e+00 0.0%
------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
Count: number of times phase was executed
Time and Flop: Max - maximum over all processors
Ratio - ratio of maximum to minimum over all processors
Mess: number of messages sent
AvgLen: average message length (bytes)
Reduct: number of global reductions
Global: entire computation
Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
%T - percent time in this phase %F - percent flop in this phase
%M - percent messages in this phase %L - percent message lengths in this phase
%R - percent reductions in this phase
Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event Count Time (sec) Flop --- Global --- --- Stage ---- Total
Max Ratio Max Ratio Max Ratio Mess AvgLen Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
BuildTwoSided 1 1.0 2.0813e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
DMPlexStratify 1 1.0 8.9059e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
DMPlexSymmetrize 1 1.0 3.2729e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetGraph 1 1.0 6.9652e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFSetUp 1 1.0 1.4129e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFBcastOpBegin 175 1.0 2.2303e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFBcastOpEnd 175 1.0 4.2343e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFPack 175 1.0 3.5286e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SFUnpack 175 1.0 2.4557e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatMult 924 1.0 3.5922e+00 1.0 5.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 38 0 0 0 3 38 0 0 0 1631
MatSolve 826 1.0 2.9674e+00 1.0 4.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 26 0 0 0 2 26 0 0 0 1356
MatLUFactorNum 77 1.0 3.0409e+00 1.0 1.66e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 3 11 0 0 0 546
MatILUFactorSym 1 1.0 1.5237e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyBegin 154 1.0 3.0756e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 154 1.0 4.1962e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetRowIJ 1 1.0 1.2875e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatGetOrdering 1 1.0 1.6529e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatZeroEntries 77 1.0 7.1313e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecDot 77 1.0 6.7339e-01 1.0 1.54e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 23
VecMDot 749 1.0 2.3295e-01 1.0 1.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 7 0 0 0 0 7 0 0 0 4446
VecNorm 995 1.0 3.0445e-02 1.0 1.99e+08 1.0 0.0e+00 0.0e+00 0.0e+00 0 1 0 0 0 0 1 0 0 0 6536
VecScale 898 1.0 7.1262e+00 1.0 8.28e+07 1.0 0.0e+00 0.0e+00 0.0e+00 2 1 0 0 0 6 1 0 0 0 12
VecCopy 239 1.0 1.7945e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 292 1.0 1.9163e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecAXPY 77 1.0 6.7493e-01 1.0 1.54e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 23
VecWAXPY 78 1.0 8.4722e-03 1.0 7.90e+06 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 932
VecMAXPY 826 1.0 2.2327e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 8 0 0 0 0 8 0 0 0 5310
VecReduceArith 154 1.0 7.7238e-03 1.0 3.08e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 3988
VecReduceComm 77 1.0 8.3923e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecNormalize 826 1.0 7.1348e+00 1.0 2.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00 2 2 0 0 0 6 2 0 0 0 35
ComputePlexGeometry 2 1.0 5.2172e-01 1.0 1.81e+07 1.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 35
TDyResetDiscMethod 1 1.0 6.2351e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
SNESSolve 20 1.0 1.2109e+02 1.0 1.55e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 100 0 0 0 101 100 0 0 0 128
SNESSetUp 1 1.0 1.1328e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SNESFunctionEval 98 1.0 1.2278e+01 1.0 1.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 3 12 0 0 0 10 12 0 0 0 149
SNESJacobianEval 77 1.0 8.6496e+01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19 9 0 0 0 72 9 0 0 0 17
SNESLineSearch 77 1.0 1.0571e+01 1.0 1.92e+09 1.0 0.0e+00 0.0e+00 0.0e+00 2 12 0 0 0 9 12 0 0 0 181
KSPSetUp 77 1.0 4.1515e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
KSPSolve 77 1.0 1.9853e+01 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 0.0e+00 4 76 0 0 0 16 76 0 0 0 595
KSPGMRESOrthog 749 1.0 4.3466e-01 1.0 2.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00 0 13 0 0 0 0 13 0 0 0 4766
PCSetUp 77 1.0 3.8116e+00 1.0 1.66e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 11 0 0 0 3 11 0 0 0 436
PCApply 826 1.0 2.9680e+00 1.0 4.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00 1 26 0 0 0 2 26 0 0 0 1356
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
TDyTimeIntegRunToTime 1 1.0 1.2221e+02 1.0 1.55e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27 100 0 0 0 101 100 0 0 0 127
TDyPreSolveSNESSolver 20 1.0 9.1538e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0
TDyMPFAOSNESPreSolv3D 20 1.0 9.0719e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0
TDyMPFAOSNESFunc3D 98 1.0 1.0594e+01 1.0 1.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00 2 12 0 0 0 9 12 0 0 0 173
TDyMPFAOIFuncVert3D 98 1.0 1.7647e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 1 0 0 0 0 0
TDyMPFAOSNESJacob3D 77 1.0 8.5174e+01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19 9 0 0 0 71 9 0 0 0 17
TDyMPFAOIJacobVert3D 77 1.0 8.1565e+01 1.0 1.42e+09 1.0 0.0e+00 0.0e+00 0.0e+00 18 9 0 0 0 68 9 0 0 0 17
TDyPostSolvSNESSolver 20 1.0 1.2350e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
--- Event Stage 1: TDycore Setup
DMCreateMat 2 1.0 7.9277e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
DMPlexPrealloc 2 1.0 6.8801e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
MatAssemblyBegin 3 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
MatAssemblyEnd 3 1.0 1.2811e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecCopy 1 1.0 9.3937e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSet 7 1.0 4.8368e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
VecSetRandom 1 1.0 1.6370e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyDriverInitializeTDy 1 1.0 3.2491e+02 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00 72 0 0 0 0 98 100 0 0 0 0
TDySetDiscMethod 1 1.0 5.5725e+00 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 2 100 0 0 0 3
TDyResetDiscMethod 1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMPFAOInitialize 1 1.0 5.5725e+00 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 2 100 0 0 0 3
TDySaveClosures 1 1.0 2.9792e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDySaveClosures_Cells 1 1.0 1.0283e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDySaveClosures_Elem 3 1.0 2.9781e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDySaveClosures_Vert 1 1.0 1.0362e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDySaveClosures_Faces 1 1.0 9.1378e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyGetNumCellVertClsr 1 1.0 8.1451e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMaxCellSharVert 2 1.0 1.5831e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMaxElmTypeSharElm 4 1.0 3.2248e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMaxFaceSharVert 1 1.0 8.1780e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMaxEdgeSharVert 1 1.0 8.2569e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyUnitNormalTri 1872000 1.0 3.4206e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyNormalTri 4272000 1.0 2.1912e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyNormalQuad 312000 1.0 3.4739e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyCmpVolTet 800000 1.0 4.6654e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyQuadArea 2400000 1.0 1.4797e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyTriangleArea 9600000 1.0 5.5103e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SetPermFromFunction 1 1.0 6.5720e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMPFAOAllocBdVal 1 1.0 2.5988e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMPFAOAllocSrcSnkVal 1 1.0 2.4414e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyMPFAOSetup 1 1.0 3.0668e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68 0 0 0 0 93 0 0 0 0 0
ComputeGMatrix 1 1.0 1.4023e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyCmpGMatrix3DMesh 1 1.0 1.4021e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
CmpTransMat 1 1.0 3.0654e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68 0 0 0 0 93 0 0 0 0 0
TDyCmpTransMat3DMesh 1 1.0 3.0654e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68 0 0 0 0 93 0 0 0 0 0
CmpTransMatBdVert 8 1.0 7.5817e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
CmpTransMatCrnVert 112203 1.0 3.0642e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68 0 0 0 0 93 0 0 0 0 0
GetUpDnBdFaces 112203 1.0 8.8074e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
ComputeCandFmat 112203 1.0 2.2279e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
ExtrcsubFMatrices 224406 1.0 3.2624e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
ExtrcsubCMatrices 224406 1.0 8.6959e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
ComputeAinvB 112203 1.0 3.0217e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 67 0 0 0 0 91 0 0 0 0 0
TDyMPFAOSetFromOptions 1 1.0 8.1871e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
SetPoroFromFunction 1 1.0 2.8300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyCreateVectors 1 1.0 2.9350e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyCreateJacobian 1 1.0 7.9277e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 2 0 0 0 0 0
TDySetSNESFunction 1 1.0 9.8397e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDySetSNESJacobian 1 1.0 1.5974e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0
TDyUpdateState 118 1.0 5.4112e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 2 0 0 0 0 0
--- Event Stage 2: Unknown
--- Event Stage 3: Unknown
```
Let me know if you'd like me to profile with more parallelism (I have a 64-core node that's convenient for such things).
Thanks, guys. Matt, thanks for running this. I've not used PETSc's profiling tools before, so I'm not surprised to learn that I'm doing it wrong. :-) How did you get your performance report?
You can use -log_view. The names are a bit long, so I chopped them all down to get this view. You might consider changing them in the source.
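For reference, a minimal sketch of the standard PETSc logging calls for registering a stage and a short-named event so they read well under -log_view (the class, stage, and event names here are placeholders, not TDycore's actual identifiers):

```c
#include <petscsys.h>

/* Sketch: register a stage and a short-named event for -log_view output.
 * The names below are placeholders. */
static PetscClassId  tdy_classid;
static PetscLogStage setup_stage;
static PetscLogEvent mpfao_init_event;

static PetscErrorCode RegisterShortTimers(void)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscClassIdRegister("TDy", &tdy_classid);CHKERRQ(ierr);
  ierr = PetscLogStageRegister("TDycore Setup", &setup_stage);CHKERRQ(ierr);
  ierr = PetscLogEventRegister("MPFAOInit", tdy_classid, &mpfao_init_event);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```

The timed region is then bracketed with PetscLogStagePush(setup_stage)/PetscLogStagePop() and PetscLogEventBegin/PetscLogEventEnd on mpfao_init_event, and the results appear under those names in the -log_view tables.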
Okay, after updating to PETSc release 3.14 and building optimized(!), I've reproduced results more in line with what Matt has shown, though I don't get the messages about mismatched total times, and some of the times seem pretty different. Maybe I'll put together a script that generates easily-interpreted output that we can all run to make sure we're doing the same thing.
All right. Here's how things are looking for runs of 20x20x10 up through 160x160x10 using 1 to 4 processes (run until simulation time t = 3, so mostly problem setup). PETSc 3.14, built optimized. It seems like the setup stuff is scaling linearly, though the TDyDriverInitializeTDy call looks like it scales slightly worse.
I can investigate on another box tomorrow with some larger runs. Does this look right to everyone?
Just some notes here from Gautam for myself, since I'm getting back to this task:
```
export N=100
./transient_snes_mpfaof90 \
  -tdy_water_density exponential \
  -tdy_mpfao_gmatrix_method MPFAO_GMATRIX_TPF \
  -tdy_mpfao_boundary_condition_type MPFAO_NEUMANN_BC \
  -max_steps 1 -ic_value 90325 \
  -nx $N -ny $N -nz $N
```
The physical problem setup (i.e., permeability values, boundary conditions) is not the same for the two models. We are only interested in the performance of the model setup, so the differences in problem setup can be neglected.
Latest update (@bishtgautam, you have these results already--I'm just sticking them here): it seems like we're spending most of our time computing cell geometry. Maybe we can take a look and see if we can avoid recomputing it where it's unnecessary.
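One option (just a sketch; the helper name and array layout are hypothetical, not code in the branch) would be to compute cell volumes and centroids in a single pass and cache them, so later setup phases reuse the cached values instead of calling DMPlexComputeCellGeometryFVM again:

```c
#include <petscdmplex.h>

/* Sketch: compute per-cell volume and centroid once and cache them in flat
 * arrays for reuse by later setup phases. Caller frees with PetscFree2(). */
static PetscErrorCode CacheCellGeometry(DM dm, PetscReal **vol, PetscReal **centroid)
{
  PetscInt       dim, cStart, cEnd, c;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = DMGetDimension(dm, &dim);CHKERRQ(ierr);
  ierr = DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd);CHKERRQ(ierr); /* height 0 = cells */
  ierr = PetscMalloc2(cEnd - cStart, vol, dim*(cEnd - cStart), centroid);CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {
    ierr = DMPlexComputeCellGeometryFVM(dm, c, &(*vol)[c - cStart],
                                        &(*centroid)[dim*(c - cStart)], NULL);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}
```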
I've added some additional timers in the mpfao-init-profiling branch.
Running:
```
mpiexec -n 8 ./transient_snes_mpfaof90 \
  -tdy_water_density exponential \
  -tdy_mpfao_gmatrix_method MPFAO_GMATRIX_TPF \
  -tdy_mpfao_boundary_condition_type MPFAO_NEUMANN_BC \
  -max_steps 1 -ic_value 90325 \
  -nx 100 -ny 100 -nz 10 \
  -tdy_timers
```
Profiling results (running top20 with tdyprof):
| Stage | Event | Time (s) | FLOP |
|-------|-------|----------|------|
| Main Stage | ComputePlexGeometry | 116.53 | 2.58312e+06 |
| Main Stage | DMPlexComputeCellGeometryFVM | 116.482 | 2.58312e+06 |
| Main Stage | TDySetupNumericalMethods | 88.3628 | 1.92744e+06 |
| TDycore Setup | TDyMPFAOInitialize | 88.3627 | 1.92744e+06 |
| TDycore Setup | TDyBuildMesh | 80.4514 | 2.05462e+06 |
| TDycore Setup | SetupSubcellsFor3DMesh | 79.6353 | 2.04336e+06 |
| TDycore Setup | DMPlexComputeCellGeometryFVM (SetupSubce | 78.4626 | 2.04336e+06 |
| Main Stage | SNESSolve | 70.9354 | 3.27254e+10 |
| Main Stage | KSPSolve | 63.7017 | 3.26548e+10 |
| Main Stage | DMPlexDistribute | 36.8913 | 0. |
| Main Stage | Mesh Partition | 34.7262 | 0. |
| Main Stage | DMPlexDistOvrlp | 32.461 | 0. |
| Main Stage | DMPlexPartStrtSF | 32.1641 | 0. |
| Main Stage | MatMult | 25.2142 | 1.00584e+10 |
| Main Stage | PCApply | 23.3853 | 9.60692e+09 |
| Main Stage | MatSolve | 22.7797 | 9.88347e+09 |
| Main Stage | KSPGMRESOrthog | 14.7776 | 1.2208e+10 |
| Main Stage | TDyCreateJacobian | 10.3166 | 0. |
| Main Stage | DMCreateMat | 10.3165 | 0. |
| Main Stage | DMPlexPrealloc | 10.3097 | 0. |
Hmm, it seems like ComputeCellGeometryFVM() is taking a very long time. That is very strange. Is it being called once? Also, it seems like all the time in distribution comes from stratifying the SF. @jedbrown maybe we need to look at that. I will try to get a clean timing like this in a PETSc example.
That would be really helpful, @knepley. We can run tests on Noether (64 cores with good debug env).
One thing that would be helpful is to include the event count with the timing summary above.
Sure. I'm not near the machine that produced these timings today, but I'll try to get you a report with event counts soon.
@knepley DMPlexComputeCellGeometryFVM is being called cell-by-cell.
But do we only loop over cells once?
We only loop over all cells once.
@jeff-cohere When you get a chance, could you rebase mpfao-init-profiling? Thanks.
Rebased.
This has been on our radar for a while, so I'm creating an issue to track progress. Now that we've instrumented TDycore with timers, I'm able to see where the time is being spent in initialization.
Profiling Notes
I'm doing profiling in the jeff-cohere/mpfao_init_profiling branch. The command I'm running to profile the initialization issue runs a short-ish simulation with timers turned on. The resulting profile log, tdycore_profile.csv, can be loaded into a spreadsheet (or Pandas dataframe or whatever). So far, it looks like we're spending a lot of initialization time in these functions:

- TDyDriverInitializeTDy (74 sec)
- TDyCreateJacobian (48 sec)
- DMCreateMat (48 sec)
- DMPlexPrealloc (48 sec)
- TDySetDiscretizationMethod (21 sec)
- TDyMPFAOInitialize (21 sec)

The preallocation entry is telling. If we're not giving PETSc any clues about the non-zero structure of the matrix when we preallocate the Jacobian, PETSc is probably doing a lot of work to figure it out on its own. My guess is that we can pass along some information to help it out. I've never used the DMPlex interface, though, so I'll have to look into what this means.
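For reference, a minimal sketch of passing adjacency information to DMPlex before the Jacobian is created. The adjacency flags below are a guess for the vertex-coupled MPFA-O stencil and need to be verified against what the discretization actually touches; the helper itself is hypothetical, not TDycore code:

```c
#include <petscdmplex.h>

/* Sketch (not TDycore code): tell DMPlex which adjacency to use for
 * preallocation before DMCreateMatrix is called, so DMPlexPrealloc does not
 * have to assume a worst-case stencil. */
static PetscErrorCode CreateJacobianWithAdjacencyHint(DM dm, Mat *J)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* useCone=PETSC_TRUE, useClosure=PETSC_TRUE couples cells through any
   * shared point (vertex/edge/face); PETSC_TRUE, PETSC_FALSE would give
   * face neighbors only. Which one MPFA-O needs is an assumption to check. */
  ierr = DMSetBasicAdjacency(dm, PETSC_TRUE, PETSC_TRUE);CHKERRQ(ierr);
  ierr = DMSetMatType(dm, MATAIJ);CHKERRQ(ierr);
  ierr = DMCreateMatrix(dm, J);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```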
FYI @bishtgautam