TDycores-Project / TDycore


MPFAO model initialization takes a long time. #91

Open. jeff-cohere opened this issue 4 years ago.

jeff-cohere commented 4 years ago

This has been on our radar for a while, so I'm creating an issue to track progress. Now that we've instrumented TDycore with timers, I'm able to see where the time is being spent in initialization.

Profiling Notes

I'm doing profiling in the jeff-cohere/mpfao_init_profiling branch. Here's what I'm running to profile the initialization issue:

cd demo/richards
make
./richards_driver -dim 3 -Nx 100 -Ny 100 -Nz 10 -tdy_timers -final_time 300

This runs a short-ish simulation with timers turned on. The resulting profile log, tdycore_profile.csv, can be loaded into a spreadsheet (or Pandas dataframe or whatever). So far, it looks like we're spending a lot of initialization time in these functions:

The preallocation entry is telling. If we're not giving PETSc any clues about the non-zero structure of the matrix when we preallocate the Jacobian, PETSc is probably doing a lot of work to figure it out on its own. My guess is that we can pass along some information to help it out. I've never used the DMPlex interface, though, so I'll have to look into what this means.
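To illustrate the kind of hint I have in mind, here's a minimal sketch (a hypothetical helper, not TDycore's actual code) that preallocates a Jacobian explicitly, assuming we knew an upper bound on nonzeros per row from the MPFA-O stencil:

#include <petscmat.h>

/* Hypothetical sketch: create a Jacobian with explicit preallocation hints.
 * nLocalCells and maxNZPerRow are assumed inputs (e.g., derived from the
 * MPFA-O stencil), not quantities TDycore currently computes this way. */
PetscErrorCode CreateJacobianWithHints(MPI_Comm comm, PetscInt nLocalCells,
                                       PetscInt maxNZPerRow, Mat *J)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = MatCreate(comm, J); CHKERRQ(ierr);
  ierr = MatSetSizes(*J, nLocalCells, nLocalCells, PETSC_DETERMINE, PETSC_DETERMINE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(*J); CHKERRQ(ierr);
  /* Same nonzero bound for the diagonal and off-diagonal blocks; only the
   * call matching the matrix type (SeqAIJ or MPIAIJ) takes effect. */
  ierr = MatSeqAIJSetPreallocation(*J, maxNZPerRow, NULL); CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(*J, maxNZPerRow, NULL, maxNZPerRow, NULL); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

Whether this would actually beat what DMPlexPrealloc does for us is an open question; it's just the shape of the hint.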

FYI @bishtgautam

bishtgautam commented 4 years ago

@jeff-cohere It would be interesting to see how the initialization time is scaling with the number of grid cells.

@knepley, @jedbrown, Any ideas on how to decrease time spent in DMCreateMat? The relevant code is at https://github.com/TDycores-Project/TDycore/blob/3bb9c8552ac3d8503d31cde13e26208ca08bd0dd/src/tdycore.c#L1423

knepley commented 4 years ago

It should not take this much time. I will look at it.

Thanks,

 Matt


jeff-cohere commented 4 years ago

Here are some timings (in seconds) for runs at different resolutions:

Nx Ny Nz TDyDriverInitializeTDy DMCreateMat TDyMPFAOInitialize
20 20 10 2.9 1.9 0.82
40 20 10 5.8 3.7 1.7
40 40 10 12 7.5 3.3
80 80 10 49 32 13
160 160 10 204 132 57

Looks like the initialization cost scales roughly linearly with the problem size at these resolutions: going from 20x20x10 to 160x160x10 multiplies the cell count by 64, and TDyDriverInitializeTDy goes from 2.9 s to 204 s, a factor of about 70.

bishtgautam commented 4 years ago

It is great that the initialization cost is scaling linearly, though DMCreateMatrix is about 65% of the cost.

ghammond86 commented 4 years ago

@jeff-cohere Any information regarding how the initialization scales in parallel, or did I miss something?

jeff-cohere commented 4 years ago

I haven't done parallel profiling yet. I think I ran into a problem running this demo in parallel at first. I'll try again and see how it works, and then I'll do a little "scaling study."

That said, it seems to me that we might be able to improve these numbers even in the serial case.

jeff-cohere commented 4 years ago

(deleted crash to reduce clutter in this issue--see #93 for details)

bishtgautam commented 4 years ago

I will debug the error in parallel.

knepley commented 4 years ago

@jeff-cohere I just ran this on my laptop. The numbers do not agree with the above. You can see that preallocation takes 6s, and the major expense in SetUp is ComputeAinvB for some reason. Also, there are some bogus events, so I think something is slightly wrong with the profiling setup.

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./richards_driver on a arch-master-opt named MacBook-Pro.fios-router.home with 1 processor, by knepley Fri Oct  9 14:00:40 2020
Using Petsc Development GIT revision: v3.13.4-671-g69d777391a  GIT Date: 2020-08-09 14:52:40 -0400

                         Max       Max/Min     Avg       Total
Time (sec):           4.508e+02     1.000   4.508e+02
Objects:              1.510e+02     1.000   1.510e+02
Flop:                 1.558e+10     1.000   1.558e+10  1.558e+10
Flop/sec:             3.456e+07     1.000   3.456e+07  3.456e+07
Memory:               7.979e+08     1.000   7.979e+08  7.979e+08
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 1.2042e+02  26.7%  1.5562e+10  99.9%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
 1:   TDycore Setup: 3.3033e+02  73.3%  1.4400e+07   0.1%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          1 1.0 2.0813e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
DMPlexStratify         1 1.0 8.9059e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
DMPlexSymmetrize       1 1.0 3.2729e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetGraph             1 1.0 6.9652e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFSetUp                1 1.0 1.4129e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFBcastOpBegin       175 1.0 2.2303e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFBcastOpEnd         175 1.0 4.2343e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFPack               175 1.0 3.5286e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SFUnpack             175 1.0 2.4557e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatMult              924 1.0 3.5922e+00 1.0 5.86e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 38  0  0  0   3 38  0  0  0  1631
MatSolve             826 1.0 2.9674e+00 1.0 4.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 26  0  0  0   2 26  0  0  0  1356
MatLUFactorNum        77 1.0 3.0409e+00 1.0 1.66e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   3 11  0  0  0   546
MatILUFactorSym        1 1.0 1.5237e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyBegin     154 1.0 3.0756e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd       154 1.0 4.1962e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetRowIJ            1 1.0 1.2875e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatGetOrdering         1 1.0 1.6529e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatZeroEntries        77 1.0 7.1313e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecDot                77 1.0 6.7339e-01 1.0 1.54e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0    23
VecMDot              749 1.0 2.3295e-01 1.0 1.04e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  7  0  0  0   0  7  0  0  0  4446
VecNorm              995 1.0 3.0445e-02 1.0 1.99e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   0  1  0  0  0  6536
VecScale             898 1.0 7.1262e+00 1.0 8.28e+07 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   6  1  0  0  0    12
VecCopy              239 1.0 1.7945e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet               292 1.0 1.9163e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecAXPY               77 1.0 6.7493e-01 1.0 1.54e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0    23
VecWAXPY              78 1.0 8.4722e-03 1.0 7.90e+06 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0   932
VecMAXPY             826 1.0 2.2327e-01 1.0 1.19e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  8  0  0  0   0  8  0  0  0  5310
VecReduceArith       154 1.0 7.7238e-03 1.0 3.08e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0  3988
VecReduceComm         77 1.0 8.3923e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecNormalize         826 1.0 7.1348e+00 1.0 2.48e+08 1.0 0.0e+00 0.0e+00 0.0e+00  2  2  0  0  0   6  2  0  0  0    35
ComputePlexGeometry    2 1.0 5.2172e-01 1.0 1.81e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0    35
TDyResetDiscMethod     1 1.0 6.2351e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
SNESSolve             20 1.0 1.2109e+02 1.0 1.55e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27100  0  0  0 101100  0  0  0   128
SNESSetUp              1 1.0 1.1328e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SNESFunctionEval      98 1.0 1.2278e+01 1.0 1.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00  3 12  0  0  0  10 12  0  0  0   149
SNESJacobianEval      77 1.0 8.6496e+01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19  9  0  0  0  72  9  0  0  0    17
SNESLineSearch        77 1.0 1.0571e+01 1.0 1.92e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 12  0  0  0   9 12  0  0  0   181
KSPSetUp              77 1.0 4.1515e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
KSPSolve              77 1.0 1.9853e+01 1.0 1.18e+10 1.0 0.0e+00 0.0e+00 0.0e+00  4 76  0  0  0  16 76  0  0  0   595
KSPGMRESOrthog       749 1.0 4.3466e-01 1.0 2.07e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0 13  0  0  0   0 13  0  0  0  4766
PCSetUp               77 1.0 3.8116e+00 1.0 1.66e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 11  0  0  0   3 11  0  0  0   436
PCApply              826 1.0 2.9680e+00 1.0 4.03e+09 1.0 0.0e+00 0.0e+00 0.0e+00  1 26  0  0  0   2 26  0  0  0  1356
Warning -- total time of event greater than time of entire stage -- something is wrong with the timer
TDyTimeIntegRunToTime  1 1.0 1.2221e+02 1.0 1.55e+10 1.0 0.0e+00 0.0e+00 0.0e+00 27100  0  0  0 101100  0  0  0   127
TDyPreSolveSNESSolver 20 1.0 9.1538e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
TDyMPFAOSNESPreSolv3D 20 1.0 9.0719e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
TDyMPFAOSNESFunc3D    98 1.0 1.0594e+01 1.0 1.83e+09 1.0 0.0e+00 0.0e+00 0.0e+00  2 12  0  0  0   9 12  0  0  0   173
TDyMPFAOIFuncVert3D   98 1.0 1.7647e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   1  0  0  0  0     0
TDyMPFAOSNESJacob3D   77 1.0 8.5174e+01 1.0 1.43e+09 1.0 0.0e+00 0.0e+00 0.0e+00 19  9  0  0  0  71  9  0  0  0    17
TDyMPFAOIJacobVert3D  77 1.0 8.1565e+01 1.0 1.42e+09 1.0 0.0e+00 0.0e+00 0.0e+00 18  9  0  0  0  68  9  0  0  0    17
TDyPostSolvSNESSolver 20 1.0 1.2350e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0

--- Event Stage 1: TDycore Setup

DMCreateMat            2 1.0 7.9277e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
DMPlexPrealloc         2 1.0 6.8801e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
MatAssemblyBegin       3 1.0 4.0531e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
MatAssemblyEnd         3 1.0 1.2811e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecCopy                1 1.0 9.3937e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSet                 7 1.0 4.8368e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
VecSetRandom           1 1.0 1.6370e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyDriverInitializeTDy 1 1.0 3.2491e+02 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00 72  0  0  0  0  98100  0  0  0     0
TDySetDiscMethod       1 1.0 5.5725e+00 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2100  0  0  0     3
TDyResetDiscMethod     1 1.0 0.0000e+00 0.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOInitialize     1 1.0 5.5725e+00 1.0 1.44e+07 1.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2100  0  0  0     3
TDySaveClosures        1 1.0 2.9792e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Cells  1 1.0 1.0283e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Elem   3 1.0 2.9781e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Vert   1 1.0 1.0362e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySaveClosures_Faces  1 1.0 9.1378e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyGetNumCellVertClsr  1 1.0 8.1451e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxCellSharVert     2 1.0 1.5831e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxElmTypeSharElm   4 1.0 3.2248e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxFaceSharVert     1 1.0 8.1780e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMaxEdgeSharVert     1 1.0 8.2569e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyUnitNormalTri 1872000 1.0 3.4206e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyNormalTri     4272000 1.0 2.1912e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyNormalQuad     312000 1.0 3.4739e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCmpVolTet      800000 1.0 4.6654e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyQuadArea      2400000 1.0 1.4797e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyTriangleArea  9600000 1.0 5.5103e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SetPermFromFunction    1 1.0 6.5720e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOAllocBdVal     1 1.0 2.5988e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOAllocSrcSnkVal 1 1.0 2.4414e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyMPFAOSetup          1 1.0 3.0668e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
ComputeGMatrix         1 1.0 1.4023e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCmpGMatrix3DMesh    1 1.0 1.4021e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
CmpTransMat            1 1.0 3.0654e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
TDyCmpTransMat3DMesh   1 1.0 3.0654e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
CmpTransMatBdVert      8 1.0 7.5817e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
CmpTransMatCrnVert112203 1.0 3.0642e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 68  0  0  0  0  93  0  0  0  0     0
GetUpDnBdFaces    112203 1.0 8.8074e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ComputeCandFmat   112203 1.0 2.2279e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ExtrcsubFMatrices 224406 1.0 3.2624e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ExtrcsubCMatrices 224406 1.0 8.6959e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
ComputeAinvB      112203 1.0 3.0217e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 67  0  0  0  0  91  0  0  0  0     0
TDyMPFAOSetFromOptions 1 1.0 8.1871e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
SetPoroFromFunction    1 1.0 2.8300e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCreateVectors       1 1.0 2.9350e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyCreateJacobian      1 1.0 7.9277e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0   2  0  0  0  0     0
TDySetSNESFunction     1 1.0 9.8397e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDySetSNESJacobian     1 1.0 1.5974e-05 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
TDyUpdateState       118 1.0 5.4112e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   2  0  0  0  0     0

--- Event Stage 2: Unknown

--- Event Stage 3: Unknown

jedbrown commented 4 years ago

Let me know if you'd like me to profile with more parallelism (I have a 64-core node that's convenient for such things).

jeff-cohere commented 4 years ago

Thanks, guys. Matt, thanks for running this. I've not used PETSc's profiling tools before, so I'm not surprised to learn that I'm doing it wrong. :-) How did you get your performance report?

knepley commented 4 years ago

You can use -log_view. The names are a bit long, so I chopped them all down to get this view. You might consider changing them in the source.
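For context, the stages and events in that report come from PETSc's logging API; here is a minimal sketch of registering them (the names are illustrative, and TDycore's timers may be wired up differently):

#include <petscsys.h>

static PetscLogStage setupStage;
static PetscLogEvent setupEvent;

/* Register a stage and an event once at startup; -log_view then reports
 * time, flop counts, and call counts per stage and per event. */
PetscErrorCode RegisterTimers(void)
{
  PetscErrorCode ierr;
  PetscClassId   classid;

  PetscFunctionBegin;
  ierr = PetscClassIdRegister("TDy", &classid); CHKERRQ(ierr);
  ierr = PetscLogStageRegister("TDycore Setup", &setupStage); CHKERRQ(ierr);
  ierr = PetscLogEventRegister("TDyMPFAOInitialize", classid, &setupEvent); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

The code being timed is then bracketed with PetscLogStagePush(setupStage) / PetscLogStagePop() and PetscLogEventBegin(setupEvent,0,0,0,0) / PetscLogEventEnd(setupEvent,0,0,0,0), and -log_view prints the summary at PetscFinalize().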

jeff-cohere commented 4 years ago

Okay, after updating to PETSc release 3.14 and building optimized(!), I've reproduced results more in line with what Matt has shown, though I don't get the messages about mismatched total times, and some of the times seem pretty different. Maybe I'll put together a script that generates easily-interpreted output that we can all run to make sure we're doing the same thing.

jeff-cohere commented 4 years ago

All right. Here's how things are looking for runs of 20x20x10 up through 160x160x10 using 1 to 4 processes (run until simulation time t = 3, so mostly problem setup). PETSc 3.14, built optimized. It seems like the setup stuff is scaling linearly, though the TDyDriverInitializeTDy call looks like it scales slightly worse.

I can investigate on another box tomorrow with some larger runs. Does this look right to everyone?

jeff-cohere commented 3 years ago

Just some notes here from Gautam for myself, since I'm getting back to this task:

  1. To run a large problem with TDycore:
export N=100
./transient_snes_mpfaof90 \
-tdy_water_density                 exponential \
-tdy_mpfao_gmatrix_method          MPFAO_GMATRIX_TPF \
-tdy_mpfao_boundary_condition_type MPFAO_NEUMANN_BC \
-max_steps 1 -ic_value 90325 \
-nx $N -ny $N -nz $N
  2. To run the PFLOTRAN problem setup: -pflotranin performance.in. Modify NXYZ to increase the number of DOFs.

The physical problem setup (i.e., permeability values, BCs) is not the same for the two models. Since we are only interested in the performance of model setup, the differences in problem setup can be neglected.

jeff-cohere commented 3 years ago

Latest update (@bishtgautam, you have these results already--I'm just sticking them here): it seems like we're spending most of our time computing cell geometry. Maybe we can take a look and see whether we can avoid recomputing it where that's unnecessary.

I've added some additional timers in the mpfao-init-profiling branch.

Running:

mpiexec -n 8 ./transient_snes_mpfaof90 \
-tdy_water_density                 exponential \
-tdy_mpfao_gmatrix_method          MPFAO_GMATRIX_TPF \
-tdy_mpfao_boundary_condition_type MPFAO_NEUMANN_BC \
-max_steps 1 -ic_value 90325 \
-nx 100 -ny 100 -nz 10 \
-tdy_timers

Profiling results (running top20 with tdyprof):

          Stage Name                               Event Name             Time             FLOP
          Main Stage                      ComputePlexGeometry           116.53      2.58312e+06
          Main Stage             DMPlexComputeCellGeometryFVM          116.482      2.58312e+06
          Main Stage                 TDySetupNumericalMethods          88.3628      1.92744e+06
       TDycore Setup                       TDyMPFAOInitialize          88.3627      1.92744e+06
       TDycore Setup                             TDyBuildMesh          80.4514      2.05462e+06
       TDycore Setup                   SetupSubcellsFor3DMesh          79.6353      2.04336e+06
       TDycore Setup DMPlexComputeCellGeometryFVM (SetupSubce          78.4626      2.04336e+06
          Main Stage                                SNESSolve          70.9354      3.27254e+10
          Main Stage                                 KSPSolve          63.7017      3.26548e+10
          Main Stage                         DMPlexDistribute          36.8913               0.
          Main Stage                           Mesh Partition          34.7262               0.
          Main Stage                          DMPlexDistOvrlp           32.461               0.
          Main Stage                         DMPlexPartStrtSF          32.1641               0.
          Main Stage                                  MatMult          25.2142      1.00584e+10
          Main Stage                                  PCApply          23.3853      9.60692e+09
          Main Stage                                 MatSolve          22.7797      9.88347e+09
          Main Stage                           KSPGMRESOrthog          14.7776       1.2208e+10
          Main Stage                        TDyCreateJacobian          10.3166               0.
          Main Stage                              DMCreateMat          10.3165               0.
          Main Stage                           DMPlexPrealloc          10.3097               0.

knepley commented 3 years ago

Hmm, it seems like ComputeCellGeometryFVM() is taking a very long time. That is very strange. Is it being called once? Also, it seems like all the time in distribution comes from stratifying the SF. @jedbrown maybe we need to look at that. I will try to get a clean timing like this in a PETSc example.
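For reference, the DMPlexDistribute / DMPlexDistOvrlp entries in that table correspond to the usual Plex distribution idiom; a minimal sketch, assuming the driver distributes with one layer of overlap cells (the actual overlap value in the demo may differ):

#include <petscdmplex.h>

/* Sketch of mesh distribution with overlap; per the profile above, most of
 * this step's time goes into stratifying the SF. */
PetscErrorCode DistributeMesh(DM *dm)
{
  PetscErrorCode ierr;
  DM             dmDist = NULL;

  PetscFunctionBegin;
  ierr = DMPlexDistribute(*dm, 1, NULL, &dmDist); CHKERRQ(ierr);
  if (dmDist) { /* NULL when running on a single rank */
    ierr = DMDestroy(dm); CHKERRQ(ierr);
    *dm = dmDist;
  }
  PetscFunctionReturn(0);
}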

jedbrown commented 3 years ago

That would be really helpful, @knepley. We can run tests on Noether (64 cores with good debug env).

jeff-cohere commented 3 years ago

Here are the calls to DMPlexComputeCellGeometryFVM:

knepley commented 3 years ago

One thing that would be helpful is to include the event count with the timing summary above.

jeff-cohere commented 3 years ago

One thing that would be helpful is to include the event count with the timing summary above.

Sure. I'm not near the machine that produced these timings today, but I'll try to get you a report with event counts soon.

bishtgautam commented 3 years ago

@knepley DMPlexComputeCellGeometryFVM is being called cell-by-cell

knepley commented 3 years ago

But do we only loop over cells once?

bishtgautam commented 3 years ago

We only loop over all cells once.
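For reference, a minimal sketch of that pattern (hypothetical names, not the actual TDycore loop): a single pass over the height-0 cells with one DMPlexComputeCellGeometryFVM call per cell, so the cost grows directly with the number of cells.

#include <petscdmplex.h>

/* Hypothetical sketch of the per-cell geometry pass being described. */
PetscErrorCode ComputeCellGeometries(DM dm)
{
  PetscErrorCode ierr;
  PetscInt       c, cStart, cEnd;
  PetscReal      volume, centroid[3];

  PetscFunctionBegin;
  ierr = DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd); CHKERRQ(ierr);
  for (c = cStart; c < cEnd; ++c) {
    /* One call per cell: request the volume and centroid; the face-normal
     * output is not meaningful for cells, so pass NULL. */
    ierr = DMPlexComputeCellGeometryFVM(dm, c, &volume, centroid, NULL); CHKERRQ(ierr);
    /* ... store volume/centroid for cell c ... */
  }
  PetscFunctionReturn(0);
}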

bishtgautam commented 3 years ago

@jeff-cohere When you get a chance, could you rebase mpfao-init-profiling? thanks.

jeff-cohere commented 3 years ago

@jeff-cohere When you get a chance, could you rebase mpfao-init-profiling? thanks.

Rebased.