CEED / libCEED

CEED Library: Code for Efficient Extensible Discretizations
https://libceed.org
BSD 2-Clause "Simplified" License
199 stars 46 forks source link

GPU: persist kernels that will be reused in nonlinear and transient problems. #805

Closed jedbrown closed 2 years ago

jedbrown commented 3 years ago

Notably, grid transfer operators and diagonal assembly. Running (for example) the solid mechanics example with a breakpoint in CeedCompileCuda and collecting stack traces is informative. For example, this occurs on every level and every Newton step.

#0  CeedCompileCuda (ceed=0x555555ca17c0, source=source@entry=0x7ffff5597fa0 "extern \"C\" __global__ void noTrStrided(const CeedInt nelem, const CeedScalar *__restrict__ u,
 CeedScalar *__restrict__ v) { for (CeedInt node = blockIdx.x * blockDim.x + threadIdx.x; node < nelem*REST"..., module=0x55559c14bd50, numopts=numopts@entry=8) at /home/jed
/src/libceed/backends/cuda/ceed-cuda.c:93
#1  0x00007ffff54e5eb8 in CeedElemRestrictionCreate_Cuda (mtype=CEED_MEM_HOST, cmode=<optimized out>, indices=0x0, r=0x55559c562aa0) at /home/jed/src/libceed/backends/cuda/c
eed-cuda-restriction.c:440
#2  0x00007ffff54be790 in CeedElemRestrictionCreateStrided (ceed=0x555555ca17c0, num_elem=2880, elem_size=27, num_comp=num_comp@entry=81, l_size=6298560, strides=strides@ent
ry=0x7fffffffb75c, rstr=0x7fffffffb848) at /home/jed/src/libceed/interface/ceed-elemrestriction.c:435
#3  0x00007ffff54e2ec1 in CeedOperatorLinearAssembleQFunction_Cuda (op=<optimized out>, assembled=0x7fffffffb840, rstr=0x7fffffffb848, request=<optimized out>) at /home/jed/
src/libceed/backends/cuda/ceed-cuda-operator.c:615
#4  0x00007ffff54c5118 in CeedOperatorLinearAssembleQFunction (op=op@entry=0x55559c5283f0, assembled=assembled@entry=0x7fffffffb840, rstr=rstr@entry=0x7fffffffb848, request=
request@entry=0x7ffff5755608 <ceed_request_immediate>) at /home/jed/src/libceed/interface/ceed-preconditioning.c:1060
#5  0x00007ffff54e1189 in CeedOperatorAssembleDiagonalCore_Cuda (op=op@entry=0x55559c5283f0, assembled=assembled@entry=0x55559c5e2260, request=request@entry=0x7ffff5755608 <
ceed_request_immediate>, pointBlock=pointBlock@entry=false) at /home/jed/src/libceed/backends/cuda/ceed-cuda-operator.c:1081
#6  0x00007ffff54e296a in CeedOperatorLinearAssembleAddDiagonal_Cuda (op=0x55559c5283f0, assembled=0x55559c5e2260, request=0x7ffff5755608 <ceed_request_immediate>) at /home/
jed/src/libceed/backends/cuda/ceed-cuda-operator.c:1184
#7  0x00007ffff54c7006 in CeedOperatorLinearAssembleDiagonal (op=0x55559c5283f0, assembled=0x55559c5e2260, request=0x7ffff5755608 <ceed_request_immediate>) at /home/jed/src/
libceed/interface/ceed-preconditioning.c:1102
#8  0x0000555555574ada in GetDiag_Ceed (A=<optimized out>, D=0x55559c7021b0) at /home/jed/src/libceed/examples/solids/src/matops.c:216
#9  0x00007ffff6604056 in MatGetDiagonal_Shell (A=0x55559c1d3c70, v=0x55559c7021b0) at /home/jed/petsc/src/mat/impls/shell/shell.c:1173
#10 0x00007ffff64d6a0e in MatGetDiagonal (mat=<optimized out>, v=0x55559c7021b0) at /home/jed/petsc/src/mat/interface/matrix.c:4930
#11 0x00007ffff6bd3665 in PCSetUp_Jacobi (pc=pc@entry=0x55559c4231b0) at /home/jed/petsc/src/ksp/pc/impls/jacobi/jacobi.c:191
#12 0x00007ffff6bd3d61 in PCSetUp_Jacobi_NonSymmetric (pc=0x55559c4231b0) at /home/jed/petsc/src/ksp/pc/impls/jacobi/jacobi.c:271
#13 PCApply_Jacobi (pc=0x55559c4231b0, x=0x55559c56eb20, y=0x55559c010630) at /home/jed/petsc/src/ksp/pc/impls/jacobi/jacobi.c:294
#14 0x00007ffff6b95a6b in PCApply (pc=0x55559c4231b0, x=0x55559c56eb20, y=y@entry=0x55559c010630) at /home/jed/petsc/src/ksp/pc/interface/precon.c:445
#15 0x00007ffff6d742d4 in KSP_PCApply (ksp=0x55559c24ca10, x=<optimized out>, y=0x55559c010630) at /home/jed/petsc/include/petsc/private/kspimpl.h:382
[...]

We need to audit this and persist kernels as appropriate so that we only need to compile specialized kernels once provided the shapes don't change.

nbeams commented 3 years ago

I'm confused -- in lines 1088-1092 of ceed-cuda-operator.c, in the functionCeedOperatorAssembleDiagonalCore_Cuda, it checks whether the diagonal operator has already been built, and only calls the setup function (where the compilation occurs) if it hasn't:

// Setup
if (!impl->diag) {
    ierr = CeedOperatorAssembleDiagonalSetup_Cuda(op, pointBlock);
    CeedChkBackend(ierr);
}
jedbrown commented 3 years ago

The first issue is a few lines above, namely the CeedOperatorLinearAssembleQFunction just above that guard.

static inline int CeedOperatorAssembleDiagonalCore_Cuda(CeedOperator op,
    CeedVector assembled, CeedRequest *request, const bool pointBlock) {
  int ierr;
  Ceed ceed;
  ierr = CeedOperatorGetCeed(op, &ceed); CeedChkBackend(ierr);
  CeedOperator_Cuda *impl;
  ierr = CeedOperatorGetData(op, &impl); CeedChkBackend(ierr);

  // Assemble QFunction
  CeedVector assembledqf;
  CeedElemRestriction rstr;
  ierr = CeedOperatorLinearAssembleQFunction(op,  &assembledqf, &rstr, request);
  CeedChkBackend(ierr);
  ierr = CeedElemRestrictionDestroy(&rstr); CeedChkBackend(ierr);
  CeedScalar maxnorm = 0;
  ierr = CeedVectorNorm(assembledqf, CEED_NORM_MAX, &maxnorm);
  CeedChkBackend(ierr);

  // Setup
  if (!impl->diag) {
    ierr = CeedOperatorAssembleDiagonalSetup_Cuda(op, pointBlock);
    CeedChkBackend(ierr);
  }
nbeams commented 3 years ago

Oh, I see, that explains the element restriction stuff showing up in the compilation call as well.

jeremylt commented 2 years ago

With #811 is there more we want to do here?

jedbrown commented 2 years ago

Are there currently any kernel rebuilds within the fluids time loop or solids load increments (or CUDA or ROCm)? We can close this once we've checked that they're gone.

jeremylt commented 2 years ago

In solids/Ratel we're good. I can check for fluids.

jeremylt commented 2 years ago

Good on fluids:

compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
0 TS dt 1e-07 time 0.
1 TS dt 1e-06 time 1e-07
2 TS dt 1e-05 time 1.1e-06
3 TS dt 3.61077e-05 time 1.11e-05
4 TS dt 3.62582e-05 time 4.72077e-05
5 TS dt 3.68712e-05 time 8.34659e-05
6 TS dt 3.74745e-05 time 0.000120337
7 TS dt 3.80842e-05 time 0.000157812
8 TS dt 3.87062e-05 time 0.000195896
9 TS dt 3.93464e-05 time 0.000234602
10 TS dt 4.00104e-05 time 0.000273948

good on Ratel

compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
0 TS dt 0.1 time 0.
    0 SNES Function norm 1.319444444444e-02 
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
compiling
    1 SNES Function norm 3.053486454374e-02 
    2 SNES Function norm 1.043668772223e-03 
    3 SNES Function norm 1.717678106611e-05 
    4 SNES Function norm 1.614817227388e-10 
    5 SNES Function norm 3.769096552398e-16 
1 TS dt 0.1 time 0.1
    0 SNES Function norm 1.319444444444e-02 
    1 SNES Function norm 3.036654411669e-02 
    2 SNES Function norm 1.057633854960e-03 
    3 SNES Function norm 1.893185192170e-05 
    4 SNES Function norm 1.648310078930e-10 
    5 SNES Function norm 5.775915382844e-16 
2 TS dt 0.1 time 0.2
    0 SNES Function norm 1.319444444444e-02 
    1 SNES Function norm 3.000629493173e-02 
    2 SNES Function norm 1.056396162693e-03 
    3 SNES Function norm 1.929385269493e-05 
    4 SNES Function norm 1.672885067132e-10 
    5 SNES Function norm 9.717309069193e-16 
3 TS dt 0.1 time 0.3
    0 SNES Function norm 1.319444444444e-02 
    1 SNES Function norm 2.919679969063e-02 
    2 SNES Function norm 1.006984843377e-03 
    3 SNES Function norm 1.689161370771e-05 
    4 SNES Function norm 1.351257433559e-10 
    5 SNES Function norm 1.238094526520e-15 
4 TS dt 0.1 time 0.4
    0 SNES Function norm 1.319444444444e-02 
    1 SNES Function norm 2.775711886243e-02 
    2 SNES Function norm 8.863569803524e-04 
    3 SNES Function norm 1.213544060421e-05 
    4 SNES Function norm 7.372986610904e-11 
5 TS dt 0.1 time 0.5
    0 SNES Function norm 1.319444446375e-02 
    1 SNES Function norm 2.565751791823e-02 
    2 SNES Function norm 6.993092433606e-04 
    3 SNES Function norm 7.022824735838e-06 
    4 SNES Function norm 2.551068010392e-11 
6 TS dt 0.1 time 0.6
    0 SNES Function norm 1.319444445258e-02 
    1 SNES Function norm 2.301164630316e-02 
    2 SNES Function norm 4.827125094351e-04 
    3 SNES Function norm 3.313828245655e-06 
    4 SNES Function norm 5.467409174031e-12 
7 TS dt 0.1 time 0.7
    0 SNES Function norm 1.319444444672e-02 
    1 SNES Function norm 2.004140370955e-02 
    2 SNES Function norm 2.882154092733e-04 
    3 SNES Function norm 1.311231278011e-06 
    4 SNES Function norm 7.112126786343e-13 
8 TS dt 0.1 time 0.8
    0 SNES Function norm 1.319444444482e-02 
    1 SNES Function norm 1.701238222008e-02 
    2 SNES Function norm 1.493504680447e-04 
    3 SNES Function norm 4.393354545156e-07 
    4 SNES Function norm 5.644082863586e-14 
9 TS dt 0.1 time 0.9
    0 SNES Function norm 1.319444444448e-02 
    1 SNES Function norm 1.414861611133e-02 
    2 SNES Function norm 6.809871803714e-05 
    3 SNES Function norm 1.204273119829e-07 
    4 SNES Function norm 3.239667923424e-15 
10 TS dt 0.1 time 1.
    0 SNES Function norm 1.319444444445e-02 
    1 SNES Function norm 1.158419360419e-02 
    2 SNES Function norm 2.870337960955e-05 
    3 SNES Function norm 2.439533010569e-08 
    4 SNES Function norm 3.009243269173e-15 
11 TS dt 0.1 time 1.1
jedbrown commented 2 years ago

Great, thanks!