Exawind / openturbine

A flexible multibody structural dynamics code for wind turbines
https://exawind.github.io/openturbine/
MIT License

Rework System Assembly to Remove Final Dense Matrix #191

Closed by ddement 4 days ago

ddement commented 3 weeks ago

Task: Rework the Integrate* kernels to only perform element-by-element integration, with the resulting element matrices contributed to the global sparse matrix outside of the kernel. Fuse the adjacent Integrate* calls to enable optimization.
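
The split described in the task can be sketched in plain C++ without Kokkos. This is a minimal illustration under stated assumptions, not the OpenTurbine implementation: `CsrMatrix`, `ElementMatrices`, `IntegrateElements`, and `ContributeElements` are hypothetical names, the integrand is a placeholder two-node spring stiffness, and the loops are serial where the real kernels would be parallel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container (not OpenTurbine's actual type).
struct CsrMatrix {
    std::vector<std::size_t> row_ptrs;  // size num_rows + 1
    std::vector<std::size_t> col_inds;  // sorted within each row
    std::vector<double> values;         // one value per stored column index
};

// Batched element matrices: num_elems x max_elem_dof x max_elem_dof.
struct ElementMatrices {
    std::size_t num_elems;
    std::size_t max_elem_dof;
    std::vector<double> data;

    ElementMatrices(std::size_t n, std::size_t m)
        : num_elems(n), max_elem_dof(m), data(n * m * m, 0.0) {}

    double& at(std::size_t e, std::size_t i, std::size_t j) {
        return data[(e * max_elem_dof + i) * max_elem_dof + j];
    }
    double at(std::size_t e, std::size_t i, std::size_t j) const {
        return data[(e * max_elem_dof + i) * max_elem_dof + j];
    }
};

// Phase 1: integration only. Each element writes to its own block, so
// elements do not interfere with each other in this phase.
// The integrand is a placeholder: a two-node spring with stiffness k.
void IntegrateElements(ElementMatrices& elem, double k) {
    assert(elem.max_elem_dof == 2);
    for (std::size_t e = 0; e < elem.num_elems; ++e) {
        elem.at(e, 0, 0) += k;
        elem.at(e, 0, 1) -= k;
        elem.at(e, 1, 0) -= k;
        elem.at(e, 1, 1) += k;
    }
}

// Phase 2: assembly only. Add every element entry into the global sparse
// matrix. elem_dofs[e][i] is the global dof index of local dof i of element e.
void ContributeElements(const ElementMatrices& elem,
                        const std::vector<std::vector<std::size_t>>& elem_dofs,
                        CsrMatrix& A) {
    for (std::size_t e = 0; e < elem.num_elems; ++e) {
        const auto& dofs = elem_dofs[e];
        for (std::size_t i = 0; i < dofs.size(); ++i) {
            const auto first =
                A.col_inds.begin() + static_cast<std::ptrdiff_t>(A.row_ptrs[dofs[i]]);
            const auto last =
                A.col_inds.begin() + static_cast<std::ptrdiff_t>(A.row_ptrs[dofs[i] + 1]);
            for (std::size_t j = 0; j < dofs.size(); ++j) {
                // Binary search for the column inside this row's sorted entries.
                const auto it = std::lower_bound(first, last, dofs[j]);
                assert(it != last && *it == dofs[j]);  // pattern must hold the entry
                A.values[static_cast<std::size_t>(it - A.col_inds.begin())] +=
                    elem.at(e, i, j);
            }
        }
    }
}
```

For a three-node chain with two elements sharing the middle node, phase 1 fills a 2 x 2 x 2 array and phase 2 sums the two contributions to the shared diagonal entry. No num_system_dofs x num_system_dofs storage is touched at any point.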

Why: The current Integrate* kernels perform both intra-element integration and inter-element assembly, which requires a large, dense matrix of size num_system_dofs x num_system_dofs. For the 300-element rotor, this matrix accounts for 80% of our memory usage. Splitting integration from assembly replaces the large dense matrix with a (usually) much smaller array of size num_elems x max_elem_dof x max_elem_dof. Fusing adjacent Integrate* kernels may also improve the performance of these steps and let us optimize our use of atomics.
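
The two storage sizes can be compared with a short calculation. The helper names below are hypothetical and double-precision entries are assumed; the actual dof counts for the rotor case are not given here, so any inputs are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// Bytes for a dense num_system_dofs x num_system_dofs matrix of doubles.
constexpr std::size_t DenseBytes(std::size_t num_system_dofs) {
    return num_system_dofs * num_system_dofs * sizeof(double);
}

// Bytes for batched element matrices of shape
// num_elems x max_elem_dof x max_elem_dof, also stored as doubles.
constexpr std::size_t BatchedBytes(std::size_t num_elems, std::size_t max_elem_dof) {
    return num_elems * max_elem_dof * max_elem_dof * sizeof(double);
}

// How many times larger the dense matrix is than the batched storage.
double DenseToBatchedRatio(std::size_t num_system_dofs, std::size_t num_elems,
                           std::size_t max_elem_dof) {
    const std::size_t batched = BatchedBytes(num_elems, max_elem_dof);
    assert(batched > 0);
    return static_cast<double>(DenseBytes(num_system_dofs)) /
           static_cast<double>(batched);
}
```

With made-up sizes of 2000 system dofs, 100 elements, and 30 dofs per element, the dense matrix needs 32,000,000 bytes and the batched array 720,000 bytes. The batched form wins whenever num_elems x max_elem_dof^2 is smaller than num_system_dofs^2, which is why the issue says "usually" rather than "always".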

Done means: The K_dense matrix is no longer needed for system assembly and Integrate* kernels have been appropriately fused. A new performance profile has been generated to understand memory usage and performance changes. A good faith effort to optimize the new integration kernels has been made, and a follow-on story generated if need be.

ddement commented 3 weeks ago

The first step, generating a new performance profile, is done. The profile should be compared to the one from the Sparse Solver issue. In particular, total performance is about the same, while peak memory in CUDA space is about 800 MB rather than 3.8 GB (a 79% reduction). Next, performance tuning of the new Integrate* kernels will be explored. The new profile:

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 4.11378 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 3.86e+00 sec 93.9% 20.6% 0.0% 0.0% 7.32e+02 11 Step [region]
|   |-> 3.07e+00 sec 74.6% 1.5% 0.0% 1.6% 3.37e+02 22 Solve System [region]
|   |   |-> 2.97e+00 sec 72.3% 0.0% 0.0% 0.6% 2.22e+01 22 Sparse Solver [region]
|   |   |   |-> 2.28e+00 sec 55.5% 0.0% 0.0% 100.0% 0.00e+00 22 Numeric Factorization [region]
|   |   |   |-> 6.32e-01 sec 15.4% 0.2% 0.0% 99.8% 1.04e+02 22 Symbolic Factorization [region]
|   |   |   |-> 4.21e-02 sec 1.0% 0.0% 0.0% 100.0% 0.00e+00 22 Solve [region]
|   |   |-> 3.00e-02 sec 0.7% 100.0% 0.0% ------ 44 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |-> 7.41e-01 sec 18.0% 96.8% 0.0% 3.1% 8.31e+02 22 Assemble System [region]
|   |   |-> 4.82e-01 sec 11.7% 100.0% 0.0% 0.0% 4.56e+01 22 Assemble Stiffness Matrix [region]
|   |   |   |-> 4.82e-01 sec 11.7% 100.0% 0.0% ------ 22 IntegrateStiffnessMatrix [for]
|   |   |-> 1.47e-01 sec 3.6% 99.9% 0.0% 0.1% 1.50e+02 22 Assemble Inertia Matrix [region]
|   |   |   |-> 1.47e-01 sec 3.6% 100.0% 0.0% ------ 22 IntegrateInertiaMatrix [for]
|   |   |-> 2.46e-02 sec 0.6% 100.0% 0.0% ------ 22 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |   |-> 1.53e-02 sec 0.4% 100.0% 0.0% ------ 44 ContributeElementsToSparseMatrix [for]
|   |   |-> 1.09e-02 sec 0.3% 100.0% 0.0% ------ 22 sort_crs_matrix [for]
|   |   |-> 1.05e-02 sec 0.3% 100.0% 0.0% ------ 22 KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY_SPREADTEAM [for]
|   |   |-> 5.56e-03 sec 0.1% 100.0% 0.0% ------ 66 Kokkos::ViewFill-1D [for]
|   |   |-> 5.42e-03 sec 0.1% 100.0% 0.0% ------ 44 KokkosKernels::Common::PrefixSum [scan]
|   |-> 2.97e-02 sec 0.7% 41.8% 0.0% 58.2% 1.70e+04 22 Assemble Constraints [region]
|   |-> 1.91e-02 sec 0.5% 91.0% 0.0% 9.0% 2.88e+04 22 Update State [region]
|   |   |-> 5.06e-03 sec 0.1% 100.0% 0.0% ------ 44 RotateSectionMatrix [for]
|-> 3.65e-02 sec 0.9% 100.0% 0.0% ------ 1 PopulateSparseRowPtrs_Constraints_Transpose [for]
|-> 1.35e-02 sec 0.3% 100.0% 0.0% ------ 1 PopulateSparseIndices [for]
|-> 5.05e-03 sec 0.1% 100.0% 0.0% ------ 3300 Kokkos::View::initialization [u_mirror] via memset [for]

BOTTOM-UP TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 2.28e+00 sec 55.5% 0.0% 0.0% 0.0% 0.00e+00 22 Numeric Factorization [region]
|   |-> 2.28e+00 sec 55.5% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|       |-> 2.28e+00 sec 55.5% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|           |-> 2.28e+00 sec 55.5% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 6.31e-01 sec 15.3% 0.0% 0.0% 0.0% 0.00e+00 22 Symbolic Factorization [region]
|   |-> 6.31e-01 sec 15.3% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|       |-> 6.31e-01 sec 15.3% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|           |-> 6.31e-01 sec 15.3% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 4.82e-01 sec 11.7% 100.0% 0.0% ------ 22 IntegrateStiffnessMatrix [for]
|   |-> 4.82e-01 sec 11.7% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Stiffness Matrix [region]
|       |-> 4.82e-01 sec 11.7% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|           |-> 4.82e-01 sec 11.7% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.47e-01 sec 3.6% 100.0% 0.0% ------ 22 IntegrateInertiaMatrix [for]
|   |-> 1.47e-01 sec 3.6% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Inertia Matrix [region]
|       |-> 1.47e-01 sec 3.6% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|           |-> 1.47e-01 sec 3.6% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 5.46e-02 sec 1.3% 100.0% 0.0% ------ 66 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |-> 3.00e-02 sec 0.7% 100.0% 0.0% 0.0% 0.00e+00 44 Solve System [region]
|   |   |-> 3.00e-02 sec 0.7% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|   |-> 2.46e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|       |-> 2.46e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 4.98e-02 sec 1.2% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|   |-> 4.98e-02 sec 1.2% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 4.21e-02 sec 1.0% 0.0% 0.0% 0.0% 0.00e+00 22 Solve [region]
|   |-> 4.21e-02 sec 1.0% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|       |-> 4.21e-02 sec 1.0% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|           |-> 4.21e-02 sec 1.0% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 3.65e-02 sec 0.9% 100.0% 0.0% ------ 1 PopulateSparseRowPtrs_Constraints_Transpose [for]
|-> 2.33e-02 sec 0.6% 0.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |-> 2.33e-02 sec 0.6% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.73e-02 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 22 Assemble Constraints [region]
|   |-> 1.73e-02 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.65e-02 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|   |-> 1.65e-02 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|       |-> 1.65e-02 sec 0.4% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.53e-02 sec 0.4% 100.0% 0.0% ------ 44 ContributeElementsToSparseMatrix [for]
|   |-> 1.53e-02 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 44 Assemble System [region]
|       |-> 1.53e-02 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 1.35e-02 sec 0.3% 100.0% 0.0% ------ 1 PopulateSparseIndices [for]
|-> 1.13e-02 sec 0.3% 100.0% 0.0% ------ 44 sort_crs_matrix [for]
|   |-> 1.09e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |   |-> 1.09e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.05e-02 sec 0.3% 100.0% 0.0% ------ 22 KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY_SPREADTEAM [for]
|   |-> 1.05e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|       |-> 1.05e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 7.53e-03 sec 0.2% 100.0% 0.0% ------ 110 KokkosKernels::Common::PrefixSum [scan]
|   |-> 5.42e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Assemble System [region]
|   |   |-> 5.42e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 7.48e-03 sec 0.2% 100.0% 0.0% ------ 135 Kokkos::ViewFill-1D [for]
|   |-> 5.56e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 66 Assemble System [region]
|   |   |-> 5.56e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 66 Step [region]
|-> 5.30e-03 sec 0.1% 100.0% 0.0% ------ 46 RotateSectionMatrix [for]
|   |-> 5.30e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 46 Update State [region]
|       |-> 5.06e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 5.21e-03 sec 0.1% 100.0% 0.0% ------ 66 KokkosSparse::SpAdd::Symbolic::InputSorted::CountEntries [for]
|-> 5.05e-03 sec 0.1% 100.0% 0.0% ------ 3300 Kokkos::View::initialization [u_mirror] via memset [for]

KOKKOS HOST SPACE:
===================
MAX MEMORY ALLOCATED: 84290.8 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/mat_nzvals
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/mat_colind
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/colind
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/nzval_tmp
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/indices_tmp
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/host_nzvals_view_
  12.3% Step/Solve System/values_mirror
  6.2% Step/Solve System/Sparse Solver/Symbolic Factorization/host_rows_view_
  6.2% Step/Solve System/columnIndices_mirror
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/rowPtrsUnpacked_host_
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/pointers_tmp
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/rowptr
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/lgMap_mirror
  0.2% Step/Solve System/x_mirror
  0.2% Step/Solve System/lgMap_mirror
  0.2% Step/Solve System/b_mirror
  0.1% Step/Solve System/Sparse Solver/Symbolic Factorization/host_col_ptr_view_

KOKKOS CUDA SPACE:
===================
MAX MEMORY ALLOCATED: 795689.6 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  37.3% Step/Assemble System/CopyTangentIntoSparseMatrix/Kokkos::CudaSpace::TeamScratchMemory
  35.0% gradient_matrix
  7.9% Step/Assemble System/pool data
  1.3% Step/Solve System/values
  1.3% Step/Solve System/values
  1.3% matrix_terms
  1.3% K values
  1.3% Step/Assemble System/valuesC
  1.3% Step/Assemble System/values
  0.7% Step/Solve System/entries
  0.6% Step/Solve System/entries
  0.6% indices
  0.6% Step/Assemble System/entriesC
  0.6% Step/Assemble System/entries
  0.5% qp_Mstar
  0.5% qp_Cstar
  0.5% qp_RR0
  0.5% qp_Muu
  0.5% qp_Cuu
  0.5% qp_Ouu
  0.5% qp_Puu
  0.5% qp_Quu
  0.5% qp_Guu
  0.5% qp_Kuu
  0.2% qp_E
  0.2% shape_interp
  0.2% deriv_interp
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% T dense
  0.1% T values

KOKKOS HIP SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS SYCL SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS OpenMPTarget SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

Host process high water mark memory consumption: 757924 kB

END KOKKOS PROFILING REPORT.

ddement commented 3 weeks ago

The AssembleStiffnessMatrix and AssembleInertiaMatrix kernels have had a round of optimization. Total runtime dropped from 4.11 s to 3.82 s, a reduction of about 7%.

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 3.82279 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 3.56e+00 sec 93.2% 8.7% 0.0% 0.0% 7.93e+02 11 Step [region]
|   |-> 3.26e+00 sec 85.1% 1.5% 0.0% 1.6% 3.18e+02 22 Solve System [region]
|   |   |-> 3.16e+00 sec 82.5% 0.0% 0.0% 0.6% 2.09e+01 22 Sparse Solver [region]
|   |   |   |-> 2.43e+00 sec 63.6% 0.0% 0.0% 100.0% 0.00e+00 22 Numeric Factorization [region]
|   |   |   |-> 6.58e-01 sec 17.2% 0.2% 0.0% 99.8% 1.00e+02 22 Symbolic Factorization [region]
|   |   |   |-> 4.72e-02 sec 1.2% 0.0% 0.0% 100.0% 0.00e+00 22 Solve [region]
|   |   |-> 3.00e-02 sec 0.8% 100.0% 0.0% ------ 44 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |-> 2.54e-01 sec 6.6% 90.3% 0.0% 9.6% 2.43e+03 22 Assemble System [region]
|   |   |-> 9.35e-02 sec 2.4% 99.9% 0.0% 0.1% 2.35e+02 22 Assemble Stiffness Matrix [region]
|   |   |   |-> 9.34e-02 sec 2.4% 100.0% 0.0% ------ 22 IntegrateStiffnessMatrix [for]
|   |   |-> 4.54e-02 sec 1.2% 99.8% 0.0% 0.2% 4.85e+02 22 Assemble Inertia Matrix [region]
|   |   |   |-> 4.53e-02 sec 1.2% 100.0% 0.0% ------ 22 IntegrateInertiaMatrix [for]
|   |   |-> 2.46e-02 sec 0.6% 100.0% 0.0% ------ 22 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |   |-> 1.61e-02 sec 0.4% 100.0% 0.0% ------ 44 ContributeElementsToSparseMatrix [for]
|   |   |-> 1.16e-02 sec 0.3% 100.0% 0.0% ------ 22 sort_crs_matrix [for]
|   |   |-> 1.11e-02 sec 0.3% 100.0% 0.0% ------ 22 KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY_SPREADTEAM [for]
|   |   |-> 5.87e-03 sec 0.2% 100.0% 0.0% ------ 66 Kokkos::ViewFill-1D [for]
|   |   |-> 4.15e-03 sec 0.1% 100.0% 0.0% ------ 44 KokkosKernels::Common::PrefixSum [scan]
|   |-> 3.09e-02 sec 0.8% 42.4% 0.0% 57.6% 1.64e+04 22 Assemble Constraints [region]
|   |-> 1.99e-02 sec 0.5% 91.3% 0.0% 8.7% 2.76e+04 22 Update State [region]
|   |   |-> 5.20e-03 sec 0.1% 100.0% 0.0% ------ 44 RotateSectionMatrix [for]
|-> 3.65e-02 sec 1.0% 100.0% 0.0% ------ 1 PopulateSparseRowPtrs_Constraints_Transpose [for]
|-> 1.35e-02 sec 0.4% 100.0% 0.0% ------ 1 PopulateSparseIndices [for]
|-> 5.05e-03 sec 0.1% 100.0% 0.0% ------ 3300 Kokkos::View::initialization [u_mirror] via memset [for]

BOTTOM-UP TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 2.43e+00 sec 63.6% 0.0% 0.0% 0.0% 0.00e+00 22 Numeric Factorization [region]
|   |-> 2.43e+00 sec 63.6% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|       |-> 2.43e+00 sec 63.6% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|           |-> 2.43e+00 sec 63.6% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 6.57e-01 sec 17.2% 0.0% 0.0% 0.0% 0.00e+00 22 Symbolic Factorization [region]
|   |-> 6.57e-01 sec 17.2% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|       |-> 6.57e-01 sec 17.2% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|           |-> 6.57e-01 sec 17.2% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 9.34e-02 sec 2.4% 100.0% 0.0% ------ 22 IntegrateStiffnessMatrix [for]
|   |-> 9.34e-02 sec 2.4% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Stiffness Matrix [region]
|       |-> 9.34e-02 sec 2.4% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|           |-> 9.34e-02 sec 2.4% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 5.46e-02 sec 1.4% 100.0% 0.0% ------ 66 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |-> 3.00e-02 sec 0.8% 100.0% 0.0% 0.0% 0.00e+00 44 Solve System [region]
|   |   |-> 3.00e-02 sec 0.8% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|   |-> 2.46e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|       |-> 2.46e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 5.24e-02 sec 1.4% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|   |-> 5.24e-02 sec 1.4% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 4.72e-02 sec 1.2% 0.0% 0.0% 0.0% 0.00e+00 22 Solve [region]
|   |-> 4.72e-02 sec 1.2% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|       |-> 4.72e-02 sec 1.2% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|           |-> 4.72e-02 sec 1.2% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 4.53e-02 sec 1.2% 100.0% 0.0% ------ 22 IntegrateInertiaMatrix [for]
|   |-> 4.53e-02 sec 1.2% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Inertia Matrix [region]
|       |-> 4.53e-02 sec 1.2% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|           |-> 4.53e-02 sec 1.2% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 3.65e-02 sec 1.0% 100.0% 0.0% ------ 1 PopulateSparseRowPtrs_Constraints_Transpose [for]
|-> 2.44e-02 sec 0.6% 0.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |-> 2.44e-02 sec 0.6% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.96e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|   |-> 1.96e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|       |-> 1.96e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.78e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Assemble Constraints [region]
|   |-> 1.78e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.61e-02 sec 0.4% 100.0% 0.0% ------ 44 ContributeElementsToSparseMatrix [for]
|   |-> 1.61e-02 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 44 Assemble System [region]
|       |-> 1.61e-02 sec 0.4% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 1.35e-02 sec 0.4% 100.0% 0.0% ------ 1 PopulateSparseIndices [for]
|-> 1.20e-02 sec 0.3% 100.0% 0.0% ------ 44 sort_crs_matrix [for]
|   |-> 1.16e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |   |-> 1.16e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.11e-02 sec 0.3% 100.0% 0.0% ------ 22 KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY_SPREADTEAM [for]
|   |-> 1.11e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|       |-> 1.11e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 7.90e-03 sec 0.2% 100.0% 0.0% ------ 135 Kokkos::ViewFill-1D [for]
|   |-> 5.87e-03 sec 0.2% 100.0% 0.0% 0.0% 0.00e+00 66 Assemble System [region]
|   |   |-> 5.87e-03 sec 0.2% 100.0% 0.0% 0.0% 0.00e+00 66 Step [region]
|-> 6.41e-03 sec 0.2% 100.0% 0.0% ------ 110 KokkosKernels::Common::PrefixSum [scan]
|   |-> 4.15e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Assemble System [region]
|   |   |-> 4.15e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 5.53e-03 sec 0.1% 100.0% 0.0% ------ 66 KokkosSparse::SpAdd::Symbolic::InputSorted::CountEntries [for]
|-> 5.44e-03 sec 0.1% 100.0% 0.0% ------ 46 RotateSectionMatrix [for]
|   |-> 5.44e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 46 Update State [region]
|       |-> 5.20e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 5.05e-03 sec 0.1% 100.0% 0.0% ------ 3300 Kokkos::View::initialization [u_mirror] via memset [for]
|-> 4.30e-03 sec 0.1% 100.0% 0.0% ------ 44 KokkosSparse::StructureC::GPU_EXEC [for]
|-> 3.83e-03 sec 0.1% 100.0% 0.0% ------ 88 KokkosSparse::PredicMaxRowNNZ::STATIC [reduce]

KOKKOS HOST SPACE:
===================
MAX MEMORY ALLOCATED: 84290.8 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/mat_nzvals
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/mat_colind
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/colind
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/nzval_tmp
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/indices_tmp
  12.3% Step/Solve System/Sparse Solver/Symbolic Factorization/host_nzvals_view_
  12.3% Step/Solve System/values_mirror
  6.2% Step/Solve System/Sparse Solver/Symbolic Factorization/host_rows_view_
  6.2% Step/Solve System/columnIndices_mirror
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/pointers_tmp
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/rowptr
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/rowPtrsUnpacked_host_
  0.2% Step/Solve System/x_mirror
  0.2% Step/Solve System/lgMap_mirror
  0.2% Step/Solve System/b_mirror
  0.2% Step/Solve System/Sparse Solver/Symbolic Factorization/lgMap_mirror
  0.1% Step/Solve System/Sparse Solver/Symbolic Factorization/host_col_ptr_view_

KOKKOS CUDA SPACE:
===================
MAX MEMORY ALLOCATED: 795689.6 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  37.3% Step/Assemble System/CopyTangentIntoSparseMatrix/Kokkos::CudaSpace::TeamScratchMemory
  35.0% gradient_matrix
  7.9% Step/Assemble System/pool data
  1.3% Step/Solve System/values
  1.3% Step/Solve System/values
  1.3% matrix_terms
  1.3% K values
  1.3% Step/Assemble System/valuesC
  1.3% Step/Assemble System/values
  0.7% Step/Solve System/entries
  0.6% Step/Solve System/entries
  0.6% indices
  0.6% Step/Assemble System/entriesC
  0.6% Step/Assemble System/entries
  0.5% qp_Mstar
  0.5% qp_Cstar
  0.5% qp_RR0
  0.5% qp_Muu
  0.5% qp_Cuu
  0.5% qp_Ouu
  0.5% qp_Puu
  0.5% qp_Quu
  0.5% qp_Guu
  0.5% qp_Kuu
  0.2% qp_E
  0.2% shape_interp
  0.2% deriv_interp
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% R1_3x3
  0.1% T dense
  0.1% T values

KOKKOS HIP SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS SYCL SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS OpenMPTarget SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

Host process high water mark memory consumption: 760188 kB
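
The change between the two reports above can be quantified with two small helpers. This is only a convenience sketch; the function names are hypothetical and the inputs are the times printed in the reports.

```cpp
#include <cassert>
#include <cmath>

// Fractional reduction going from `before` to `after`
// (for example, 0.25 means `after` is 25% lower than `before`).
double FractionalReduction(double before, double after) {
    assert(before > 0.0);
    return (before - after) / before;
}

// Speedup factor going from `before` to `after` (2.0 means twice as fast).
double Speedup(double before, double after) {
    assert(after > 0.0);
    return before / after;
}
```

Using the reported values, `FractionalReduction(4.11378, 3.82279)` gives roughly 0.07 for total time, `Speedup(0.482, 0.0934)` gives roughly 5.2 for IntegrateStiffnessMatrix, and `Speedup(0.147, 0.0453)` gives roughly 3.2 for IntegrateInertiaMatrix. The overall gain is much smaller than the kernel gains because the Sparse Solver region dominates the runtime in both reports.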