Exawind / openturbine

A flexible multibody structural dynamics code for wind turbines
https://exawind.github.io/openturbine/
MIT License

Sparse Solver Implementation (Direct and/or Indirect) #185

Closed (ddement closed 1 month ago)

ddement commented 2 months ago

Task: Create a proof of concept using Trilinos' Amesos2 solver as our direct solver. Demonstrate performance and build complexity.

Why: While the prospect of using direct solvers through Kokkos-Kernels is appealing, significant development work would be needed to make that a reality. Amesos2 already provides an interface to sparse direct solvers that run on GPUs, but the added complexity of depending on Trilinos is unappealing. This work will allow us to make an informed decision about the direction we want to take with our linear solvers.

Done means: OpenTurbine uses the Amesos2 solvers. Performance and memory consumption relative to the dense solvers are documented for the 300-element turbine problem. We have determined whether Amesos2 is our solver of choice or whether we need to investigate other options.

michaelasprague commented 2 months ago

We are trying Amesos2 first because it already exists. Trilinos is anything but a lightweight TPL, so for now we are aiming only for a prototype.

ddement commented 2 months ago

The Amesos2 implementation prototype is complete. Here is the profiling report for the 300 blade problem. The old runtime in the step region (i.e., excluding test setup/teardown) was about 28.1 s using the dense solver. The new runtime is 3.91 s. Memory usage has also been reduced considerably by removing the dense matrix and its working variables.

BEGIN KOKKOS PROFILING REPORT:
TOTAL TIME: 4.1774 seconds
TOP-DOWN TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <remainder> <kernels per second> <number of calls> <name> [type]
===================
|-> 3.91e+00 sec 93.5% 17.9% 0.0% 0.0% 7.35e+02 11 Step [region]
|   |-> 3.02e+00 sec 72.4% 1.8% 0.0% 1.5% 3.42e+02 22 Solve System [region]
|   |   |-> 2.93e+00 sec 70.1% 0.0% 0.0% 100.0% 2.26e+01 22 Sparse Solver [region]
|   |   |-> 3.70e-02 sec 0.9% 100.0% 0.0% ------ 44 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |-> 8.24e-01 sec 19.7% 74.5% 0.0% 25.4% 8.01e+02 22 Assemble System [region]
|   |   |-> 1.25e-01 sec 3.0% 99.9% 0.0% 0.1% 1.76e+02 22 Assemble Mass Matrix [region]
|   |   |   |-> 1.25e-01 sec 3.0% 100.0% 0.0% ------ 22 IntegrateMatrix [for]
|   |   |-> 1.25e-01 sec 3.0% 99.9% 0.0% 0.1% 1.76e+02 22 Assemble Inertial Stiffness Matrix [region]
|   |   |   |-> 1.25e-01 sec 3.0% 100.0% 0.0% ------ 22 IntegrateMatrix [for]
|   |   |-> 1.25e-01 sec 3.0% 99.9% 0.0% 0.1% 1.76e+02 22 Assemble Gyroscopic Inertia Matrix [region]
|   |   |   |-> 1.25e-01 sec 3.0% 100.0% 0.0% ------ 22 IntegrateMatrix [for]
|   |   |-> 1.21e-01 sec 2.9% 99.9% 0.0% 0.1% 1.82e+02 22 Assemble Elastic Stiffness Matrix [region]
|   |   |   |-> 1.21e-01 sec 2.9% 100.0% 0.0% ------ 22 IntegrateElasticStiffnessMatrix [for]
|   |   |-> 2.82e-02 sec 0.7% 100.0% 0.0% ------ 22 KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY_SPREADTEAM [for]
|   |   |-> 2.50e-02 sec 0.6% 100.0% 0.0% ------ 66 CopyIntoSparseMatrix [for]
|   |   |-> 2.45e-02 sec 0.6% 100.0% 0.0% ------ 22 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |   |-> 1.15e-02 sec 0.3% 100.0% 0.0% ------ 22 sort_crs_matrix [for]
|   |   |-> 6.87e-03 sec 0.2% 100.0% 0.0% ------ 22 KokkosSparse::StructureC::GPU_EXEC [for]
|   |   |-> 6.53e-03 sec 0.2% 100.0% 0.0% ------ 66 Kokkos::ViewFill-1D [for]
|   |-> 3.57e-02 sec 0.9% 42.3% 0.0% 57.7% 1.42e+04 22 Assemble Constraints [region]
|   |-> 1.87e-02 sec 0.4% 91.6% 0.0% 8.4% 2.95e+04 22 Update State [region]
|   |   |-> 5.15e-03 sec 0.1% 100.0% 0.0% ------ 44 RotateSectionMatrix [for]
|-> 3.65e-02 sec 0.9% 100.0% 0.0% ------ 1 PopulateSparseRowPtrs_Constraints_Transpose [for]
|-> 1.35e-02 sec 0.3% 100.0% 0.0% ------ 1 PopulateSparseIndices [for]
|-> 4.59e-03 sec 0.1% 100.0% 0.0% ------ 3300 Kokkos::View::initialization [u_mirror] via memset [for]

BOTTOM-UP TIME TREE:
<average time> <percent of total time> <percent time in Kokkos> <percent MPI imbalance> <number of calls> <name> [type]
===================
|-> 2.93e+00 sec 70.0% 0.0% 0.0% 0.0% 0.00e+00 22 Sparse Solver [region]
|   |-> 2.93e+00 sec 70.0% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|       |-> 2.93e+00 sec 70.0% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 3.75e-01 sec 9.0% 100.0% 0.0% ------ 66 IntegrateMatrix [for]
|   |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Mass Matrix [region]
|   |   |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |       |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|   |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Inertial Stiffness Matrix [region]
|   |   |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |       |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|   |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Gyroscopic Inertia Matrix [region]
|       |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|           |-> 1.25e-01 sec 3.0% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 2.09e-01 sec 5.0% 0.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |-> 2.09e-01 sec 5.0% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.21e-01 sec 2.9% 100.0% 0.0% ------ 22 IntegrateElasticStiffnessMatrix [for]
|   |-> 1.21e-01 sec 2.9% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble Elastic Stiffness Matrix [region]
|       |-> 1.21e-01 sec 2.9% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|           |-> 1.21e-01 sec 2.9% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 6.15e-02 sec 1.5% 100.0% 0.0% ------ 66 KokkosSparse::SpAdd:Numeric::InputSorted [for]
|   |-> 3.70e-02 sec 0.9% 100.0% 0.0% 0.0% 0.00e+00 44 Solve System [region]
|   |   |-> 3.70e-02 sec 0.9% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|   |-> 2.45e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|       |-> 2.45e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 4.51e-02 sec 1.1% 0.0% 0.0% 0.0% 0.00e+00 22 Solve System [region]
|   |-> 4.51e-02 sec 1.1% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 3.65e-02 sec 0.9% 100.0% 0.0% ------ 1 PopulateSparseRowPtrs_Constraints_Transpose [for]
|-> 2.92e-02 sec 0.7% 100.0% 0.0% ------ 44 KOKKOSPARSE::SPGEMM::SPGEMM_KK_MEMORY_SPREADTEAM [for]
|   |-> 2.82e-02 sec 0.7% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |   |-> 2.82e-02 sec 0.7% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 2.55e-02 sec 0.6% 100.0% 0.0% ------ 88 CopyIntoSparseMatrix [for]
|   |-> 2.50e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 66 Assemble System [region]
|   |   |-> 2.50e-02 sec 0.6% 100.0% 0.0% 0.0% 0.00e+00 66 Step [region]
|-> 2.06e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Assemble Constraints [region]
|   |-> 2.06e-02 sec 0.5% 0.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 1.35e-02 sec 0.3% 100.0% 0.0% ------ 1 PopulateSparseIndices [for]
|-> 1.30e-02 sec 0.3% 100.0% 0.0% ------ 44 sort_crs_matrix [for]
|   |-> 1.15e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |   |-> 1.15e-02 sec 0.3% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 8.82e-03 sec 0.2% 100.0% 0.0% ------ 135 Kokkos::ViewFill-1D [for]
|   |-> 6.53e-03 sec 0.2% 100.0% 0.0% 0.0% 0.00e+00 66 Assemble System [region]
|   |   |-> 6.53e-03 sec 0.2% 100.0% 0.0% 0.0% 0.00e+00 66 Step [region]
|-> 7.42e-03 sec 0.2% 100.0% 0.0% ------ 44 KokkosSparse::StructureC::GPU_EXEC [for]
|   |-> 6.87e-03 sec 0.2% 100.0% 0.0% 0.0% 0.00e+00 22 Assemble System [region]
|   |   |-> 6.87e-03 sec 0.2% 100.0% 0.0% 0.0% 0.00e+00 22 Step [region]
|-> 5.56e-03 sec 0.1% 100.0% 0.0% ------ 66 KokkosSparse::SpAdd::Symbolic::InputSorted::CountEntries [for]
|-> 5.38e-03 sec 0.1% 100.0% 0.0% ------ 46 RotateSectionMatrix [for]
|   |-> 5.38e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 46 Update State [region]
|       |-> 5.15e-03 sec 0.1% 100.0% 0.0% 0.0% 0.00e+00 44 Step [region]
|-> 5.14e-03 sec 0.1% 100.0% 0.0% ------ 110 KokkosKernels::Common::PrefixSum [scan]
|-> 4.96e-03 sec 0.1% 100.0% 0.0% ------ 44 KokkosSparse::SingleStepZipMatrix::GPUEXEC [for]
|-> 4.59e-03 sec 0.1% 100.0% 0.0% ------ 3300 Kokkos::View::initialization [u_mirror] via memset [for]

KOKKOS HOST SPACE:
===================
MAX MEMORY ALLOCATED: 91040.8 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  12.3% Step/Solve System/Sparse Solver/mat_nzvals
  12.3% Step/Solve System/Sparse Solver/mat_colind
  12.3% Step/Solve System/Sparse Solver/colind
  12.3% Step/Solve System/Sparse Solver/nzval_tmp
  12.3% Step/Solve System/Sparse Solver/indices_tmp
  12.3% Step/Solve System/Sparse Solver/host_nzvals_view_
  12.3% Step/Solve System/values_mirror
  6.2% Step/Solve System/Sparse Solver/host_rows_view_
  6.2% Step/Solve System/columnIndices_mirror
  0.2% Step/Solve System/Sparse Solver/rowptr
  0.2% Step/Solve System/Sparse Solver/rowPtrsUnpacked_host_
  0.2% Step/Solve System/Sparse Solver/pointers_tmp
  0.2% Step/Solve System/Sparse Solver/lgMap_mirror
  0.2% Step/Solve System/x_mirror
  0.2% Step/Solve System/lgMap_mirror
  0.2% Step/Solve System/b_mirror

KOKKOS CUDA SPACE:
===================
MAX MEMORY ALLOCATED: 3877266.9 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:
  79.0% K dense
  7.7% Step/Assemble System/CopyIntoSparseMatrix/Kokkos::CudaSpace::TeamScratchMemory
  7.2% gradient_matrix
  1.8% Step/Assemble System/pool data
  0.3% Step/Solve System/values
  0.3% Step/Solve System/values
  0.3% K values
  0.3% T values
  0.3% Step/Assemble System/valuesC
  0.3% Step/Assemble System/values
  0.1% Step/Solve System/entries
  0.1% Step/Solve System/entries
  0.1% indices
  0.1% Step/Assemble System/entriesC
  0.1% Step/Assemble System/entries
  0.1% Step/Assemble System/set_entries_
  0.1% Step/Assemble System/set_indices_
  0.1% qp_Mstar
  0.1% qp_Cstar
  0.1% qp_RR0
  0.1% qp_Muu
  0.1% qp_Cuu
  0.1% qp_Ouu
  0.1% qp_Puu
  0.1% qp_Quu
  0.1% qp_Guu
  0.1% qp_Kuu

KOKKOS HIP SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS SYCL SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

KOKKOS OpenMPTarget SPACE:
===================
MAX MEMORY ALLOCATED: 0.0 kB
ALLOCATIONS AT TIME OF HIGH WATER MARK:

Host process high water mark memory consumption: 764280 kB

END KOKKOS PROFILING REPORT.
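
As a rough reading aid for the report above, the following Python sketch parses three rows copied verbatim from the top-down time tree and computes how much of the Step region is spent in the Sparse Solver region. The regular expression and the `parse_row` helper are illustrative assumptions; they are not part of Kokkos Tools or OpenTurbine, and they were written only against the row layout seen in this report.

```python
import re

# Three rows copied from the top-down time tree in the report above.
ROWS = """\
|-> 3.91e+00 sec 93.5% 17.9% 0.0% 0.0% 7.35e+02 11 Step [region]
|   |-> 3.02e+00 sec 72.4% 1.8% 0.0% 1.5% 3.42e+02 22 Solve System [region]
|   |   |-> 2.93e+00 sec 70.1% 0.0% 0.0% 100.0% 2.26e+01 22 Sparse Solver [region]
"""

# Hypothetical pattern: time, percent of total, then the call count
# (the last bare integer before the name), the name, and the [type] tag.
ROW = re.compile(
    r"\|->\s+(?P<seconds>[\d.e+-]+) sec\s+(?P<percent>[\d.]+)%"
    r".*?\s(?P<calls>\d+)\s+(?P<name>\S.*?)\s+\[(?P<kind>\w+)\]\s*$"
)


def parse_row(line):
    """Return a dict for one tree row, or None if the line is not a row."""
    match = ROW.search(line)
    if match is None:
        return None
    return {
        "depth": line.index("|->") // 4,  # each nesting level adds 4 characters
        "seconds": float(match["seconds"]),
        "percent_of_total": float(match["percent"]),
        "calls": int(match["calls"]),
        "name": match["name"],
        "kind": match["kind"],
    }


rows = {r["name"]: r for r in map(parse_row, ROWS.splitlines()) if r}
solver_share = rows["Sparse Solver"]["seconds"] / rows["Step"]["seconds"]
print(f"Sparse Solver share of Step time: {solver_share:.1%}")
```

With the figures in this report, the sparse solve accounts for roughly three quarters of the step time, so the remaining assembly kernels are a comparatively small part of the runtime.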

pscrozi commented 2 months ago

David has the prototype done, with the profile shown above. The step time went from about 28 s before to about 3.9 s now. Amazing speedup! Memory usage was reduced by 66%.
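
The speedup follows from the two step-region timings reported earlier in the thread (about 28.1 s with the dense solver, 3.91 s with the Amesos2 prototype). A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Step-region timings as reported earlier in this thread.
dense_step_seconds = 28.1   # dense solver
sparse_step_seconds = 3.91  # Amesos2 prototype

speedup = dense_step_seconds / sparse_step_seconds
time_reduction = 1.0 - sparse_step_seconds / dense_step_seconds

print(f"speedup: {speedup:.1f}x")               # speedup: 7.2x
print(f"time reduction: {time_reduction:.0%}")  # time reduction: 86%
```

The memory figure cannot be checked the same way from this thread alone, since the report above only shows the high water marks for the new run.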

The prototype is now up as a draft PR for anyone else to review, and it is already in a pretty good place. David is working on getting CI to work with it. Trilinos is easy to install using spack, but it takes a bit more work to get it working in our CI environment. This is in progress and should be done soon.

We reached out to the Kokkos Kernels team. They seem receptive to working with us to implement a solver through Kokkos Kernels, but they started by recommending Amesos2, and we haven't heard back from them since. It might be easiest to use what we've already got, since it seems to work fine.

Setting up Trilinos is a one-line spack install. David can supply others with a simple build script so that they can try it out.