Closed rrsettgast closed 2 years ago
I'm struggling to find good open-source examples of on-GPU assembly with Tpetra. The closest I've seen is https://trilinos.org/docs/dev/packages/tpetra/doc/html/Tpetra_Lesson07.html where they essentially build out CSR structures by hand (rowptr, colind, vals, stored as Kokkos views), and use them to initialize a Tpetra::CrsMatrix
. The SumIntoGlobalValues
interface we are currently using does not seem device-compatible (even with a pre-defined static graph). It'd be nice to see what other LA packages' GPU interfaces look like - in case we might have to change our approach to assembly and our LAI classes as a result.
@hmmorgan @rtmills
Do you have any examples of on-device construction of a matrix for use in petsc?
I did a quick review of on-device assembly in Trilinos, hypre and PETSc. The conclusion is that none of the three packages actually expose __device__
functions for adding/setting individual values/blocks of values. Instead, the common approach seems to be to let the user build arrays of matrix contributions in the target memory space and just consume the device pointers (in a host-callable function or constructor).
For example:
Kokkos::View
s and passes to Tpetra::CrsMatrix
constructor. I have not been able to find examples of assembling into a pre-allocated sparsity pattern on device. Also Slide 13 here suggests that this is indeed their intended approach.MatSetValues
in a host loop. Their existing GPU-based formats (MATAIJCUSPARSE
and MATAIJVIENNACL
) seem to be assembled on host and transferred to the GPU on call to MatAssemblyEnd
. I couldn't find out how to feed it device pointers, but I'm hoping there's a way.Based on this, I think our approach should be assembling into our own data structure (which may be LvArray::CSRMatrix
or something similar) and then pass it into a LAI package. This approach has some nice consequences:
Thoughts?
That sounds pretty reasonable to me, though honestly I'd just be psyched if the LvArray::CRSMatrix
is of use.
So could we do that with our current host approach? Assemble into a CRSMatrix and then pass the pointers off to the LAI? If so that seems a good place to start.
@klevzoff Thanks! Perhaps we are just going to be expected to generate our own CRS structure. So can we develop a task list for each package? Here is a first cut.
Trilinos:
We will have to see if we are able to use our array classes, or if we have to use Kokkos::View
. I imagine we can just make a shallow Kokkos::View
to our data.
Components of Rhs construction:
rhs
locally using number of local rows
. We would just allocate an array1d
and capture it in the kernel launch to fill.
Kokkos::View<double*> forcingTerm ("forcing term", numLclNodes);
Tpetra::Vector
...but this is after creating the Tpetra::CrsMatrix
// Hack to deal with the fact that Tpetra::Vector needs a
// DualView<double**> for now, rather than a View<double*>.
Kokkos::DualView<double**, Kokkos::LayoutLeft> b_lcl ("b", numLclRows, 1);
b_lcl.modify<Kokkos::DualView<double**, Kokkos::LayoutLeft>::t_dev::execution_space> ();
Kokkos::deep_copy (Kokkos::subview (b_lcl.d_view, Kokkos::ALL (), 0), forcingTerm);
Tpetra::Vector<> b (A.getRangeMap (), b_lcl);
I am a little concerned by the deep copy, but I can't tell if this would be required.
Components of Jacobian construction:
numOfLocalRows
rowCounts
, rowOffsets
, numLocalEntries
, colIndices
. I think we can do this at the same time by simply using our native LvArray::CRSMatrix
with our existing approach and just extracting the pointers from LvArray::CRSMatrix
. The example is kind of all over the place and I don't know if it needs to be.numGlobalEntries
.matrixValue
? Again...hope we don't have to use a Kokkos::View
for this and we can just pass the data pointer from LvArray::CRSMatrix
rowMap
and colMap
RCP<const Tpetra::Map<> > rowMap =
rcp (new Tpetra::Map<> (numGblElements, numLclElements, indexBase, comm));
// Flag to tell Tpetra::Map to compute the global number of
// indices in the Map.
const Tpetra::global_size_t INV =
Teuchos::OrdinalTraits
7. Create the `Tpetra::CrsMatrix`
```c++
Tpetra::CrsMatrix<double> A (rowMap, colMap, rowOffsets,
colIndices, matrixValues);
Does this sound reasonable?
Is your feature request related to a problem? Please describe. We currently use the old cpu only interface to trilinos. Instead of
Epetra
we should be usingTpetra
?? And I assume that there is a new set of linear solvers and preconditioners as well.Describe the solution you'd like A perfect transition between the legacy capabilities and the new platform-portable capabilities.
Describe alternatives you've considered Making a time machine and going back to 2005 so we don't have to worry about GPU's.
Additional context I don't want to do this. Many beers will be owed to the brave soul who takes this on.