awslabs / palace

3D finite element solver for computational electromagnetics
https://awslabs.github.io/palace/dev
Apache License 2.0

Add `palace::GridFunction` to unify `mfem::ParGridFunction` and `mfem::ParComplexGridFunction` #204

Closed sebastiangrimberg closed 4 months ago

sebastiangrimberg commented 4 months ago

We have observed some issues with mfem::Memory aliasing when running on GPUs, which surface as error messages of the form:

CUDA error: (cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)) failed with error:
 --> misaligned address
 ... in function: void* mfem::CuMemcpyHtoD(void*, const void*, size_t)
 ... in file: /data/home/hughcars/palace/build_g5_48xlarge/extern/mfem/general/cuda.cpp:116
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

One place this aliasing is used is in mfem::ComplexGridFunction and mfem::ParComplexGridFunction. This PR replaces the aliases with two separate ParGridFunction objects for the real and imaginary parts, which resolves the errors in our testing. I'm not certain whether this is a bug in MFEM's MemoryManager or just our use of it without adequate synchronization, but for now, keeping these objects stored separately seems to resolve the problems.
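
To illustrate the storage change described above, here is a minimal sketch in plain C++. It does not use MFEM and is not the actual palace::GridFunction implementation; RealField and SplitComplexField are hypothetical names invented for this example. The sketch only shows the layout idea: the real and imaginary parts each own their own buffer instead of being two views into a single buffer of length 2n.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a real-valued grid function (not an MFEM or
// Palace type). Each instance owns its own contiguous storage.
struct RealField
{
  std::vector<double> data;
  explicit RealField(std::size_t n) : data(n, 0.0) {}
};

// Hypothetical complex-valued field that stores its real and imaginary parts
// as two independent objects, rather than as two aliases into one shared
// buffer. Nothing here models host/device synchronization.
class SplitComplexField
{
public:
  explicit SplitComplexField(std::size_t n) : real_(n), imag_(n) {}

  RealField &Real() { return real_; }
  RealField &Imag() { return imag_; }
  const RealField &Real() const { return real_; }
  const RealField &Imag() const { return imag_; }

  // Number of complex entries.
  std::size_t Size() const { return real_.data.size(); }

  // Assemble the i-th complex value from the two separate buffers.
  std::complex<double> operator()(std::size_t i) const
  {
    assert(i < Size());
    return {real_.data[i], imag_.data[i]};
  }

private:
  RealField real_;  // owns the real part
  RealField imag_;  // owns the imaginary part
};
```

With this layout, code that previously accessed the real and imaginary parts through aliases can go through Real() and Imag() instead, and each part can be moved between memory spaces without depending on the state of a shared parent buffer.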

NOTE: This builds on https://github.com/awslabs/palace/pull/193 and resolves the instances observed while testing GPU support. https://github.com/awslabs/palace/pull/194 will follow this PR.