We have observed some issues with mfem::Memory aliasing when running on GPUs, which surface as error messages of the form:
```
CUDA error: (cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)) failed with error:
--> misaligned address
... in function: void* mfem::CuMemcpyHtoD(void*, const void*, size_t)
... in file: /data/home/hughcars/palace/build_g5_48xlarge/extern/mfem/general/cuda.cpp:116
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
```
One place this aliasing is used is mfem::ComplexGridFunction and mfem::ParComplexGridFunction. This PR replaces the aliases with two separate ParGridFunction objects for the real and imaginary parts, which resolves the errors in our testing. I'm not certain whether this is an error in MFEM's MemoryManager or just our use of it without adequate synchronization, but for now, keeping these objects stored separately seems to resolve the problems.
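For context, here is a minimal sketch of the storage change, using plain `std::vector` rather than MFEM types (the struct names and layout here are hypothetical, purely to illustrate the two patterns): the old layout stored both parts in one allocation with the imaginary part as an offset alias, while the new layout gives each part its own allocation.

```cpp
#include <cstddef>
#include <vector>

// Old pattern (sketch): real and imaginary parts are two aliased views into
// one contiguous buffer, so the imaginary part lives at a nonzero offset that
// the device memory manager has to track across host/device transfers.
struct AliasedComplex {
    std::vector<double> data;  // [re..., im...] in a single allocation
    explicit AliasedComplex(std::size_t n) : data(2 * n) {}
    double* real() { return data.data(); }
    double* imag() { return data.data() + data.size() / 2; }  // offset alias
};

// New pattern (sketch): two independent allocations, so each part has its own
// base pointer and no aliasing bookkeeping is needed on the GPU side.
struct SeparateComplex {
    std::vector<double> re, im;  // independent allocations
    explicit SeparateComplex(std::size_t n) : re(n), im(n) {}
    double* real() { return re.data(); }
    double* imag() { return im.data(); }
};
```

In the aliased layout, `imag()` is always `real() + n`, which is exactly the kind of offset view that seems to interact badly with the GPU memory manager here; in the separate layout the two parts are unrelated allocations.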
NOTE: This is on top of https://github.com/awslabs/palace/pull/193 and resolves the observed instances in testing for GPU support. https://github.com/awslabs/palace/pull/194 to follow this PR.