entity-toolkit / entity

New generation astrophysical plasma simulation code with CPU/GPU portability
https://entity-toolkit.github.io/wiki/

Potential issue in MPI comm for curvilinear SRPIC #58

Closed haykh closed 4 months ago

LudwigBoess commented 4 months ago

Maybe related, I also encounter an MPI error in cartesian/minkowski SRPIC with the wip/shock setup:

PMPI_Allgather(1000): MPI_Allgather(sbuf=0x490bc70, scount=1, MPI_FLOAT, rbuf=0x490bc70, rcount=1, MPI_FLOAT, MPI_COMM_WORLD) failed
PMPI_Allgather(945).: Buffers must not be aliased
[unset]: write_line error; fd=-1 buf=:cmd=abort exitcode=201926145
:
system msg for write_line failure : Bad file descriptor
Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
srun: error: midway3-0278: task 0: Exited with exit code 1
haykh commented 4 months ago

Likely unrelated: the problem in this issue is simply incorrect communication that produces no errors.

@LudwigBoess is this a runtime or a compile-time error? If runtime, could you post the command (or submit script) you use to run? If compile-time, which MPI implementation are you using?

CUDA with MPI is a bit of a headache to configure at first on a new machine, especially since different clusters define different environment variables.

haykh commented 4 months ago

Culprit identified as a potential race condition in src/kernels/injectors.hpp, in kernels::NonUniformInjector_kernel::operator(). Switching from Kokkos::atomic_fetch_add(&idx(), ppc) to Kokkos::atomic_fetch_add(&idx(), 1) solved the issue.
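
For readers unfamiliar with the pattern, here is a minimal sketch of reserving output slots with an atomic fetch-add. It is not the actual Entity kernel: std::atomic stands in for Kokkos::atomic_fetch_add, the names inject, InjectResult and reserve_per_call are hypothetical, and it assumes each call writes exactly one particle. The calls run in a serial loop to keep the example portable, whereas in a GPU kernel they would run concurrently. It only illustrates why the increment should match the number of slots a call actually fills.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical result type: the final counter value and the number of slots
// that actually received a particle.
struct InjectResult {
  std::size_t counter;
  std::size_t filled;
};

// Hypothetical stand-in for an injector kernel launched over `n_calls` threads.
// Each call reserves `reserve_per_call` slots (must be >= 1) but, by
// assumption, writes only one particle.
InjectResult inject(std::size_t n_calls, std::size_t reserve_per_call) {
  std::vector<int> slots(n_calls * reserve_per_call, 0);
  std::atomic<std::size_t> idx{0};

  for (std::size_t call = 0; call < n_calls; ++call) {
    // fetch_add returns the value held before the increment, so `first`
    // is unique to this call even if calls were to run concurrently.
    const std::size_t first = idx.fetch_add(reserve_per_call);
    slots[first] = 1;  // assumption: one particle written per call
  }

  std::size_t filled = 0;
  for (int s : slots) {
    filled += static_cast<std::size_t>(s);
  }
  return {idx.load(), filled};
}
```

Under these assumptions, inject(8, 1) ends with a counter of 8 and 8 filled slots, while inject(8, 4) ends with a counter of 32 but still only 8 filled slots. Any code that treats the counter as the particle count would then read 24 entries that were never written.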