Closed by LiamBindle 4 years ago
Thanks @LiamBindle. Do you know if using mpiuni sets MPI to automatically use one core per node? Does this implementation exist for some specific purpose?
@lizziel According to this, mpiuni is "a single-processor MPI-bypass library". I don't really understand what that means, but it seems to fit with what I was seeing (each process thinking it was root).
Actually, last night I noticed ESMF_COMM wasn't "intelmpi", and I thought it should be, so I rebuilt ESMF and it fixed my problem. I tried to generalize this issue for the purpose of the issue archive. This issue is really "ESMF was built with the wrong ESMF_COMM" and isn't because of an ESMF + Intel MPI compatibility problem.
Here is the part of spack that sets ESMF_COMM. It looks like ESMF_COMM=intelmpi iff +mpi and ^intel-parallel-studio+mpi are both in the spack install spec. Our compute1 sysadmin was having trouble getting spack to concretize this spec though, and ultimately opted to build ESMF manually rather than with spack. I suspect someone familiar with spack could advise on how to do this properly, but rebuilding ESMF manually was easy enough.
Hi everyone,
I'm just submitting this for the archive of issues on GitHub.
Relevant Information
ESMF_COMM=mpiuni
What happened
Yesterday I tried running the default 6-core 1-node 1-hour GCHP simulation and it crashed almost immediately. This happened with GCHP_CTM 13.0.0-alpha.1, but it could happen with any version that uses MAPL 2.0+. Below is the full output. The important parts to pick out are:
Failed run output:
The Problem
The issue was that ESMF was built with ESMF_COMM=mpiuni. This appears to have happened because the spack install spec wasn't quite right, but I didn't build ESMF myself so I can't be sure.
How do I check which ESMF_COMM my ESMF was built with?
The build-time value of ESMF_COMM is written to esmf.mk, beside your ESMF libraries. You can see it by searching that file for ESMF_COMM.
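As a rough sketch of that check, here is a small Python helper that pulls the ESMF_COMM value out of the contents of esmf.mk. The function names are made up for illustration, and the line layout is an assumption: it expects the setting to be recorded either as a comment line like "# ESMF_COMM: mpiuni" or as a plain "ESMF_COMM=mpiuni" assignment. If your esmf.mk is laid out differently, adjust the pattern.

```python
import re
from pathlib import Path

# Assumed esmf.mk layout: the build setting appears on its own line, either as
# a comment ("# ESMF_COMM: mpiuni") or as an assignment ("ESMF_COMM=mpiuni").
_COMM_LINE = re.compile(
    r"^[ \t]*#?[ \t]*ESMF_COMM[ \t]*[:=][ \t]*(\S+)", re.MULTILINE
)


def esmf_comm_from_text(text):
    """Return the ESMF_COMM value found in esmf.mk text, or None if absent."""
    match = _COMM_LINE.search(text)
    return match.group(1) if match else None


def esmf_comm_from_file(path):
    """Read an esmf.mk file and return the ESMF_COMM value it records."""
    return esmf_comm_from_text(Path(path).read_text())


if __name__ == "__main__":
    # Made-up excerpt for demonstration only; not copied from a real build.
    sample = "# ESMF_COMPILER: intel\n# ESMF_COMM: mpiuni\n"
    print(esmf_comm_from_text(sample))
```

To check a real install, call esmf_comm_from_file with the path to your own esmf.mk. If it returns mpiuni, ESMF was built with the single-processor bypass library rather than a real MPI.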
Solution
Rebuild ESMF and make sure ESMF_COMM is set to the appropriate MPI flavor.
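For a manual rebuild, one way to avoid repeating the mistake is to set the build variables explicitly and refuse mpiuni up front. The sketch below only prepares the environment; it does not run the build. The helper name and the example paths are hypothetical, and it assumes the usual ESMF build variables ESMF_DIR, ESMF_COMPILER and ESMF_COMM are the ones your build reads.

```python
import os


def esmf_build_env(esmf_dir, comm, compiler, base_env=None):
    """Return a copy of the environment with the ESMF build variables set.

    Raises ValueError if comm is "mpiuni", since that is the single-processor
    MPI-bypass library and not a real MPI flavor.
    """
    if comm == "mpiuni":
        raise ValueError("ESMF_COMM=mpiuni bypasses MPI; pick a real MPI flavor")
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {"ESMF_DIR": esmf_dir, "ESMF_COMPILER": compiler, "ESMF_COMM": comm}
    )
    return env


if __name__ == "__main__":
    # Placeholder source path; replace it with your own ESMF checkout.
    env = esmf_build_env("/path/to/esmf", "intelmpi", "intel")
    for name in ("ESMF_DIR", "ESMF_COMPILER", "ESMF_COMM"):
        print(f"export {name}={env[name]}")
```

The returned dictionary can be passed as the env argument to subprocess.run when invoking make in the ESMF source directory, or the printed export lines can be pasted into a shell before building.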