Closed kilicomu closed 1 year ago
Hi @kilicomu, I have not seen that error and you are the first to report it. Do you see it in 14.0.0 as well? That version is when the update from MAPL 2.6 to 2.18.3 happened.
@tclune, @bena-nasa, have you seen this error before?
Yes - this has been seen before, but it was a long while back. If memory serves it was due to not deleting some MPI objects like user defined MPI types and such. (Was very dependent on the flavor of MPI)
@weiyuan-jiang did the work. @mathomp4 can probably figure out if your MAPL is old enough to experience this issue. Friday is a holiday, so response might be slow.
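If it helps, the old failure mode was roughly this class of bug: a user-defined MPI datatype was created but never freed, and some MPI flavors then tripped over the stale handle during shutdown. A minimal sketch of the pattern and its fix (hypothetical names, not the actual MAPL code):

```fortran
! Hypothetical sketch of the old bug class: a user-defined MPI type
! that is never freed can upset some MPI implementations at shutdown.
program type_free_sketch
   use mpi
   implicit none
   integer :: ierr, newtype

   call MPI_Init(ierr)

   ! Build a derived type (the contents here are illustrative only)
   call MPI_Type_contiguous(4, MPI_DOUBLE_PRECISION, newtype, ierr)
   call MPI_Type_commit(newtype, ierr)

   ! ... use newtype in communication ...

   ! The fix was to release such objects explicitly before finalize:
   call MPI_Type_free(newtype, ierr)

   call MPI_Finalize(ierr)
end program type_free_sketch
```

Whether that is what is happening here would depend on which MPI objects MAPL still holds at the point the crash occurs.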
@lizziel Same seg fault with 14.0.0, see archived run here:
GCHP_14.0.0_MAPL_CAP_SEG_FAULT.tar.gz
No seg fault with 13.4.1, archived run here:
GCHP_13.4.1_MAPL_CAP_SEG_FAULT.tar.gz
So might be something to do with the MAPL version change.
@tclune Thanks - I'll see about different MPI implementations. I think I was the person who reported the issue you are referring to! I've got MAPL @ 77fb1d43, the version linked to GCHP 14.0.1.
As far as I know, there have been no big MPI or profiler fixes in MAPL since 2.18.3. I suppose the usual two things to try are:

1. Updating to a newer MAPL version
2. Trying a different MPI stack

The former should be doable since MAPL 2.30 should be close enough to 2.18 to work. The latter is the hard one since you seem to be on a Cray and, well, Cray MPI is often the one MPI stack on a Cray. Moreover, 8.1.4 is pretty new (8.1.7 is the latest I know of), so it's not like the stack is old.
I can say I've never had much luck getting GEOS to work with MPICH or MVAPICH2...but we run all the time with Intel MPI which is based on MPICH! I might try a GNU+MPICH build of GEOS next week to see.
Yes, sadly only the Cray libraries on this system. I'm just double-checking with my home system that I don't see the same error with Intel MPI / OpenMPI, after which I'll try bumping the MAPL version.
I've done some more testing...
If I swap out OpenFabrics for UCX, the seg fault goes away. I don't think I have access to a debug build of the MPI stack on this system, but I'll see if I can get some more info.
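For anyone else hitting this, the OFI-to-UCX swap usually looks something like the fragment below on HPE Cray EX systems. This is a sketch only: exact module names and versions are site-specific, so check `module avail` on your machine.

```shell
# Hypothetical job-script fragment; exact module names vary by site.
# Swap the Cray network layer and matching MPICH build from OFI to UCX:
module swap craype-network-ofi craype-network-ucx
module swap cray-mpich cray-mpich-ucx
```

A full rebuild of the model against the UCX-backed MPICH is needed after the swap, since the two builds are not binary compatible.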
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.
This isn't bothering me anymore, as I've not seen the issue since changing to UCX. If it comes up again I'll reopen and dig deeper.
What institution are you from?
Wolfson Atmospheric Chemistry Laboratories
Description of the problem
Since switching to GCHP 14.0.1, I'm seeing what look like MAPL CAP seg faults at the end of my runs:
It's coming from here (at least I hope so, and not from the MPI library...) in MAPL's `pfio/DirectoryService.F90`, specifically the call to `MPI_Barrier`. I'm wondering if the communicator has already been cleaned up by the time this subroutine runs?

I'll carry on trying to debug the cause, but if you've seen it before or have any ideas, let me know. Although almost everything the model needs to do gets done, it annoyingly seg faults before the final model throughput comes through in the log, and I'm using that throughput figure for some scaling tests I'm running on a new system. I can use information from the timers that do get printed for the time being, but ideally I'd like to use the final throughput figure.
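In case it's useful context for the "communicator already cleaned up" theory: one defensive pattern (a sketch only, not what MAPL actually does) is to guard late-lifecycle collectives with `MPI_Finalized`, so a barrier issued during teardown after MPI has shut down is skipped rather than executed against dead state:

```fortran
! Hypothetical guard around a shutdown-time barrier; not MAPL's code.
subroutine safe_barrier(comm)
   use mpi
   implicit none
   integer, intent(in) :: comm
   integer :: ierr
   logical :: finalized

   ! Skip the collective if MPI_Finalize has already run on this rank.
   call MPI_Finalized(finalized, ierr)
   if (.not. finalized) then
      call MPI_Barrier(comm, ierr)
   end if
end subroutine safe_barrier
```

Note this only covers the finalize-ordering case; if the communicator handle itself has been freed while MPI is still up, the caller would need to track that separately.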
GEOS-Chem version
14.0.1 (a1be697c)
Description of code modifications
None.
Log files
GCHP_14.0.1_MAPL_CAP_SEG_FAULT.tar.gz
Logs/GCHP_14.0.1_MAPL_CAP_SEG_FAULT_2591050.log
Software versions