geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

[BUG/ISSUE] MAPL CAP seg fault at the end of 14.0.1 runs #266

Closed kilicomu closed 1 year ago

kilicomu commented 1 year ago

What institution are you from?

Wolfson Atmospheric Chemistry Laboratories

Description of the problem

Since switching to GCHP 14.0.1, I'm seeing what look like MAPL CAP seg faults at the end of my runs:

#0  0x2b5da01ce49f in ???
#1  0x2b5d9e9d7f57 in ???
#2  0x2b5d9e5572cf in ???
#3  0x2b5d9e21619e in ???
#4  0x2b5d9e2168fd in ???
#5  0x2b5d9e217ce7 in ???
#6  0x2b5d9c833480 in ???
#7  0x2b5d9e21c14f in ???
#8  0x2b5d9e3b32e2 in ???
#9  0x2b5d9c65c90f in ???
#10  0x2b5d9c65ce4e in ???
#11  0x2b5d9c1fca5a in ???
#12  0x3df12ce in __pfio_directoryservicemod_MOD_free_directory_resources
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/MAPL/pfio/DirectoryService.F90:586
#13  0x3b05c8c in __mapl_servermanager_MOD_finalize
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/MAPL/base/ServerManager.F90:296
#14  0x1ca2188 in __mapl_capmod_MOD_finalize_io_clients_servers
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:175
#15  0x1ca2355 in __mapl_capmod_MOD_run_ensemble
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:153
#16  0x1ca2400 in __mapl_capmod_MOD_run
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/MAPL/gridcomps/Cap/MAPL_Cap.F90:134
#17  0x4201b4 in gchpctm_main
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/GCHPctm.F90:31
#18  0x420282 in main
        at /mnt/lustre/a2fs-work2/work/n02/n02/klcm500/GCHP/RUNDIRS/DEBUG/MAPL_CAP_SEG_FAULT/CodeDir/src/GCHPctm.F90:14

It's coming from here (at least I hope so, and not from the MPI library...) in MAPL's pfio/DirectoryService.F90:

   subroutine free_directory_resources(this, rc)
      class (DirectoryService), intent(inout) :: this
      integer, optional, intent(out) :: rc
      type (Directory), pointer :: dir
      integer :: ierror
      ! Release resources

      call MPI_Barrier(this%comm, ierror)

      call this%mutex%free_mpi_resources()

      call MPI_Win_free(this%win_server_directory, ierror)
      call MPI_Win_free(this%win_client_directory, ierror)

      if (this%rank == 0) then
         call c_f_pointer(this%server_dir, dir)
         call MPI_Free_mem(dir, ierror)
         call c_f_pointer(this%client_dir, dir)
         call MPI_Free_mem(dir, ierror)
      end if

      call Mpi_Comm_free(this%comm, ierror)
      _RETURN(_SUCCESS)
   end subroutine free_directory_resources

specifically the call to MPI_Barrier. I'm wondering whether the communicator has already been cleaned up by the time this subroutine runs.
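To test that hypothesis, I could guard the barrier with something like the sketch below (this is a debugging snippet of mine, not MAPL code; it assumes the MPI interface the module already uses provides MPI_Finalized, MPI_COMM_NULL and MPI_SUCCESS, and already_finalized is a new local that would go with the subroutine's other declarations):

      logical :: already_finalized   ! add alongside the existing declarations

      ! Skip the barrier if MPI has already shut down or the communicator is gone
      call MPI_Finalized(already_finalized, ierror)
      if (already_finalized .or. this%comm == MPI_COMM_NULL) then
         print *, 'free_directory_resources: MPI/comm already released, skipping barrier'
      else
         call MPI_Barrier(this%comm, ierror)
         if (ierror /= MPI_SUCCESS) print *, 'MPI_Barrier returned ierror = ', ierror
      end if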

I'll carry on trying to debug the cause, but if you've seen this before or have any ideas, let me know. Although almost everything the model needs to do gets done, it annoyingly seg faults before the final model throughput is written to the log, and I need that throughput figure for some scaling tests I'm running on a new system. For the time being I can use information from the timers that do get printed, but ideally I'd like the final throughput figure.

GEOS-Chem version

14.0.1 (a1be697c)

Description of code modifications

None.

Log files

GCHP_14.0.1_MAPL_CAP_SEG_FAULT.tar.gz

Software versions

lizziel commented 1 year ago

Hi @kilicomu, I have not seen that error and you are the first to report it. Do you see it in 14.0.0 as well? That version is when the update from MAPL 2.6 to 2.18.3 happened.

@tclune, @bena-nasa, have you seen this error before?

tclune commented 1 year ago

Yes - this has been seen before, but it was a long while back. If memory serves, it was due to not deleting some MPI objects, such as user-defined MPI types. (It was very dependent on the flavor of MPI.)
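To illustrate the kind of leak I mean (a standalone sketch with made-up names, not the actual code that was fixed):

   program leaked_type_demo
      use mpi
      implicit none
      integer :: row_type, ierror

      call MPI_Init(ierror)
      call MPI_Type_contiguous(10, MPI_DOUBLE_PRECISION, row_type, ierror)
      call MPI_Type_commit(row_type, ierror)
      ! ... communication using row_type ...
      ! Omitting the next call leaves the type live at shutdown, which some MPI
      ! flavors tolerate and others do not.
      call MPI_Type_free(row_type, ierror)
      call MPI_Finalize(ierror)
   end program leaked_type_demo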

@weiyuan-jiang did the work. @mathomp4 can probably figure out if your MAPL is old enough to experience this issue. Friday is a holiday, so response might be slow.

kilicomu commented 1 year ago

@lizziel Same seg fault with 14.0.0, see archived run here:

GCHP_14.0.0_MAPL_CAP_SEG_FAULT.tar.gz

No seg fault with 13.4.1, archived run here:

GCHP_13.4.1_MAPL_CAP_SEG_FAULT.tar.gz

So it might be something to do with the MAPL version change.

@tclune Thanks - I'll see about different MPI implementations. I think I was the person who reported the issue you are referring to! I've got MAPL @ 77fb1d43, the version linked to GCHP 14.0.1.

mathomp4 commented 1 year ago

As far as I know, there have been no big MPI or profiler fixes in MAPL since 2.18.3. I suppose the usual two things to try are:

  1. Update to latest MAPL
  2. Change MPI stack

The former should be doable since MAPL 2.30 should be close enough to 2.18 to work. The latter is the hard one since you seem to be on a Cray and, well, Cray MPI is often the only MPI stack available on a Cray. Moreover, 8.1.4 is pretty new (8.1.7 is the latest I know of), so it's not as if the stack is old.

I can say I've never had much luck getting GEOS to work with MPICH or MVAPICH2...but we run all the time with Intel MPI which is based on MPICH! I might try a GNU+MPICH build of GEOS next week to see.

kilicomu commented 1 year ago

Yes, sadly only the Cray libraries on this system. I'm just double-checking with my home system that I don't see the same error with Intel MPI / OpenMPI, after which I'll try bumping the MAPL version.

kilicomu commented 1 year ago

I've done some more testing...

If I swap out OpenFabrics for UCX, the seg fault goes away. I don't think I have access to a debug build of the MPI stack on this system, but I'll see if I can get some more info.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

kilicomu commented 1 year ago

This isn't bothering me anymore as I've not seen the issue since changing to UCX. If it comes up again I'll reopen and dig deeper.