esmf-org / esmf

The Earth System Modeling Framework (ESMF) is a suite of software tools for developing high-performance, multi-component Earth science modeling applications.
https://earthsystemmodeling.org/

Possible missing MPI_Type_free in ESMCI_VMKernel? #209

Open mathomp4 opened 6 months ago

mathomp4 commented 6 months ago

This is a big long shot in the dark. @climbfuji and I are trying to get GEOS to work with Spack, namely the JCSDA spack-stack. In the tests by @climbfuji with spack-stack, he kept getting crashes at the end of execution of GEOSgcm (and even smaller, more boring programs, but ones that did link to MAPL and thus ESMF).

So I started with mothership spack, and my first test showed all was well. But he reminded me that spack-stack builds ESMF as static-only, no shared. So I built GEOS against a static-only ESMF and, yup, crashes on program exit. Turning on all the debugging flags in GEOS and MAPL didn't help too much, but I did get this out:

double free or corruption (fasttop)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x14d40515cdbf in ???
#1  0x14d40515cd2b in ???
#2  0x14d40515e3e4 in ???
#3  0x14d4051a2c26 in ???
#4  0x14d4051aacc9 in ???
#5  0x14d4051ac8a3 in ???
#6  0x14d415265e9b in _ZNSt15__new_allocatorIP15ompi_datatype_tE10deallocateEPS1_m
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/new_allocator.h:158
#7  0x14d415263683 in _ZNSt16allocator_traitsISaIP15ompi_datatype_tEE10deallocateERS2_PS1_m
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/alloc_traits.h:496
#8  0x14d415260139 in _ZNSt12_Vector_baseIP15ompi_datatype_tSaIS1_EE13_M_deallocateEPS1_m
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/stl_vector.h:387
#9  0x14d41525cd69 in _ZNSt12_Vector_baseIP15ompi_datatype_tSaIS1_EED2Ev
        at /gpfsm/dulocal15/sles15/other/gcc/12.3.0/include/c++/12.3.0/bits/stl_vector.h:366
#10  0x14d41526bd28 in ???
#11  0x14d4051601bd in ???
#12  0x14d411a28a76 in ???
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node borgm001 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Now, not much traceback, but it does seem to point to MPI type-ish stuff? Maybe? Honestly, I'm reaching here.
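
For anyone reading along, the mangled names in frames #6 through #9 can be decoded with the compiler's own demangler. Below is a minimal sketch, assuming GCC or Clang (both ship `<cxxabi.h>`). The helper name `demangle` is hypothetical and not part of ESMF or MAPL.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>  // abi::__cxa_demangle, available with GCC and Clang
#include <string>

// Demangle one Itanium C++ ABI symbol name.
// Returns the input unchanged if it is not a valid mangled name.
std::string demangle(const std::string& mangled) {
    int status = 0;
    char* out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0) {
        return mangled;  // e.g. the "???" frames in the backtrace
    }
    assert(out != nullptr);  // status 0 means a malloc'd buffer was returned
    std::string result(out);
    std::free(out);  // the caller owns the buffer returned by __cxa_demangle
    return result;
}
```

Passing the symbol from frame #9 to this helper yields the destructor of a `std::_Vector_base` whose elements are `ompi_datatype_t*`, which is the type Open MPI uses for `MPI_Datatype`. So the abort happens while a `std::vector` of MPI datatypes is being torn down.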

So I grepped both ESMF and MAPL, and there are many types around, but one thing I saw was that in ESMCI_VMKernel.C you have:

https://github.com/esmf-org/esmf/blob/609c81179572747407779492c43776e34495d267/src/Infrastructure/VM/src/ESMCI_VMKernel.C#L730-L731

and I don't see a corresponding MPI_Type_free for customType.
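
For reference, the pairing in question (every committed datatype gets exactly one `MPI_Type_free`) can be expressed as a small RAII wrapper. This is only a sketch under stated assumptions and is not ESMF code: `FakeType`, `kNullType`, `fake_commit`, `fake_free`, and `ScopedType` are made-up stand-ins so that it builds without an MPI library. With real MPI the handle would be an `MPI_Datatype` built by a constructor such as `MPI_Type_contiguous`, committed with `MPI_Type_commit`, and released with `MPI_Type_free`.

```cpp
#include <cassert>
#include <utility>

// Stand-ins so the sketch compiles without MPI (all hypothetical).
using FakeType = int;               // plays the role of MPI_Datatype
constexpr FakeType kNullType = -1;  // plays the role of MPI_DATATYPE_NULL
int g_live_types = 0;               // handles committed but not yet freed

FakeType fake_commit() {
    static int next_id = 0;
    ++g_live_types;
    return next_id++;
}

// Like MPI_Type_free, this resets the caller's handle to the null value.
void fake_free(FakeType& t) {
    assert(t != kNullType);  // freeing a null handle would be an error
    --g_live_types;
    t = kNullType;
}

// Owns one handle and frees it exactly once, when the owner goes away.
class ScopedType {
public:
    ScopedType() : handle_(fake_commit()) {}
    ~ScopedType() { reset(); }

    // Copying would create two owners of one handle, so it is disabled.
    ScopedType(const ScopedType&) = delete;
    ScopedType& operator=(const ScopedType&) = delete;

    // Moving transfers ownership and leaves the source empty.
    ScopedType(ScopedType&& other) noexcept
        : handle_(std::exchange(other.handle_, kNullType)) {}
    ScopedType& operator=(ScopedType&& other) noexcept {
        if (this != &other) {
            reset();
            handle_ = std::exchange(other.handle_, kNullType);
        }
        return *this;
    }

    FakeType get() const { return handle_; }

private:
    void reset() {
        if (handle_ != kNullType) {
            fake_free(handle_);
        }
    }
    FakeType handle_;
};
```

Disabling copies is the important design choice here: a copyable wrapper would free the same handle twice, which is the opposite failure from the missing free suspected above.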

Of course, ESMF is complex and this is also C++ code which I am not very good at. It's possible the frees are done elsewhere? (aka Fun with OO programming!)

It's also possible this has absolutely nothing to do with the crash. I mean, I currently load 51 (!) modules when I run with spack so...that's a lot of things to look at. But the fact that just changing from shared to static ESMF causes a crash does point us toward ESMF...

oehmke commented 6 months ago

Thanks for letting us know. This is deep in Gerhard's (@theurich) territory, so I'm going to assign it to him and hopefully he'll have a chance soon to take a look and make sure things are as they should be. What machine is this? I noticed the 12.3 gcc and wondered if this relates to Tom's issue (#397). Is he using a static ESMF?

climbfuji commented 6 months ago

This is on Discover. We've observed the same with gcc@10.1.0

mathomp4 commented 6 months ago

@oehmke My guess is @tclune is not using static ESMF. ESMA-Baselibs currently builds ESMF as static and shared and from the experiments @climbfuji and myself have done with spack and other observations, it looks like FindESMF.cmake chooses shared by default.

And I now realize instead of rebuilding ESMF as static only, I could have just set -DUSE_ESMF_STATIC_LIBS=YES in my GEOS builds. Son of a ... dangit. 😠

mathomp4 commented 5 months ago

Well, my current tests are not looking good for this being the issue. I mean, it's probably a memory leak (maybe?), but it'd be teeny. I've tried a few different ways of doing the MPI_Type_free (loop in order, in reverse order) and no change. (Well, in one attempt I think I did it too late and things went nutty, but I think that's due to GEOS.)
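
A cleanup loop of the kind described above could look like the following sketch. It is not the code that was actually tried: `FakeType`, `kNullType`, and `TypeRegistry` are hypothetical stand-ins so that it compiles without MPI, and a comment marks where a real version would call `MPI_Type_free`. It frees in reverse order of creation and nulls each handle, so a second call does nothing.

```cpp
#include <cassert>
#include <vector>

using FakeType = int;               // stand-in for MPI_Datatype
constexpr FakeType kNullType = -1;  // stand-in for MPI_DATATYPE_NULL

class TypeRegistry {
public:
    // Record a handle that was committed elsewhere.
    void add(FakeType t) {
        assert(t != kNullType);
        types_.push_back(t);
    }

    // Free every recorded handle, newest first. Safe to call repeatedly.
    void free_all() {
        for (auto it = types_.rbegin(); it != types_.rend(); ++it) {
            if (*it != kNullType) {
                // A real version would call MPI_Type_free(&*it) here.
                freed_order_.push_back(*it);
                *it = kNullType;
            }
        }
        types_.clear();
    }

    // Order in which handles were released, kept only for inspection.
    const std::vector<FakeType>& freed_order() const { return freed_order_; }

private:
    std::vector<FakeType> types_;
    std::vector<FakeType> freed_order_;
};
```

A guard like this only protects against freeing the same handle twice. The backtrace above aborts inside the vector's own deallocation rather than in an MPI call, which may be part of why adding frees made no difference.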

As @atrayano said when I talked with him, since it's a double free it's more likely that MAPL or ESMF is freeing something twice. But all the explicit MPI_Type_free calls in MAPL are matched up. And almost all in ESMF are as well. Grah.

mathomp4 commented 5 months ago

As a test, per a suggestion by @oehmke, I built ESMF with ESMF_PIO=OFF and ESMF_MOAB=OFF but no change. Dang.

But a thought occurred to me while chatting with @atrayano: what if we build MAPL as static along with ESMF? Do that and one of my at-finalize double-free errors (rs_numtiles.x) goes away.

So, I'm wondering if static ESMF means everything GEOS makes has to be static as well?

oehmke commented 5 months ago

That’s too bad. Sorry I haven’t done a lot with spack, but why does ESMF need to be built static only for this? (We should figure out this issue anyway, but I was just wondering why that’s a constraint.)

tclune commented 5 months ago

NOAA currently wants everything to be static. (Just guessing that this is the reason.)

They may be forced to accept DSO in the future though for multiple reasons. (Vendor/OS may force it and … MAPL3 is going to bake it in fairly deep.)

- Tom

climbfuji commented 5 months ago

I think the right way forward is to re-enable the shared esmf build. I just confirmed that if I do that (flip one character in our spack config file), geos builds and runs correctly.

Then we give the UFS folks a heads up that with the next spack-stack release ESMF will be both shared and static, and that they have to fix their build system to correctly pick up the static version (or move away from static libraries - it's a thing of the past anyway).

climbfuji commented 5 months ago

See https://github.com/ufs-community/ufs-weather-model/issues/2094 for the heads-up to the UFS that future versions of spack-stack will have both shared and static esmf and mapl. See https://github.com/JCSDA/spack-stack/pull/953 and https://github.com/JCSDA/spack/pull/372 for the spack-stack and spack changes to support GEOS (and build esmf and mapl both shared and static).

I agree nonetheless that this issue should be fixed between esmf and mapl so that one can combine shared and static libraries.

oehmke commented 5 months ago

If they will accept a shared version, then I agree, that’s what we should offer them for now. That should give us time to figure out this other problem, so we can offer a combined static version as well.

oehmke commented 5 months ago

Hey Matt, Was this built with debug (-g)? I’m just wondering if we could coax out more info about where this is failing. Also, do you know if this was happening with a version before 8.6 (e.g. 8.5)? I’m trying to narrow down the possibilities. Thanks.

mathomp4 commented 5 months ago

@oehmke Yup. Both GEOS and ESMF with debugging flags. And even that just gave the four usable lines of traceback.

climbfuji commented 5 months ago

I think this goes all the way back to 8.3.0, maybe beta snapshot 09. Could also be earlier, but we didn't run the UFS with earlier versions of spack-stack, therefore can't tell.

climbfuji commented 5 months ago

Fun stuff. Building ESMF in spack shared fails on macOS in the linker stage, see https://github.com/JCSDA/spack-stack/issues/956 ...

oehmke commented 5 months ago

This looks like it may be an issue with a fix for tracing we put in for Darwin. Would you try setting ESMF_TRACE_LIB_BUILD=OFF when building ESMF and see if that fixes it? Thanks.

climbfuji commented 5 months ago

Thanks so much @oehmke, that worked! I'll submit a PR to spack with the change for macOS when building shared ESMF. Sorry for the late reply, all-day meeting today ...