E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

munmap_chunk(): invalid pointer with SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang #4963

Closed ndkeen closed 11 months ago

ndkeen commented 2 years ago

Using the AMD compiler on pm-cpu, the test SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang fails with the following errors in a DEBUG build:

19: munmap_chunk(): invalid pointer
48: munmap_chunk(): invalid pointer
53: munmap_chunk(): invalid pointer
24: munmap_chunk(): invalid pointer
 1: corrupted size vs. prev_size
44: corrupted size vs. prev_size
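
Both messages are glibc heap-consistency diagnostics, so it can help to see which tasks report which one. The following is only a minimal sketch, assuming the leading number on each line is the MPI rank; the function name `group_by_message` is hypothetical and not part of E3SM or CIME.

```python
import re
from collections import defaultdict

# Lines copied from the log above; the "NN:" prefix is assumed to be the MPI rank.
LOG = """\
19: munmap_chunk(): invalid pointer
48: munmap_chunk(): invalid pointer
53: munmap_chunk(): invalid pointer
24: munmap_chunk(): invalid pointer
 1: corrupted size vs. prev_size
44: corrupted size vs. prev_size
"""

# Optional leading spaces, the rank, a colon, then the message text.
LINE_RE = re.compile(r"^\s*(\d+):\s+(.*\S)\s*$")


def group_by_message(text):
    """Return a dict mapping each message to the sorted list of ranks reporting it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            groups[match.group(2)].append(int(match.group(1)))
    return {message: sorted(ranks) for message, ranks in groups.items()}


if __name__ == "__main__":
    for message, ranks in group_by_message(LOG).items():
        print(f"{message}: ranks {ranks}")
```

For the six lines above, this groups ranks 19, 24, 48, and 53 under the `munmap_chunk()` message and ranks 1 and 44 under `corrupted size vs. prev_size`.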

Note: to compile with AMD, we still need this workaround: https://github.com/E3SM-Project/E3SM/issues/4949

When I tried this again with Jan 2023 master, I now see a file, run/log.seaice.0046.err, that I don't think was there before.

----------------------------------------------------------------------
Beginning MPAS-seaice Error Log File for task      46 of      64
    Opened at 2023/01/23 13:13:02
----------------------------------------------------------------------

ERROR: No exchange group found named 'TEMPSingleFieldGroup'.  Cannot destroy group.

ndkeen commented 1 year ago

After enabling the writing of core files, the backtrace shows:

#0  0x000015223f894cdb in raise () from /lib64/libc.so.6
#1  0x000015223f896375 in abort () from /lib64/libc.so.6
#2  0x000015223f8dab07 in __libc_message () from /lib64/libc.so.6
#3  0x000015223f8e2b8a in malloc_printerr () from /lib64/libc.so.6
#4  0x000015223f8e2e5c in munmap_chunk () from /lib64/libc.so.6
#5  0x0000152241249633 in f90_dealloc03a_i8 () from /opt/AMD/aocc-compiler-3.2.0/bin/../lib/libflang.so
#6  0x0000000003797f65 in mpas_dmpar::mpas_dmpar_destroy_communication_list ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_dmpar.f90:6013
#7  0x00000000037a8de8 in mpas_dmpar::mpas_dmpar_exch_group_destroy_buffers ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_dmpar.f90:8198
#8  0x00000000037a1b05 in mpas_dmpar::mpas_dmpar_exch_group_full_halo_exch ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_dmpar.f90:6961
#9  0x00000000037a1f13 in mpas_dmpar::mpas_dmpar_field_halo_exch ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_dmpar.f90:7016
#10 0x000000000382aeb4 in mpas_stream_manager::exch_all_halos ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_stream_manager.f90:4739
#11 0x0000000003827fbd in mpas_stream_manager::read_stream ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_stream_manager.f90:4023
#12 0x0000000003824c74 in mpas_stream_manager::mpas_stream_mgr_read ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_stream_manager.f90:3546
#13 0x000000000373ec42 in seaice_core::seaice_core_init ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/core_seaice/model_forward/mpas_seaice_core.f90:111
#14 0x0000000002fb764d in ice_comp_mct::ice_init_mct ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/mpas-seaice/driver/ice_comp_mct.f90:621
#15 0x000000000063d79a in component_mod::component_init_cc () at /global/cfs/cdirs/e3sm/ndk/repos/me11-jan12/driver-mct/main/component_mod.F90:257
#16 0x000000000060cfeb in cime_comp_mod::cime_init () at /global/cfs/cdirs/e3sm/ndk/repos/me11-jan12/driver-mct/main/cime_comp_mod.F90:1464
#17 0x000000000063b271 in cime_driver () at /global/cfs/cdirs/e3sm/ndk/repos/me11-jan12/driver-mct/main/cime_driver.F90:122

A similar stack appears for SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang:

#0  0x0000151ad9aaacdb in raise () from /lib64/libc.so.6
#1  0x0000151ad9aac375 in abort () from /lib64/libc.so.6
#2  0x0000151ad9af0b07 in __libc_message () from /lib64/libc.so.6
#3  0x0000151ad9af8b8a in malloc_printerr () from /lib64/libc.so.6
#4  0x0000151ad9afa94c in _int_free () from /lib64/libc.so.6
#5  0x0000151adb45f633 in f90_dealloc03a_i8 () from /opt/AMD/aocc-compiler-3.2.0/bin/../lib/libflang.so
#6  0x0000000000d0e435 in mpas_dmpar::mpas_dmpar_destroy_communication_list ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_dmpar.f90:6013
#7  0x0000000000d1f2a2 in mpas_dmpar::mpas_dmpar_exch_group_destroy_buffers ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_dmpar.f90:8197
#8  0x0000000000d17fd5 in mpas_dmpar::mpas_dmpar_exch_group_full_halo_exch ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_dmpar.f90:6961
#9  0x0000000000d183e3 in mpas_dmpar::mpas_dmpar_field_halo_exch ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_dmpar.f90:7016
#10 0x0000000000da1384 in mpas_stream_manager::exch_all_halos ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_stream_manager.f90:4739
#11 0x0000000000d9e48d in mpas_stream_manager::read_stream ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_stream_manager.f90:4023
#12 0x0000000000d9b144 in mpas_stream_manager::mpas_stream_mgr_read ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/framework/mpas_stream_manager.f90:3546
#13 0x0000000000cb5112 in seaice_core::seaice_core_init ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/bld/cmake-bld/core_seaice/model_forward/mpas_seaice_core.f90:111
#14 0x000000000052db1d in ice_comp_mct::ice_init_mct ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/me11-jan12/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.r00/mpas-seaice/driver/ice_comp_mct.f90:621
#15 0x00000000003c822a in component_mod::component_init_cc () at /global/cfs/cdirs/e3sm/ndk/repos/me11-jan12/driver-mct/main/component_mod.F90:257
#16 0x0000000000397a7b in cime_comp_mod::cime_init () at /global/cfs/cdirs/e3sm/ndk/repos/me11-jan12/driver-mct/main/cime_comp_mod.F90:1464
#17 0x00000000003c5d01 in cime_driver () at /global/cfs/cdirs/e3sm/ndk/repos/me11-jan12/driver-mct/main/cime_driver.F90:122
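
In both backtraces the innermost frames belong to libc and the Flang runtime (libflang.so), and the first frame in E3SM code is the deallocation in mpas_dmpar_destroy_communication_list. As a rough sketch of how that frame could be picked out automatically, the script below parses gdb `bt` text and returns the first frame that is not attributed to a shared library. It assumes the frame layout shown above; the helper names `parse_frames` and `first_application_frame` are hypothetical.

```python
import os
import re

# Frames #4-#7 copied from the first backtrace above.
BACKTRACE = """\
#4  0x000015223f8e2e5c in munmap_chunk () from /lib64/libc.so.6
#5  0x0000152241249633 in f90_dealloc03a_i8 () from /opt/AMD/aocc-compiler-3.2.0/bin/../lib/libflang.so
#6  0x0000000003797f65 in mpas_dmpar::mpas_dmpar_destroy_communication_list ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_dmpar.f90:6013
#7  0x00000000037a8de8 in mpas_dmpar::mpas_dmpar_exch_group_destroy_buffers ()
    at /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/SMS_D.ne4pg2_oQU480.F2010.pm-cpu_amdclang.20230123_125148_wf4qjc/bld/cmake-bld/framework/mpas_dmpar.f90:8198
"""

# "#N  0xADDR in function () [from library]"
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\S+) \(\)(?: from (\S+))?")
# "at path:line", either on the frame line itself or on the following line.
AT_RE = re.compile(r"\bat (\S+):(\d+)\s*$")


def parse_frames(text):
    """Parse gdb 'bt' output into a list of frame dicts."""
    frames = []
    for line in text.splitlines():
        frame = FRAME_RE.match(line)
        if frame:
            frames.append({
                "index": int(frame.group(1)),
                "function": frame.group(2),
                "library": frame.group(3),  # None when no "from <lib>" is given
                "file": None,
                "line": None,
            })
        location = AT_RE.search(line)
        if location and frames:
            frames[-1]["file"] = location.group(1)
            frames[-1]["line"] = int(location.group(2))
    return frames


def first_application_frame(frames):
    """Return the innermost frame not attributed to a shared library, or None."""
    for frame in frames:
        if frame["library"] is None:
            return frame
    return None


if __name__ == "__main__":
    frame = first_application_frame(parse_frames(BACKTRACE))
    if frame is not None:
        source = os.path.basename(frame["file"]) if frame["file"] else "?"
        print(f"#{frame['index']} {frame['function']} at {source}:{frame['line']}")
```

For the excerpt used here, the reported frame is #6, mpas_dmpar::mpas_dmpar_destroy_communication_list at mpas_dmpar.f90:6013, which is the same source line in both stacks.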
ndkeen commented 1 year ago

Using master from July, I get a build error that looks like a problem in the compiler itself: the Fortran frontend segfaults while compiling prep_glc_mod.F90. I will add the output here now and come back to it later.

cd /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/cpl && python3 /pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/Tools/e3s\
m_compile_wrap.py  /opt/cray/pe/craype/2.7.19/bin/ftn -DCPRAMD -DFORTRANUNDERSCORE -DHAVE_MPI -DLinux -DMCT_INTERFACE -DNO_R16 -DYAKL_DEBUG -D_PNETCDF -I/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/components/cmake/cpl/. -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu\
/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/amdclang/mpich/debug/nothreads/mct/include -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/amdclang/mpich/debug/nothreads/m\
ct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/opt/cray/pe/netcdf-hdf5parallel/4.9.0.3/aocc/3.0/include -I/opt/cray/pe/parallel-netcdf/1.12.3.3/aocc/3.0/include -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang\
.gh4963/bld/cmake-bld/mpas-framework/src -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/cpl -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.p\
m-cpu_amdclang.gh4963/bld/cmake-bld/cmake/atm -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/lnd -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTE\
STM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/ice -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/ocn -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v\
3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/rof -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/glc -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60\
to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/wav -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/iac -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_\
oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/cmake-bld/cmake/esp -I/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/SourceMods/src.drv -I/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main -I/p\
scratch/sd/n/ndk/e3sm_scratch/pm-cpu/nexty-jul18/SMS_D_Ld1.T62_oEC60to30v3.DTESTM.pm-cpu_amdclang.gh4963/bld/lnd/obj    -O0 -g -Mflushz    -Mfreeform -DUSE_CONTIGUOUS=  -c /global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main/prep_glc_mod.F90 -o CMakeFil\
es/e3sm.exe.dir/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main/prep_glc_mod.F90.o
/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin/cmake -E touch cmake/cpl/CMakeFiles/e3sm.exe.dir/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main/cplcomp_exchange_mod.F90.o.provides.build
/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin/cmake -E touch cmake/cpl/CMakeFiles/e3sm.exe.dir/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main/prep_iac_mod.F90.o.provides.build
/global/common/software/nersc/pm-2022q4/spack/linux-sles15-zen/cmake-3.24.3-k5msymx/bin/cmake -E touch cmake/cpl/CMakeFiles/e3sm.exe.dir/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main/prep_rof_mod.F90.o.provides.build
clang-13: error: unable to execute command: Segmentation fault
clang-13: error: Fortran frontend to LLVM command failed due to signal (use -v to see invocation)
AMD clang version 13.0.0 (CLANG: AOCC_3.2.0-Build#128 2021_11_12) (based on LLVM Mirror.Version.13.0.0)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/AMD/aocc-compiler-3.2.0/include/../bin
clang-13: note: diagnostic msg: Error generating preprocessed source(s).
Target CMakeFiles/e3sm.exe.dir/global/cfs/cdirs/e3sm/ndk/repos/nexty-jul18/driver-mct/main/prep_glc_mod.F90.o built in 1.925062 seconds

ndkeen commented 11 months ago

With master of Nov 29th, this is no longer failing. This may have been another issue fixed by the newer AMD compiler version introduced in https://github.com/E3SM-Project/E3SM/pull/6003