geoschem / Cloud-J

Cloud-J is a multiple-scattering, eight-stream radiative transfer model for solar radiation, based on Fast-J. It was originally developed by Michael J. Prather.
GNU General Public License v3.0

SegFault for test run #8

Closed: yuanjianz closed this issue 6 months ago

yuanjianz commented 7 months ago

I was using GCHP v14.3, which integrates Cloud-J as a default option, but I encountered a run-time issue in the Cloud-J step (see issues here). I tried the standalone version of Cloud-J in Debug mode, as suggested by @lizziel, and found the same SegFault. The log file suggests it could be a problem similar to #1 (see the attached file). I also tried gfortran, which produced a different "erroneous arithmetic operation" fault elsewhere.

ifort version 19.1.0.166 intel.log.txt

GNU Fortran (Spack GCC) 10.2.0 gnu.log.txt

lizziel commented 7 months ago

@yuanjianz, thanks for reporting this. @pratherUCI, have you seen either of these run-time errors before?

yuanjianz commented 7 months ago

I switched to an environment on another server with ifort version 19.1.3.304, and it works this time. So it could be something related to the compiler version.

pratherUCI commented 7 months ago

I run ifort Fortran 64 v2021.10.0 (Win 11) and have not triggered a segfault recently. Xin Zhu is running Cloud-J v8.0c on Linux in our CTM and has no such problems (but we are only doing OpenMP). There were some problems with cloud overlaps, but that was fixed a while ago, and there appear to be no clouds in the example.

Looking at the common loop problem at fjx_sub_mod.f90:1355: it may be that too many 4D arrays in the FP list are troublesome. That section (the call to GEN_ID) could be rewritten with simple 3D transfer arrays and a little extra code, and it might be simpler that way (see the sketch below).

Sorry no help here, Michael
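
For illustration, here is a minimal sketch of the 3D transfer-array idea described above. GEN_SUB, QAER4D, and all dimensions are hypothetical stand-ins, not the actual GEN_ID interface:

! demo_transfer3d.f90 -- hypothetical sketch of the 3D transfer-array idea;
! GEN_SUB and QAER4D are stand-ins, not actual Cloud-J names.
program demo_transfer3d
  implicit none
  integer, parameter :: nx=4, ny=4, nz=4, nk=3
  real*8 :: qaer4d(nx,ny,nz,nk)
  real*8, allocatable :: buf3d(:,:,:)
  integer :: k
  qaer4d = 1.d0
  allocate(buf3d(nx,ny,nz))
  do k = 1, nk
     buf3d = qaer4d(:,:,:,k)          ! contiguous 3D copy of one slice
     call gen_sub(buf3d, nx, ny, nz)  ! callee sees only 3D arrays
  end do
  deallocate(buf3d)
contains
  subroutine gen_sub(a, n1, n2, n3)
    integer, intent(in)    :: n1, n2, n3
    real*8,  intent(inout) :: a(n1,n2,n3)
    a = a * 2.d0
  end subroutine gen_sub
end program demo_transfer3d

Copying each slice into a contiguous 3D buffer keeps 4D actual arguments out of the call, at the cost of one extra copy per slice.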

yuanjianz commented 7 months ago

Hi @pratherUCI, thanks for your suggestion. Maybe we should try that later. On the other hand, Cloud-J seems to be working well under the other environment I intended to run with (the previous errors occurred in my test environment). Since you also provide many cases where the SegFault does not appear, it looks like only a compiler version compatibility issue.

lizziel commented 7 months ago

For what it's worth, the Intel library I use on the NASA Discover cluster is Intel 2021.5.0, and it works fine. Given that newer versions work (rather than the other way around), I don't think we should pursue a fix. However, the gfortran error is curious; I am using GNU 10.2.0 at Harvard without issue.

yuanjianz commented 6 months ago

Hi @lizziel, sorry for the late follow-up. I double-checked that my GNU version was indeed 10.2.0. I fixed the GNU problem above by turning off OMP. However, that does not work for Intel: there is still a SegFault in the BLKSLV subroutine. But it does seem to be only a compiler version issue.

lizziel commented 6 months ago

Hi @yuanjianz, thanks for the tip about OMP. I'll look into what the issue is there.

pratherUCI commented 6 months ago

Re OMP: we run it with Intel at UCI in our CTM and have never had a problem. But of course the OMP directives are not within Cloud-J, only in the main program.

Michael Prather - sent from phone, brief w/ odd Otto-complete

lizziel commented 6 months ago

It turns out I misspoke about Cloud-J working with GNU 10.2.0 on our cluster. I was able to reproduce the segfault, both with and without OMP, and also with GNU 12.2.0.

lizziel commented 6 months ago

I should add that we do not see this when using Cloud-J in GEOS-Chem, GEOS, or CESM, so it is something specific to the Cloud-J standalone (at least for the compilers I am testing with, since GCHP works fine in the same environment). I will dig into it more to try to find a fix.

yantosca commented 6 months ago

Hi @lizziel @pratherUCI @yuanjianz. I did a quick test to build the Cloud-J standalone with debug flags on and to run it with the GNU debugger. I used these commands:

$ git clone git@github.com:geoschem/Cloud-J
$ cd Cloud-J/
$ mkdir debug
$ cd debug
$ cmake ..
$ cmake . -DCMAKE_BUILD_TYPE=Debug
$ make -j
$ make install
$ cd ..
$ mkdir run
$ cd run
$ ln -s ../tables/ .
$ cp ../debug/bin/cloudj_standalone .
$ gdb cloudj_standalone
(gdb) run

and I got this error output:

Program received signal SIGFPE, Arithmetic exception.
0x0000000000417b62 in ica_qud (
    wcol=<error reading variable: value requires 160000 bytes, which is more than max-value-size>, 
    ocol=<error reading variable: value requires 160000 bytes, which is more than max-value-size>, 
    ltop=34, icau=20000, nqdu=4, nica=32, wtqca=..., 
    isort=<error reading variable: value requires 80000 bytes, which is more than max-value-size>, 
    nq1=..., nq2=..., ndxqs=...)
    at /n/holylfs05/LABS/jacob_lab/ryantosca/tests/cldj/Cloud-J/src/Core/cld_sub_mod.f90:1084
1084           do while (OCOLS(I).lt.OD_QUAD(N) .and. I.le.NICA)
(gdb) print i
$1 = 33
(gdb) print ocols(i)
$2 = nan(0x4000000000000)
(gdb) print od_quad(n)
$3 = 30

The variable OCOLS(33) is NaN at line 1084 of cld_sub_mod.f90. I believe this is an initialization issue, because OCOLS has not been initialized to zero before being used.

The fix is to set OCOLS to zero in the ICA_QUD routine:

!-----------------------------------------------------------------------
      SUBROUTINE ICA_QUD(WCOL,OCOL, LTOP,ICAU,NQDU,NICA, &
                         WTQCA, ISORT,NQ1,NQ2,NDXQS)
!-----------------------------------------------------------------------
!---Take the full set of ICAs and group into the NQD_ ranges of total OD
!---Create the Cumulative Prob Fn and select the mid-point ICA for each group
!---The Quad atmospheres have weights WTQCA
!-----------------------------------------------------------------------
      implicit none
      integer, intent(in)        :: LTOP,ICAU,NQDU,NICA
      real*8,  intent(in), dimension(ICAU)      :: WCOL,OCOL

      real*8, intent(out), dimension(NQDU)      :: WTQCA
      integer, intent(out), dimension(ICAU)     :: ISORT
      integer, intent(out), dimension(NQDU)     :: NQ1,NQ2,NDXQS

      real*8,  dimension(ICA_) :: OCDFS, OCOLS
      integer I, II, J, L, N, N1, N2

      real*8, parameter:: OD_QUAD(4) =[0.5d0, 4.0d0, 30.d0, 1.d9]
!-----------------------------------------------------------------------
      ISORT(:) = 0
      WTQCA(:)  = 0.d0
      NDXQS(:) = 0
      OCOLS(:) = 0.d0   ! <== add this line

When you configure Cloud-J with the -DCMAKE_BUILD_TYPE=Debug option, the code will be compiled with these debugging flags (for GNU):

-ffpe-trap=invalid,zero,overflow -finit-real=snan

The -ffpe-trap flag causes the code to halt when it encounters certain floating-point exceptions. The -finit-real=snan flag initializes every real variable that the code does not initialize itself to a signaling NaN, which makes the code die with a floating-point exception as soon as such a value is used (thus alerting you to the uninitialized variable).
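
As a minimal illustration of how these two flags interact (a hypothetical program, not Cloud-J code): an uninitialized real starts out as a signaling NaN under -finit-real=snan, and the first ordered comparison on it then trips -ffpe-trap=invalid, mirroring the do-while failure at cld_sub_mod.f90:1084.

! demo_snan.f90 -- hypothetical example, not from Cloud-J
! Build: gfortran -ffpe-trap=invalid,zero,overflow -finit-real=snan demo_snan.f90
program demo_snan
  implicit none
  real*8 :: ocols(4)   ! deliberately never initialized
  ! With -finit-real=snan these elements hold signaling NaNs, so the
  ! ordered comparison below raises SIGFPE instead of silently
  ! branching on garbage.
  if (ocols(1) .lt. 30.d0) then
     print *, 'branch taken'
  end if
end program demo_snan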

When I recompiled with the fix, the standalone code was able to run to completion.

yuanjianz commented 6 months ago

Hi @lizziel @yantosca. I want to clarify that my GNU-compiled version died with the SIGFPE fault in Debug mode, just as @yantosca posted above. I then built a Release GNU version, which works fine after disabling OMP.

I also want to share what I found when trying to locate the SegFault with my Intel compiler. It does indeed seem related to the 4D arrays: I commented out all lines that use 4D arrays in computation, and the SegFault no longer appears. I suspect that older versions of ifort cannot handle so many large 4D arrays.

lizziel commented 6 months ago

Thank you @yantosca! I will add this fix to the PR I am working on that will be merged soon.

yantosca commented 6 months ago

Hi @yuanjianz, I was able to build and run the Cloud-J standalone with this version of Intel:

$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.8.0 Build 20221119_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.
yantosca commented 6 months ago

Hi @yuanjianz... thanks for the clarification. The Release GNU version might still have the error (or the optimizer might flush the uninitialized variables to zero). You might also try increasing the stack memory limit in your shell:

ulimit -s unlimited

The stack is where many temporary variables get allocated. The limit is typically set to a low value by default, but the command above will max it out.
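
For context, this matters because fixed-size and automatic local arrays (like OCOLS and OCDFS in ICA_QUD) are typically placed on the stack, although the exact behavior varies by compiler and flags. A hypothetical illustration of the failure mode:

! demo_stack.f90 -- hypothetical illustration of stack exhaustion
program demo_stack
  implicit none
  call big_locals(10000000)
contains
  subroutine big_locals(n)
    integer, intent(in) :: n
    ! An automatic array of ~80 MB; compilers typically place these on
    ! the stack, so this can segfault under a small `ulimit -s` yet run
    ! cleanly after `ulimit -s unlimited`.
    real*8 :: work(n)
    work(:) = 1.d0
    print *, sum(work)
  end subroutine big_locals
end program demo_stack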

yuanjianz commented 6 months ago

Hi @yantosca, thanks for the reminder!


> Hi @yuanjianz, I was able to build and run the Cloud-J standalone with this version of Intel:
>
> $ ifort -V
> Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.8.0 Build 20221119_000000
> Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

My Intel compiler info:

bash-4.2$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 19.1.0.166 Build 20191121
Copyright (C) 1985-2019 Intel Corporation.  All rights reserved.

> You might also try increasing the stack memory limit in your shell: ulimit -s unlimited

Unlimiting the stack size fixed my cloudj_standalone SegFault; sorry that I forgot about this part. The interesting thing is that my GCHP run script already includes all of the following settings to prevent the OS from killing my jobs:

ulimit -c 0                  # coredumpsize
ulimit -l unlimited          # memorylocked
ulimit -u 50000              # maxproc
ulimit -v unlimited          # vmemoryuse
ulimit -s unlimited          # stacksize

Yet the SegFault still happens, as described in my original GCHP issue.
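
One general OpenMP detail that may be relevant here (an assumption, not verified for this case): ulimit -s sizes only the main thread's stack, while each OpenMP worker thread gets a separate stack sized by the OMP_STACKSIZE environment variable. A threaded region can therefore still overflow even with ulimit -s unlimited, which would also be consistent with disabling OMP making the GNU standalone run work.

export OMP_STACKSIZE=500m    # hypothetical value; the size needed depends on the configuration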

yantosca commented 6 months ago

Thanks for the update @yuanjianz. Right now the interface to GEOS-Chem and GCHP is in the geos-chem branch. This does not yet have the fix I mentioned above (but it will once we merge @lizziel's PR #2).

yuanjianz commented 6 months ago

I am closing this issue, as I have figured out the problem in my environment. It is fundamentally related to the stack size memory limit with the Intel compiler, rather than a problem with the compiler version. For standalone users, unlimiting the stack size in the shell fixes the problem easily; however, it might cause bigger issues for GCHP users. I will explain in more detail in the related GCHP issue geoschem/GCHP#387.