geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

Thread 1 "gcclassic" received signal SIGSEGV, Segmentation fault. #2502

Open YueZhang720 opened 2 weeks ago

YueZhang720 commented 2 weeks ago

Your name

Yue Zhang

Your affiliation

HKUST(GZ)

What happened? What did you expect to happen?

I'm running a full-chemistry simulation from 2019/07/01 to 2019/08/01. My log file contains the following errors:

---> DATE: 2019/07/01  UTC: 00:00
 HEMCO already called for this timestep. Returning.
 Getting CH4 boundary conditions in GEOS-Chem from :NOAA_GMD_CH4
real 181.08
user 1026.99
sys 29.57

What are the steps to reproduce the bug?

I then used gdb to get a backtrace of the error:

********************************************
* B e g i n   T i m e   S t e p p i n g !! *
********************************************

---> DATE: 2019/07/01  UTC: 00:00
 HEMCO already called for this timestep. Returning.
 Getting CH4 boundary conditions in GEOS-Chem from :NOAA_GMD_CH4

Thread 1 "gcclassic" received signal SIGSEGV, Segmentation fault.
0x000000000105972a in blkslv (fj=<error reading variable: Cannot access memory at address 0x7fffff3ed118>, pomega=<error reading variable: Cannot access memory at address 0x7fffff3ed110>, fz=<error reading variable: Cannot access memory at address 0x7fffff3ed108>, ztau=<error reading variable: Cannot access memory at address 0x7fffff3ed100>, fsbot=<error reading variable: Cannot access memory at address 0x7fffff3ed0f8>, rfl=<error reading variable: Cannot access memory at address 0x7fffff3ed0f0>, pm=..., pm0=..., fjtop=..., fjbot=..., fibot=..., ldokr=..., nd=165) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1372
1372          subroutine BLKSLV &
(gdb) backtrace
#0  0x000000000105972a in blkslv (fj=<error reading variable: Cannot access memory at address 0x7fffff3ed118>, 
    pomega=<error reading variable: Cannot access memory at address 0x7fffff3ed110>, 
    fz=<error reading variable: Cannot access memory at address 0x7fffff3ed108>, 
    ztau=<error reading variable: Cannot access memory at address 0x7fffff3ed100>, 
    fsbot=<error reading variable: Cannot access memory at address 0x7fffff3ed0f8>, 
    rfl=<error reading variable: Cannot access memory at address 0x7fffff3ed0f0>, pm=..., pm0=..., fjtop=..., fjbot=..., fibot=..., 
    ldokr=..., nd=165) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1372
#1  0x0000000001062011 in miesct (fj=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, fjt=..., 
    fjb=..., fib=..., pomega=<error reading variable: value requires 1038528 bytes, which is more than max-value-size>, 
    fz=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, 
    ztau=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, fsbot=..., rfl=..., 
    u0=0.41925676950732299, ldokr=..., nd=165) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1339
#2  0x00000000010683f6 in opmie (dtaux=..., 
    pomegax=<error reading variable: value requires 82944 bytes, which is more than max-value-size>, u0=0.41925676950732299, rfl=..., 
    amf=..., amg=..., jxtra=..., fjact=..., fjtop=..., fjbot=..., fibot=..., fsbot=..., fjflx=..., flxd=..., flxd0=..., ldokr=..., lu=47, 
    rc=0) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1243
#3  0x00000000010761d0 in photo_jx (u0=0.41925676950732299, sza=65.212326865588082, rfl=..., solf=0.96650251637029394, lprtj=.FALSE., 
    ppp=..., zzz=..., ttt=..., hhh=..., ddd=..., rrr=..., ooo=..., ccc=..., lwp=..., iwp=..., reffl=..., reffi=..., aersp=..., ndxaer=..., 
    l1u=48, anu=37, njxu=166, valjxx=..., skperd=..., swmsq=..., od18=..., ldark=.FALSE., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:587
#4  0x0000000001034cd3 in cloud_jx (u0=0.41925676950732299, sza=65.212326865588082, rfl=..., solf=0.96650251637029394, lprtj=.FALSE., 
    ppp=..., zzz=..., ttt=..., hhh=..., ddd=..., rrr=..., ooo=..., ccc=..., lwp=..., iwp=..., reffl=..., reffi=..., cldf=..., 
    cldcor=0.33000001311302185, cldiw=..., aersp=..., ndxaer=..., l1u=48, anu=37, njxu=166, valjxx=..., skperd=..., swmsq=..., od18=..., 
    iran=1, nica=0, jcount=0, ldark=.FALSE., wtqca=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_sub_mod.F90:183
#5  0x0000000000b75b4b in __cldj_interface_mod_MOD_run_cloudj._omp_fn.0 ()
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/cldj_interface_mod.F90:898
#6  0x00007ffff79d0166 in GOMP_parallel (fn=0xb73218 <__cldj_interface_mod_MOD_run_cloudj._omp_fn.0>, data=0x7ffffffd9810, num_threads=32, 
    flags=0) at /tmp/jingzhoujiang/spack-stage/spack-stage-gcc-14.2.0-ynlf3gta5n3oegqmhph3urmk6k26txce/spack-src/libgomp/parallel.c:178
#7  0x0000000000b72ba0 in run_cloudj (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/cldj_interface_mod.F90:415
#8  0x00000000006df8ee in do_photolysis (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/photolysis_mod.F90:538
#9  0x00000000004e8a50 in do_fullchem (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/fullchem_mod.F90:393
#10 0x00000000004395de in do_chemistry (input_opt=..., state_chm=..., state_diag=..., state_grid=..., state_met=..., rc=0)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/GeosCore/chemistry_mod.F90:248
#11 0x000000000040b372 in geos_chem () at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:1456
#12 0x000000000040ed17 in main (argc=1, argv=0x7ffffffef2d2)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:32
#13 0x00007ffff7740d90 in __libc_start_call_main (main=main@entry=0x40eccf <main>, argc=argc@entry=1, argv=argv@entry=0x7ffffffeee28)
    at ../sysdeps/nptl/libc_start_call_main.h:58
#14 0x00007ffff7740e40 in __libc_start_main_impl (main=0x40eccf <main>, argc=1, argv=0x7ffffffeee28, init=<optimized out>, 
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffffffeee18) at ../csu/libc-start.c:392
#15 0x0000000000405b15 in _start ()

What should I do next?
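
One thing that may be worth checking here, offered as a sketch rather than a diagnosis: frame #0 stops at the entry to BLKSLV with its arguments unreadable, inside an OpenMP region running 32 threads. That pattern can occur when a thread runs out of stack memory. The snippet below shows how the stack limits could be inspected and raised before launching the run. The 500m value is an assumption based on commonly cited GEOS-Chem Classic guidance, not something stated in this thread.

```shell
#!/usr/bin/env bash
# Sketch: inspect and raise stack limits before launching an OpenMP run.
# Assumption: 500m is an illustrative per-thread stack size, not a value
# taken from this thread.

# Current soft limit for the main thread's stack ("unlimited" or a size in kB).
stack_limit="$(ulimit -s)"
echo "Current stack limit: ${stack_limit}"

# Try to lift the limit; this can fail if the hard limit is lower.
ulimit -s unlimited 2>/dev/null || echo "Could not raise stack limit (hard limit applies)"

# Per-thread stack size for OpenMP worker threads, read by the OpenMP runtime at startup.
export OMP_STACKSIZE=500m
echo "OMP_STACKSIZE=${OMP_STACKSIZE}"

# Then launch the model from the run directory, for example:
# ./gcclassic > GC.log 2>&1
```

If the crash moves or disappears after these settings change, stack exhaustion becomes a more likely explanation. If it does not, the limits were probably not the cause.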

Please attach any relevant configuration and log files.

GC.log gcfiles.zip

What GEOS-Chem version were you using?

14.4.3

What environment were you running GEOS-Chem on?

Local cluster

What compiler and version were you using?

gcc 14.2.0

Will you be addressing this bug yourself?

Yes, but I will need some help

In what configuration were you running GEOS-Chem?

GCClassic

What simulation were you running?

Full chemistry

At what resolution were you running GEOS-Chem?

2x2.5

What meteorology fields did you use?

MERRA-2

Additional information

No response

yantosca commented 2 weeks ago

Thanks for writing, @YueZhang720. I wonder if this is specific to the gcc 14.2.0 compilers. I can try to replicate it, but I will first need to build the libraries with Spack for 14.2.0, so it may take me a while to get to this.

There are a couple of errors like this:

pomega=<error reading variable: value requires 1038528 bytes, which is more than max-value-size>, 
    fz=<error reading variable: value requires 129816 bytes, which is more than max-value-size>, 

so it may also be an issue on your cluster.

If you have an older version of the GCC compilers (like 12.2.0) available, try that and see if you get the same error.
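
A side note on the `max-value-size` messages quoted above: that wording is gdb's own limit on how large a value it will print, so those lines may simply mean the arrays were too big to display. The sketch below runs gdb non-interactively with that limit lifted and paging turned off, so the backtrace is written out in full. The executable name `./gcclassic` and the file names `gdb_cmds.txt` and `gdb_backtrace.log` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: run gdb in batch mode so the backtrace is complete and unpaged.
# Assumptions: the executable is ./gcclassic in the current run directory;
# gdb_cmds.txt and gdb_backtrace.log are illustrative file names.

# Write the gdb commands to a file.
cat > gdb_cmds.txt <<'EOF'
set pagination off
set max-value-size unlimited
run
backtrace
EOF

# Only invoke gdb if both gdb and the executable are present.
if command -v gdb >/dev/null 2>&1 && [ -x ./gcclassic ]; then
    gdb -batch -x gdb_cmds.txt ./gcclassic > gdb_backtrace.log 2>&1
else
    echo "gdb or ./gcclassic not found; wrote gdb_cmds.txt only"
fi
```

With paging off, the log no longer contains the `--Type <RET> for more` prompts, which makes it easier to attach to the issue.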

yantosca commented 2 weeks ago

Also which version of gdb are you using? You can type gdb --version to get that information.

YueZhang720 commented 1 week ago

> Also which version of gdb are you using? You can type gdb --version to get that information.

Thanks for your help. This is my gdb version:

jingzhoujiang@login01:~/gcruns$ gdb --version
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

I have tried gcc 12.2.0, but I still get a segmentation fault, along with more errors. Is there anything wrong with my software or environment settings?


At line 507 of file /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90
Fortran runtime error: Index '2146697216' of dimension 1 of array 'ooj' above upper bound of 48

Error termination. Backtrace:

Thread 30 "gcclassic" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffa5e289640 (LWP 3071783)]
0x0000000001063324 in opmie (dtaux=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, pomegax=<error reading variable: value requires 12610078956637388800 bytes, which is more than max-value-size>, u0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, rfl=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, amf=..., amg=<error reading variable: value requires 18419722475945328640 bytes, which is more than max-value-size>, jxtra=<error reading variable: value requires 18433233274827440128 bytes, which is more than max-value-size>, fjact=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, fjtop=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fjbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fibot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fsbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, fjflx=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, flxd=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, flxd0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, ldokr=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, lu=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, rc=<error reading variable: Cannot access memory at address 0x7ff4000000000000>) at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1040
1040          DTAU1(L) = DTAUX(L,K) * AMG(L)
(gdb) backtrace
#0  0x0000000001063324 in opmie (dtaux=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    pomegax=<error reading variable: value requires 12610078956637388800 bytes, which is more than max-value-size>, 
    u0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    rfl=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, amf=..., 
    amg=<error reading variable: value requires 18419722475945328640 bytes, which is more than max-value-size>, 
    jxtra=<error reading variable: value requires 18433233274827440128 bytes, which is more than max-value-size>, 
    fjact=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    fjtop=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fjbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fibot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fsbot=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    fjflx=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    flxd=<error reading variable: value requires 17717160934075531264 bytes, which is more than max-value-size>, 
    flxd0=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    ldokr=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    lu=<error reading variable: Cannot access memory at address 0x7ff4000000000000>, 
    rc=<error reading variable: Cannot access memory at address 0x7ff4000000000000>)
    at /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90:1040
#1  0x7ff4000000000000 in ?? ()
#2  0x7ff4000000000000 in ?? ()
#3  0x7ff4000000000000 in ?? ()
#4  0x7ff4000000000000 in ?? ()
#5  0x7ff4000000000000 in ?? ()
#6  0x7ff4000000000000 in ?? ()
#7  0x7ff4000000000000 in ?? ()
#8  0x7ff4000000000000 in ?? ()
#9  0x7ff4000000000000 in ?? ()
#10 0x7ff4000000000000 in ?? ()
#11 0x7ff4000000000000 in ?? ()
#12 0x7ff4000000000000 in ?? ()
#13 0x7ff4000000000000 in ?? ()
#14 0x7ff4000000000000 in ?? ()
#15 0x7ff4000000000000 in ?? ()
#16 0x7ff4000000000000 in ?? ()
#17 0x7ff4000000000000 in ?? ()
#18 0x7ff4000000000000 in ?? ()
#19 0x7ff4000000000000 in ?? ()
#20 0x7ff4000000000000 in ?? ()
#21 0x7ff4000000000000 in ?? ()
#22 0x7ff4000000000000 in ?? ()
#23 0x7ff4000000000000 in ?? ()
#24 0x7ff4000000000000 in ?? ()
#25 0x7ff4000000000000 in ?? ()
#26 0x7ff4000000000000 in ?? ()
#27 0x7ff4000000000000 in ?? ()
#28 0x7ff4000000000000 in ?? ()
#29 0x7ff4000000000000 in ?? ()
#30 0x7ff4000000000000 in ?? ()
#31 0x7ff4000000000000 in ?? ()
#32 0x7ff4000000000000 in ?? ()
#33 0x7ff4000000000000 in ?? ()
#34 0x7ff4000000000000 in ?? ()
#35 0x7ff4000000000000 in ?? ()
#36 0x7ff4000000000000 in ?? ()
#37 0x7ff4000000000000 in ?? ()
#38 0x7ff4000000000000 in ?? ()
#39 0x7ff4000000000000 in ?? ()
#40 0x7ff4000000000000 in ?? ()
#41 0x7ff4000000000000 in ?? ()
#42 0x7ff4000000000000 in ?? ()
#43 0x7ff4000000000000 in ?? ()
#44 0x7ff4000000000000 in ?? ()
#45 0x7ff4000000000000 in ?? ()
#46 0x7ff4000000000000 in ?? ()
#47 0x7ff4000000000000 in ?? ()
#48 0x7ff4000000000000 in ?? ()
#49 0x7ff4000000000000 in ?? ()
#50 0x7ff4000000000000 in ?? ()

yantosca commented 5 days ago

Thanks @YueZhang720. This is an out-of-bounds error in Cloud-J:

At line 507 of file /public/home/jingzhoujiang/gcruns/CodeDir/src/Cloud-J/src/Core/cldj_fjx_sub_mod.F90
Fortran runtime error: Index '2146697216' of dimension 1 of array 'ooj' above upper bound of 48

@lizziel: Didn't we see a similar issue in Cloud-J before?
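
A small observation that may help narrow this down: the out-of-bounds index 2146697216 appears to be the upper 32 bits of the value 0x7ff4000000000000 that gdb reports throughout the second backtrace. In IEEE 754 double precision that bit pattern is a NaN (all exponent bits set, nonzero mantissa). The sketch below only checks the arithmetic. What it means for Cloud-J is an open question, and reading it as a NaN fill pattern landing in an integer index is an assumption, not a confirmed cause.

```shell
#!/usr/bin/env bash
# Sketch: compare the out-of-bounds index with the value gdb keeps reporting.

index=2146697216            # index from the gfortran runtime error
addr=0x7ff4000000000000     # value repeated in the gdb backtrace

# Render the index in hex and take the upper 32 bits of the 64-bit value.
index_hex="$(printf '0x%x' "$index")"
addr_high="$(printf '0x%x' "$(( addr >> 32 ))")"

echo "index as hex:          ${index_hex}"
echo "upper 32 bits of addr: ${addr_high}"
```

Both lines print the same hex value, which suggests the bad index and the corrupted stack contents share a common source.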