geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

Program received signal SIGSEGV: Segmentation fault - invalid memory reference. #2526

Closed gopikrishnangs44 closed 1 month ago

gopikrishnangs44 commented 1 month ago

Your name

Gopikrishnan

Your affiliation

Columbia University, NY

Please provide a clear and concise description of your question or discussion topic.

I am running GCClassic version 14.4.3. GEOS-Chem runs perfectly for the MERRA-2 4x5 simulation, and I saved the boundary conditions (BC) to the Outputdirs.

The error occurs when I try to run the nested version of the model. Time stepping begins and then the model stops in TPCORE.

********************************************
* B e g i n   T i m e   S t e p p i n g !! *
********************************************

---> DATE: 2018/01/01  UTC: 00:00
 HEMCO already called for this timestep. Returning.
 Getting CH4 boundary conditions in GEOS-Chem from :NOAA_GMD_CH4
NASA-GSFC Tracer Transport Module successfully initialized

Both my Slurm script and the env file have

export OMP_NUM_THREADS=32
export F_UFMTENDIAN=big
export OMP_STACKSIZE=3000m
ulimit -s unlimited

And these are the resources and memory requested on the shared system

#!/bin/bash
#SBATCH -A fiore         # Account
#SBATCH --job-name=GC_run    # The job name
#SBATCH -c 32                # Number of cores
#SBATCH -N 1                 # Ensure that all cores are on one machine
#SBATCH -t 0-02:00           # Runtime in D-HH:MM
#SBATCH --exclusive           # Memory pool for all cores
#SBATCH -C mem768
#SBATCH --mem=700GB

The slurm output is

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:

Backtrace for this error:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x15555363eb4f in ???
#1  0x5c0baa in tp2d
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/tpcore_window_mod.F90:1471
#0  0x15555363eb4f in ???
#1  0x5c0baa in tp2d
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/tpcore_window_mod.F90:1471
#2  0x5c9711 in __tpcore_window_mod_MOD_air_mass_flux._omp_fn.1
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/tpcore_window_mod.F90:1155
#3  0x15555404f01d in gomp_thread_start
        at /local/gg2995/spack-stage/spack-stage-gcc-10.2.0-ca5qgfppgl4tppua3svba6vehsuvjayp/spack-src/libgomp/team.c:123
#0  0x15555363eb4f in ???
#1  0x5c0baa in tp2d
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/tpcore_window_mod.F90:1471
#2  0x5c9711 in __tpcore_window_mod_MOD_air_mass_flux._omp_fn.1
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/tpcore_window_mod.F90:1155
#3  0x15555404f01d in gomp_thread_start
        at /local/gg2995/spack-stage/spack-stage-gcc-10.2.0-ca5qgfppgl4tppua3svba6vehsuvjayp/spack-src/libgomp/team.c:123
#4  0x1555539bd1c9 in ???
#5  0x155553629e72 in ???
#6  0xffffffffffffffff in ???
real 91.97
user 526.29
sys 15.03
srun: error: g277: task 0: Exited with exit code 139

Please see the issue.

yantosca commented 1 month ago

I will transfer this issue to the GEOS-Chem "science codebase" repository. The GCClassic issue tracker is for issues pertaining to the GCClassic wrapper itself.

gopikrishnangs44 commented 1 month ago

@yantosca Thank you for the response. I will wait for the solution for the same.

yantosca commented 1 month ago

Thanks @gopikrishnangs44. I think your job might have exceeded the available memory on the node. Are you using cropped met field data for the nested-grid simulation? That will reduce both memory and run time. See the Crop netCDF files Chapter on ReadTheDocs for more information.

Also we have some information about [Segmentation faults and similar errors](https://geos-chem.readthedocs.io/en/latest/geos-chem-shared-docs/supplemental-guides/error-guide.html#segmentation-faults-and-similar-errors) on ReadTheDocs.

Could you attach the following to this issue?

Another way to reduce memory usage is to only archive the species that you need for diagnostics rather than all species. For example, if you only wanted to save out CO and O3 in the SpeciesConc collection, you can list the individual fields SpeciesConcVV_CO and SpeciesConcVV_O3 instead of SpeciesConc_?ADV?, which would save out all advected species.
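
As a rough illustration of that suggestion, the fields entry of the SpeciesConc collection in HISTORY.rc might look something like the fragment below. This is only a sketch: the exact layout and the other settings of the collection are assumptions and should be taken from the HISTORY.rc file in your own run directory.

```text
  SpeciesConc.fields:         'SpeciesConcVV_CO           ',
                              'SpeciesConcVV_O3           ',
::
```

Only the two listed species would then be archived in that collection, rather than every advected species.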

gopikrishnangs44 commented 1 month ago

I am trying to save the restart file only. I just tried a re-run. Now the error is

corrupted double-linked list

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x15555363eb4f in ???
#1  0x15555363eacf in ???
#2  0x155553611ea4 in ???
#3  0x15555367fcc6 in ???
#4  0x155553686fcb in ???
#5  0x15555368785b in ???
#6  0x155553688efa in ???
#7  0x5dcec8 in do_window_transport
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/transport_mod.F90:578
#8  0x5dcec8 in __transport_mod_MOD_do_transport
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/transport_mod.F90:220
#9  0x407a33 in geos_chem
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:1164
#10  0x405556 in main
        at /burg/fiore_new/users/gg2995/test1/gc_05x0625_merra2_fullchem/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:32
real 145.72
user 545.23
sys 16.11
srun: error: g123: task 0: Exited with exit code 134

Attaching the files for your reference

GC_run.log slurm-17814707.txt.txt geoschem_config.txt.txt

yantosca commented 1 month ago

Thanks @gopikrishnangs44. It very much seems like the second error described in this chapter on ReadTheDocs:

But I'm curious as you have the stacksize limits maxed out. What type of system are you using?

gopikrishnangs44 commented 1 month ago

I have tried increasing the stack size as described in the link.

export OMP_NUM_THREADS=32
export F_UFMTENDIAN=big
export OMP_STACKSIZE=3000m
ulimit -s unlimited
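
One way to confirm that these settings actually take effect inside the job is to print them from the Slurm script just before the executable is launched. The sketch below is illustrative only; the function name `report_stack_settings` is hypothetical and not part of GEOS-Chem.

```shell
#!/bin/bash
# Hypothetical helper: print the stack-related settings that the job sees.
report_stack_settings() {
    # Shell stack limit ("unlimited" or a size in kilobytes)
    echo "ulimit -s: $(ulimit -s)"
    # Per-thread OpenMP stack size and thread count, or "unset" if missing
    echo "OMP_STACKSIZE: ${OMP_STACKSIZE:-unset}"
    echo "OMP_NUM_THREADS: ${OMP_NUM_THREADS:-unset}"
}

report_stack_settings
```

If the printed values differ from what the environment file sets, the file is probably not being sourced in the batch job.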

I am using the Ginsburg cluster at Columbia: https://columbiauniversity.atlassian.net/wiki/spaces/rcs/pages/62141888/Ginsburg+-+Technical+Information

PS: I uploaded the wrong files in the previous comment, which is now edited.

gopikrishnangs44 commented 1 month ago

I am also attaching the Slurm script and environment file for your reference: gc_spack.txt sbatch_script.txt

gopikrishnangs44 commented 1 month ago

Hi @yantosca,

Does the Slurm script look okay for submitting the job?

gopikrishnangs44 commented 1 month ago

@yantosca

I had set Bufferzone_NSEW to [1,1,1,1], but it should be [3,3,3,3]. The smaller buffer zone was breaking the TPCORE advection scheme for the nested grid.
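
For anyone hitting the same crash, a quick way to check the current setting is to search the run directory's configuration file for it. The sketch below assumes the option is spelled `buffer_zone_NSEW` and lives in `geoschem_config.yml`; both are assumptions to verify against your own run directory, and the helper name `show_buffer_zone` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the buffer-zone line from a GEOS-Chem config file.
# Assumes the key is spelled "buffer_zone_NSEW" (matched case-insensitively).
show_buffer_zone() {
    # First argument is the config file; defaults to geoschem_config.yml
    local config_file="${1:-geoschem_config.yml}"
    grep -i "buffer_zone_NSEW" "$config_file"
}
```

Running `show_buffer_zone geoschem_config.yml` from the run directory prints the matching line, so the values can be compared with the [3,3,3,3] setting mentioned above.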

Thank you for the responses.