geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

SLURM script returning segmentation fault error and missing HEMCO data #2454

Open bcraig99 opened 2 months ago

bcraig99 commented 2 months ago

Your name

Broderik Craig

Your affiliation

University of Utah

Please provide a clear and concise description of your question or discussion topic.

I'm running a full-chemistry simulation from 2018/12/01 to 2019/02/01 over the longitude range -112.30457 to -111.603284 and the latitude range 39.97557 to 41.52831.

My log file contains the following errors:

HEMCO ERROR: Cannot find file for current simulation time: path/to/ExtData/HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/GEOSChem.BoundaryConditions.20181201_0000z.nc4 - Cannot get field BC_ACET. Please check file name and time (incl. time range flag) in the config. file.

HEMCO ERROR: Error encountered in routine HCOIO_Read!

HEMCO ERROR: Error in HCOIO_DATAREAD called from HEMCO ReadList_Fill: BC_ACET
 --> LOCATION: ReadList_Fill (HCO_ReadList_Mod.F90)

HEMCO ERROR: Error in ReadList_Fill (4) called from HEMCO ReadList_Read
 --> LOCATION: ReadList_Read (HCO_ReadList_Mod.F90)
 Error in ReadList_Read called from hco_run
===============================================================================
GEOS-Chem ERROR: Error encountered in "HCO_Run"!
 -> at HCOI_GC_Run (in module GeosCore/hco_interface_gc_mod.F90)

THIS ERROR ORIGINATED IN HEMCO!  Please check the HEMCO log file for
additional error messages!
===============================================================================

and my SLURM error log shows

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7fc87eee7171 in ???
#1  0x7fc87eee6313 in ???
#2  0x7fc87df1eb4f in ???
#3  0x7fc87f0e4d4d in ???
#4  0x56ee7c in __hco_interface_gc_mod_MOD_hcoi_gc_run
        at HOME/gc_05x0625_CU_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/hco_interface_gc_mod.F90:1005
#5  0x4cfbd7 in __emissions_mod_MOD_emissions_run
        at HOME/gc_05x0625_CU_merra2_fullchem/CodeDir/src/GEOS-Chem/GeosCore/emissions_mod.F90:184
#6  0x406e62 in geos_chem
        at HOME/gc_05x0625_CU_merra2_fullchem/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:667
#7  0x40c2a7 in main
        at HOME/gc_05x0625_CU_merra2_fullchem/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:32
real 317.56
user 233.86
sys 14.38
srun: error: lp103: task 0: Exited with exit code 139

After checking the output of the data download (./download_data.py log.dryrun --washu), I see many instances that look like this:

--2024-09-06 12:23:06--  http://geoschemdata.wustl.edu/ExtData/HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/GEOSChem.BoundaryConditions.20190127_0000z.nc4
Resolving geoschemdata.wustl.edu (geoschemdata.wustl.edu)... 35.209.233.133
Connecting to geoschemdata.wustl.edu (geoschemdata.wustl.edu)|35.209.233.133|:80...
 connected.
HTTP request sent, awaiting response... 404 Not Found
2024-09-06 12:23:07 ERROR 404: Not Found.

--2024-09-06 12:23:07--  http://geoschemdata.wustl.edu/ExtData/HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/GEOSChem.BoundaryConditions.20190128_0000z.nc4
Resolving geoschemdata.wustl.edu (geoschemdata.wustl.edu)... 35.209.233.133
Connecting to geoschemdata.wustl.edu (geoschemdata.wustl.edu)|35.209.233.133|:80...
 connected.
HTTP request sent, awaiting response... 404 Not Found
2024-09-06 12:23:07 ERROR 404: Not Found.

--2024-09-06 12:23:07--  http://geoschemdata.wustl.edu/ExtData/HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/GEOSChem.BoundaryConditions.20190129_0000z.nc4
Resolving geoschemdata.wustl.edu (geoschemdata.wustl.edu)... 35.209.233.133
Connecting to geoschemdata.wustl.edu (geoschemdata.wustl.edu)|35.209.233.133|:80...
 connected.
HTTP request sent, awaiting response... 404 Not Found
2024-09-06 12:23:07 ERROR 404: Not Found.

--2024-09-06 12:23:07--  http://geoschemdata.wustl.edu/ExtData/HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/GEOSChem.BoundaryConditions.20190130_0000z.nc4
Resolving geoschemdata.wustl.edu (geoschemdata.wustl.edu)... 35.209.233.133
Connecting to geoschemdata.wustl.edu (geoschemdata.wustl.edu)|35.209.233.133|:80...
 connected.
HTTP request sent, awaiting response... 404 Not Found
2024-09-06 12:23:07 ERROR 404: Not Found.

I'm solidly stumped; any feedback is appreciated.

yantosca commented 2 months ago

Thanks for writing @bcraig99. The HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/ folder contains a single boundary condition file that we use for running integration tests on the nested-grid models. I believe it only contains 1 day of data (or maybe even less, I haven't checked it in a while). What is probably happening is that your simulation has moved beyond the last time in the boundary conditions file, and thus has thrown an error.
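
One quick way to confirm this (assuming the netCDF utilities are available on your cluster) is to dump the time coordinate of that boundary condition file and see which timestamps it actually contains, e.g.:

ncdump -t -v time path/to/ExtData/HEMCO/SAMPLE_BCs/GC_14.3.0/fullchem/GEOSChem.BoundaryConditions.20181201_0000z.nc4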

If you plan on doing a nested-grid simulation, you must first run a global simulation in order to save out boundary conditions (frequency: 3 hrs, duration: 24 hrs) that will be applied at the edges of your nested domain. We have instructions on how to do this in the nested-grid guide on ReadTheDocs.
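
In short, in the global run you enable the BoundaryConditions collection in HISTORY.rc (make sure it is uncommented in the COLLECTIONS list). In recent 14.x versions the relevant entries look roughly like this (the exact syntax may differ slightly between versions, so treat it as an illustration):

  BoundaryConditions.template:   '%y4%m2%d2_%h2%n2z.nc4',
  BoundaryConditions.frequency:  00000000 030000
  BoundaryConditions.duration:   00000001 000000
  BoundaryConditions.mode:       'instantaneous'
  BoundaryConditions.fields:     'SpeciesBC_?ADV?',

i.e. instantaneous boundary conditions written every 3 hours, with one output file per day.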

BTW, a segmentation fault means that the program tried to access a memory location that does not exist. It can be a side effect of exiting a simulation with an error. For more information about this and other types of errors, see our documentation on ReadTheDocs.

bcraig99 commented 2 months ago

Thanks @yantosca! I can see where it went wrong. I'm new to GEOS-Chem; what is the difference between running a nested-grid model and other kinds of simulations?

yantosca commented 2 months ago

Thanks @bcraig99. A nested-grid simulation is one in which GEOS-Chem Classic runs over a limited region of the globe rather than over the whole globe, as a global simulation does. With a nested-grid simulation you can run at very fine resolution (0.25 x 0.3125 degrees or 0.5 x 0.625 degrees). Because running at fine resolution is computationally intensive, the trade-off is to run only over the region of the globe that you are interested in.

If you are new to GEOS-Chem, I would recommend reading through the https://geos-chem.readthedocs.io manual, since it goes into great detail about the options you can use with GEOS-Chem.

Thanks again and happy modeling!

bcraig99 commented 2 months ago

I created a gc_4x5_merra2_fullchem simulation and followed the instructions up to step 5 here: https://geos-chem.readthedocs.io/en/latest/supplemental-guides/nested-grid-guide.html. After running the GEOS-Chem executable I get this:

Getting CH4 boundary conditions in GEOS-Chem from :NOAA_GMD_CH4
HEMCO (VOLCANO): Opening /path/to/ExtData/HEMCO/VOLCANO/v2024-04/2019/07/so2_volcanic_emissions_Carns.20190701.rc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 85032 on node notch081 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
yantosca commented 2 months ago

Thanks for writing @bcraig99. You should not try to use mpirun to run GEOS-Chem Classic. That might be the cause of your error. mpirun would only be needed if you run GCHP.
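
GEOS-Chem Classic is parallelized with OpenMP (shared memory), so it should be launched directly as a plain executable, with the thread count controlled by environment variables. A minimal SLURM sketch (the resource requests below are placeholders; adjust them for your cluster):

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 24
#SBATCH -t 24:00:00
#SBATCH --mem=50G

ulimit -s unlimited                            # unlimited stack size, commonly needed for GEOS-Chem Classic
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    # number of OpenMP threads
export OMP_STACKSIZE=500m                      # per-thread stack size recommended in the GEOS-Chem docs

./gcclassic | tee GC.log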

bcraig99 commented 2 months ago

What would you recommend using to run GEOS-Chem Classic?

lizziel commented 2 months ago

Hi @bcraig99, what command are you using to run GEOS-Chem? Have you tried compiling with debug flags and enabling maximum prints to log? We have a debug guide on ReadTheDocs that goes over all the strategies to figure out what is going wrong. See https://geos-chem.readthedocs.io/en/stable/geos-chem-shared-docs/supplemental-guides/debug-guide.html.
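
For example, a rebuild with debug flags (assuming the usual build/ sub-directory inside the run directory; adjust paths to your setup) looks roughly like this:

cd /path/to/gc_05x0625_CU_merra2_fullchem/build
cmake ../CodeDir -DRUNDIR=.. -DCMAKE_BUILD_TYPE=Debug
make -j
make install

The Debug build adds run-time checks such as array bounds checking, which usually turns a bare segmentation fault into a much more specific error message.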

bcraig99 commented 2 months ago

mpirun -np 1 ./gcclassic | tee GC.log

I abandoned the fullchem model and started running just an aerosol model. The global simulation that produces my boundary condition files seemed to run fine, but my nested-grid simulation works until the last day of the simulation, where it throws the following error:

 ---> DATE: 2019/07/31  UTC: 23:55
     - Creating file for Aerosols; reference = 20190701 000000
        with filename = OutputDir/GEOSChem.Aerosols.20190701_0000z.nc4
     - Creating file for AerosolMass; reference = 20190701 000000
        with filename = OutputDir/GEOSChem.AerosolMass.20190701_0000z.nc4
     - Creating file for SpeciesConc; reference = 20190701 000000
        with filename = OutputDir/GEOSChem.SpeciesConc.20190701_0000z.nc4
     - Creating file for Restart; reference = 20190801 000000
        with filename = ./Restarts/GEOSChem.Restart.20190801_0000z.nc4
---> DATE: 2019/08/01  UTC: 00:00
 GET_BOUNDARY_CONDITIONS: Done reading BCs at 2019/08/01 00:00 using            0           1
corrupted size vs. prev_size

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7f3b2ed39171 in ???
#1  0x7f3b2ed38313 in ???
#2  0x7f3b2dd70b4f in ???
#3  0x7f3b2dd70acf in ???
#4  0x7f3b2dd43ea4 in ???
#5  0x7f3b2ddb1cd6 in ???
#6  0x7f3b2ddb8fdb in ???
#7  0x7f3b2ddb9885 in ???
#8  0x7f3b2ddbaf0a in ???
#9  0x90e940 in __phot_container_mod_MOD_cleanup_phot_container
        at path/to/gc_05x0625_CU_merra2_aerosol/CodeDir/src/GEOS-Chem/Headers/phot_container_mod.F90:732
#10  0x8968fe in __state_chm_mod_MOD_cleanup_state_chm
        at path/to/gc_05x0625_CU_merra2_aerosol/CodeDir/src/GEOS-Chem/Headers/state_chm_mod.F90:3087
#11  0x407863 in geos_chem
        at path/to/gc_05x0625_CU_merra2_aerosol/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:1983
#12  0x404566 in main
        at path/to/gc_05x0625_CU_merra2_aerosol/CodeDir/src/GEOS-Chem/Interfaces/GCClassic/main.F90:32
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3205599 on node notch081 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent this issue from being closed.

yantosca commented 1 week ago

Thanks @bcraig99. Sorry for the late reply.

There is a technical description of the "corrupted size vs. prev_size" error at this Stack Overflow post. TL;DR: it can be caused by an out-of-bounds write to an array that is being deallocated. This corrupts the heap, which triggers the abort signal.

You can try reconfiguring with cmake -DCMAKE_BUILD_TYPE=Debug ...etc..., which will turn on array bounds checking (among other debug options). This will stop the run as soon as an array index goes out of bounds.

We would also suggest migrating to GEOS-Chem 14.5.0, which uses the most recent version of Cloud-J photolysis (as that is where the error occurred).