geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

[BUG/ISSUE] Nested run on 14.0.0-rc.2 on AWS #1373

Closed. arianatribby closed this issue 1 year ago.

arianatribby commented 1 year ago

What institution are you from?

Caltech

Description of the problem

This is my first time attempting a nested simulation. I created a run directory specifically for a nested simulation and also edited geoschem_config.yml. When the run starts to read HISTORY.rc, it stops without any error message. I set verbose/warning to 3 in HEMCO_Config.rc and debug_printout to true in geoschem_config.yml, but no additional errors or comments are printed where the run stops. Here is the terminal output right before the run stops:

$$ Finished Reading Linoz Data $$

HISTORY (INIT): Opening ./HISTORY.rc
(base) ubuntu@ip-172-31-65-111:

Description of troubleshooting performed

Since there is no error from GEOS-Chem, perhaps the problem is on the software or hardware end. I have a c5.9xlarge instance (36 vCPUs and 72 GB of memory, plus 400 GB of storage). I don't see any sign of running out of CPU or memory in AWS CloudWatch. However, I am using ami-0491da4eeba0fe986 — is this the appropriate AMI for this run?
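A few quick shell checks on the instance itself can confirm what resources the run actually sees (a sketch; the exact numbers depend on the instance type):

```shell
# Confirm what the instance actually exposes; a nested fullchem run can
# exhaust stack memory even when CloudWatch shows free RAM.
nproc        # number of vCPUs visible to the OS
free -g      # total and available memory in GB
ulimit -s    # per-process stack limit in KB ("unlimited" is safest for GEOS-Chem)
```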

GEOS-Chem version

14.0.0-rc.2 AWS

Description of modifications

Here are the edits to config files for the nested run:

#============================================================================
# Simulation settings
#============================================================================
simulation:
  name: fullchem
  start_date: [20160701, 010000]
  end_date: [20160701, 030000]
  root_data_dir: /home/ubuntu/ExtData
  met_field: MERRA2
  species_database_file: ./species_database.yml
  debug_printout: true
  use_gcclassic_timers: false

#============================================================================
# Grid settings
#============================================================================
grid:
  resolution: 0.5x0.625
  number_of_levels: 72
  longitude:
    range: [-180.0, 10.0]
    center_at_180: true
  latitude:
    range: [10.0, 90.0]
    half_size_polar_boxes: true
  nested_grid_simulation:
    activate: true
    buffer_zone_NSEW: [3, 3, 3, 3]
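For scale, the grid settings above describe a fairly large domain. A rough box count (illustrative Python; the species count is an assumption, since fullchem's exact total isn't given here) shows why memory becomes the limiting factor:

```python
# Back-of-envelope size of the nested domain above (illustrative only;
# the species count is assumed, and GEOS-Chem allocates many more arrays
# than the single 4-D field estimated here).
dlon, dlat, levels = 0.625, 0.5, 72
nlon = int((10.0 - (-180.0)) / dlon)   # longitude boxes over [-180, 10]
nlat = int((90.0 - 10.0) / dlat)       # latitude boxes over [10, 90]
boxes = nlon * nlat * levels
n_species = 300                        # assumed order of magnitude for fullchem

gb = boxes * n_species * 8 / 1e9       # REAL*8 values
print(nlon, nlat, boxes)               # 304 160 3502080
print(f"~{gb:.1f} GB for one 4-D species array")
```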

I followed the steps to create the run directory for a nested simulation using this link: https://geos-chem.readthedocs.io/en/latest/gcc-guide/02-build/rundir-fullchem.html?highlight=nested

(So I selected 0.5x0.625 resolution during the run dir creation, then selected custom grid domain).

Log files

run.log.txt HISTORY.rc.txt HEMCO_Config.rc.txt geoschem_config.yml.txt

Software versions

yantosca commented 1 year ago

Hi @arianatribby, thanks for writing. Did your run drop a core file (such as core.12345, where 12345 is the process ID)? If so, then you can type

$ gdb gcclassic core.12345

and that will open the GDB debugger and take you to the place where the run stopped. You can also try running the code under the GDB debugger. Type:

$ gdb gcclassic

and then at the GDB command line, type run. When the code stops, type where to see where it stopped.
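Note that core dumps are often disabled by default, so it may be worth enabling them before re-running (the core file name below is just a placeholder):

```shell
# Allow core dumps so a crash actually leaves a core.<PID> file behind;
# many distros set the core-file size limit to 0 by default.
ulimit -c unlimited
ulimit -c            # verify: should now print "unlimited"
# Then re-run gcclassic; after a crash, inspect the core file, e.g.:
#   gdb ./gcclassic core.12345    (placeholder PID; then type "where")
```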

I wonder if there was a leftover STOP command from debugging in the code you grabbed. Luckily, we are about to release the official 14.0.0 version very soon, so you will be able to try that version as well.

arianatribby commented 1 year ago

Thanks! I decided to try this with v13.4.1 (on a c5.9xlarge with 500 GB of storage), and I get the exact same behavior. I tried your suggestion with gdb; here is the output:

$$ Finished Reading Linoz Data $$

HISTORY (INIT): Opening ./HISTORY.rc

Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) where
No stack.
(gdb) 
arianatribby commented 1 year ago

Forgot to include relevant files:

run.log.txt input.geos.txt HISTORY.rc.txt HEMCO_Config.rc.txt

yantosca commented 1 year ago

Thanks @arianatribby. Usually a SIGKILL means that the Linux kernel forcibly terminated your program. This can happen if you are running out of memory (see: https://stackoverflow.com/questions/12288550/c-linux-binary-terminated-with-signal-sigkill-why-loaded-in-gdb).

Have you tried using fewer cores with the same AMI? The more cores you use, the more stack memory is required. Try half the number of cores and see if your run gets through.
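With OpenMP builds, the thread count and stack sizes are controlled from the environment. One way to try half the cores (values here assume the 36-vCPU instance described above):

```shell
# Run with half the cores and a generous per-thread stack; GEOS-Chem's
# OpenMP loops need a lot of stack memory at nested resolution.
export OMP_NUM_THREADS=18    # half of the 36 vCPUs on this instance
export OMP_STACKSIZE=500m    # per-thread OpenMP stack commonly used with GEOS-Chem
ulimit -s unlimited 2>/dev/null || true
echo "$OMP_NUM_THREADS threads, stack $OMP_STACKSIZE"
```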

Also, if your run dies with SIGKILL again, you can type dmesg to look for a kernel error message. That might help.
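If the out-of-memory killer is responsible, the kernel log usually says so explicitly; a quick filter (output will vary, and may require sudo on some systems):

```shell
# Look for out-of-memory kills in the kernel ring buffer after a SIGKILL.
dmesg | grep -iE "out of memory|oom|killed process" | tail -n 5 || true
```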

yantosca commented 1 year ago

Also tagging @Jourdan-He for reference.

arianatribby commented 1 year ago

Thanks @yantosca for your help. I changed the instance to the r5 family (4 GB of memory per vCPU) and got past this issue. I had assumed the nested run would require memory similar to the other fine-resolution runs.

Now I am running into negative values for ACET before an error in mixing_mod.F90. I did a year of spin-up, but since I only recently decided to try the nested simulation, I did not generate boundary condition diagnostics during the spin-up. Instead, I used the restart file from the year of spin-up for the run that generated the boundary condition diagnostics. I compared the max/min of ACET in the restart file (from the year of spin-up), and it is very similar to the max/min in the boundary condition file.

I have read several other solutions in issues #1138 and #969 and tried both suggestions: I checked the boundary conditions for ACET, and they contain no negative or NaN values; I also tried reducing the time steps in input.geos from 300/600 to 200/400, but that did not help.
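The negative/NaN check can be sketched as follows (synthetic data here; in practice the array would be read from the nested boundary condition file, e.g. a SpeciesBC_ACET variable, with xarray or netCDF4 — those names are from my setup):

```python
import numpy as np

def check_species(arr, name="ACET"):
    """Summarize negatives and NaNs in a concentration array [mol/mol].
    In practice `arr` would be read from the boundary-condition file
    (e.g. variable SpeciesBC_ACET) with xarray or netCDF4."""
    arr = np.asarray(arr, dtype=float)
    return {
        "species": name,
        "min": float(np.nanmin(arr)),
        "max": float(np.nanmax(arr)),
        "n_negative": int((arr < 0).sum()),
        "n_nan": int(np.isnan(arr).sum()),
    }

# Synthetic example with clean mixing ratios:
print(check_species([1.0e-9, 5.0e-10, 2.0e-9]))
```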

The production run is only for 2 hours, from 2016070103 to 2016070105. When I initially generated the boundary conditions, I did a 4x5 run from 2016070100 to 2016070103, so I use the file for 2016070103. Do you think the very short time period caused a problem? I am aware the boundary condition file is read every 3 hours, which is why the initial run was from 00-03 and the production run is from 03-05, but maybe the 4x5 run from 00-03 was not long enough to generate a good file?

Here is the error:

###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
     - LINOZ_CHEM3: Doing LINOZ                                                 
=============================================================================== 
Successfully initialized ISORROPIA code II                                      
=============================================================================== 
---> DATE: 2016/07/01  UTC: 03:03  X-HRS:      0.055556                         
Min and Max of each species in BC file [mol/mol]:
GET_BOUNDARY_CONDITIONS: Done applying BCs at 2016/07/01 03:03
---> DATE: 2016/07/01  UTC: 03:06  X-HRS:      0.111111
     - DO_LINEAR_CHEM: Linearized chemistry at 2016/07/01 03:06
     - LINOZ_CHEM3: Doing LINOZ
---> DATE: 2016/07/01  UTC: 03:10  X-HRS:      0.166667
 WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 2
 WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 3
 WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 4
 WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 5

and it does that for a long time, then

===============================================================================
GEOS-Chem ERROR:
 -> at DO_TEND (in module GeosCore/mixing_mod.F90)
===============================================================================
===============================================================================
GEOS-Chem ERROR: Error encountred in "DO_TEND"!
 -> at DO_MIXING (in module GeosCore/mixing_mod.F90)
===============================================================================
===============================================================================
GEOS-CHEM ERROR: Error encountered in "Do_Mixing"!
STOP at  -> at GEOS-Chem (in GeosCore/main.F90)
===============================================================================

Here are some relevant files: run.log.txt input.geos.txt HEMCO.log.txt HISTORY.rc.txt HEMCO_Config.rc.gmao_metfields.txt HEMCO_Config.rc.txt

Thanks so much for your help.

yantosca commented 1 year ago

Also tagging @SaptSinha for reference

yantosca commented 1 year ago

Thanks for the feedback @arianatribby. Glad the memory issue is sorted out with the larger node. I have used the c5.*xlarge instances, but the r5 family has more memory per vCPU. (I also don't run nested simulations on the cloud very much, other than for testing.)

> Now, I am running into negative values for ACET before there is an error in mixing_mod.F90. I've done a year of spin-up but only recently decided to try the nested and didn't generate boundary diagnostics when I ran the spin up. But I used the restart file from the year spin up for the run that generated boundary condition diagnostics. I compared the max/min of ACET for the restart file (from the year spin-up) and the max/min is very similar to that of the boundary condition file.

I would think that would be fine. The boundary conditions are instantaneous concentrations, just like those in the restart file.

Did you use the restart file from the global run at 2016070103 to start the nested simulation? If not, try that. That way you'll have a consistent record; HEMCO will regrid the file to your nested domain.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.

yantosca commented 1 year ago

Closing out this issue