Closed: @arianatribby closed this issue 1 year ago
Hi @arianatribby, thanks for writing. Did your run drop a core file (such as core.12345, where 12345 is the process ID)? If so, you can type
$ gdb gcclassic core.12345
and that will open the GDB debugger and take you to the place where the program stopped. You can also try running the code under the GDB debugger; type:
$ gdb gcclassic
and then at the GDB command line, type run. Then you can type where to see where the code stopped.
I wonder if there was a leftover STOP command from debugging in the code you grabbed. Luckily, we are about to release the official 14.0.0 version very soon, so you can also try with that version.
Thanks! I decided to try this with v13.4.1 (with c5x9, 500 GB storage), and I get the exact same error. I tried your suggestion with gdb; here is the output:
$$ Finished Reading Linoz Data $$
HISTORY (INIT): Opening ./HISTORY.rc
Program terminated with signal SIGKILL, Killed.
The program no longer exists.
(gdb) where
No stack.
(gdb)
Forgot to include relevant files:
run.log.txt input.geos.txt HISTORY.rc.txt HEMCO_Config.rc.txt
Thanks @arianatribby. Usually when you get a SIGKILL error it means the Linux kernel forcibly terminated your program. This can happen if you are running out of memory (see: https://stackoverflow.com/questions/12288550/c-linux-binary-terminated-with-signal-sigkill-why-loaded-in-gdb).
Have you tried using fewer cores with the same AMI? The more cores you use, the more stack memory is required. Try with half the number of cores and see if your run gets through.
Also, if your run dies with SIGKILL again, you can type dmesg to see the kernel's error messages. That might help.
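The dmesg output can be long, so it helps to filter for the out-of-memory killer lines specifically. Here is a minimal Python sketch; the sample log text, PID, and process name are made up for illustration (in practice you would feed in real `dmesg` output):

```python
import re

# Synthetic dmesg-style lines; in practice, pipe in the output of `dmesg`.
SAMPLE_DMESG = """\
[12345.678] CPU0: Core temperature above threshold
[12350.123] Out of memory: Killed process 4321 (gcclassic) total-vm:71234567kB
[12350.125] oom_reaper: reaped process 4321 (gcclassic)
"""

def find_oom_kills(text):
    """Return (pid, name) pairs for OOM-killer victims found in dmesg text."""
    pattern = re.compile(r"Out of memory: Killed process (\d+) \((\S+)\)")
    return [(int(m.group(1)), m.group(2)) for m in pattern.finditer(text)]

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE_DMESG):
        print(f"OOM killer terminated PID {pid} ({name})")
```

If a line like this names your executable, the run was killed by the kernel for exhausting memory, which matches the SIGKILL symptom above.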
Also tagging @Jourdan-He for reference.
Thanks @yantosca for your help. I changed the instance to the r5 family (4 GB per vCPU) and I got past this issue. I had assumed the nested run would require similar memory to the fine-resolution runs.
Now, I am running into negative values for ACET before there is an error in mixing_mod.F90. I've done a year of spin-up but only recently decided to try the nested and didn't generate boundary diagnostics when I ran the spin up. But I used the restart file from the year spin up for the run that generated boundary condition diagnostics. I compared the max/min of ACET for the restart file (from the year spin-up) and the max/min is very similar to that of the boundary condition file.
I have read several other solutions from issues #1138 and #969 and tried both things: I checked the boundary conditions for ACET and they are free of negative or NaN values, and I also tried reducing the time step in input.geos from 300/600 to 200/400, but that did not help.
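A check like the one described above can be scripted. This is a minimal Python sketch using a synthetic array; in a real check the data would be loaded from the boundary-condition netCDF file (the variable name SpeciesBC_ACET mentioned in the comment is an assumption about the file's naming):

```python
import numpy as np

def check_species(name, data):
    """Report any negative or NaN values in a species array.

    In practice `data` would come from the BC netCDF file, e.g. via
    xarray (hypothetical variable name):
        xr.open_dataset("GEOSChem.BoundaryConditions.nc4")["SpeciesBC_ACET"].values
    """
    arr = np.asarray(data, dtype=float)
    n_nan = int(np.isnan(arr).sum())
    n_neg = int((arr < 0).sum())
    print(f"{name}: min={np.nanmin(arr):.3e} max={np.nanmax(arr):.3e} "
          f"NaNs={n_nan} negatives={n_neg}")
    return n_nan == 0 and n_neg == 0

if __name__ == "__main__":
    # Synthetic stand-in for an ACET mixing-ratio field [mol/mol]
    acet = np.full((2, 3, 4), 1.0e-10)
    print("clean" if check_species("ACET", acet) else "problem values found")
```

Running this over every timestep of the BC file confirms whether the negatives originate in the inputs or arise during the run itself.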
The production run is only 2 hours, from 2016070103 to 2016070105. When I initially generated the boundary conditions, I did a 4x5 run from 2016070100 to 2016070103, so I use the file for 2016070103. Do you think the really short time period has caused a problem? I am aware the boundary condition file is read in every 3 hours; that is why the initial run was from 00-03 and the production run is 03-05. But maybe the 4x5 run from 00-03 was not enough time to generate a good enough file?
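The 3-hourly read cadence reasoned through above can be sanity-checked with simple date arithmetic. A minimal Python sketch follows; the snap-down-to-previous-read behavior is an assumption about how the reader selects files, not a statement of GEOS-Chem internals:

```python
from datetime import datetime, timedelta

def bc_times_needed(start, end, cadence_hours=3):
    """List the boundary-condition timestamps a run window requires,
    assuming BCs are read at a fixed cadence (00, 03, 06, ...)."""
    step = timedelta(hours=cadence_hours)
    # Snap the run start down to the previous BC read time
    first = start.replace(minute=0, second=0, microsecond=0)
    first -= timedelta(hours=first.hour % cadence_hours)
    times, t = [], first
    while t < end:
        times.append(t)
        t += step
    return times

if __name__ == "__main__":
    run_start = datetime(2016, 7, 1, 3)   # 2016070103
    run_end = datetime(2016, 7, 1, 5)     # 2016070105
    for t in bc_times_needed(run_start, run_end):
        print(t.strftime("%Y%m%d%H"))     # prints 2016070103
```

Under this assumption the 03-05 run needs only the 2016070103 boundary file, consistent with the setup described above.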
Here is the error:
###############################################################################
# Interpolating Linoz fields for jul
###############################################################################
- LINOZ_CHEM3: Doing LINOZ
===============================================================================
Successfully initialized ISORROPIA code II
===============================================================================
---> DATE: 2016/07/01 UTC: 03:03 X-HRS: 0.055556
Min and Max of each species in BC file [mol/mol]:
GET_BOUNDARY_CONDITIONS: Done applying BCs at 2016/07/01 03:03
---> DATE: 2016/07/01 UTC: 03:06 X-HRS: 0.111111
- DO_LINEAR_CHEM: Linearized chemistry at 2016/07/01 03:06
- LINOZ_CHEM3: Doing LINOZ
---> DATE: 2016/07/01 UTC: 03:10 X-HRS: 0.166667
WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 2
WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 3
WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 4
WARNING: Negative concentration for species ACET at (I,J,L) = 303 155 5
and it does that for a long time, then
===============================================================================
GEOS-Chem ERROR:
-> at DO_TEND (in module GeosCore/mixing_mod.F90)
===============================================================================
===============================================================================
GEOS-Chem ERROR: Error encountred in "DO_TEND"!
-> at DO_MIXING (in module GeosCore/mixing_mod.F90)
===============================================================================
===============================================================================
GEOS-CHEM ERROR: Error encountered in "Do_Mixing"!
STOP at -> at GEOS-Chem (in GeosCore/main.F90)
===============================================================================
Here are some relevant files: run.log.txt input.geos.txt HEMCO.log.txt HISTORY.rc.txt HEMCO_Config.rc.gmao_metfields.txt HEMCO_Config.rc.txt
Thanks so much for your help.
Also tagging @SaptSinha for reference
Thanks for the feedback @arianatribby. Glad the memory issue is sorted out with the larger node. I have used the c5.*xlarge instances, but maybe the r5 family has more memory per vCPU. (I also don't run nested simulations on the cloud much, other than for testing.)
Now, I am running into negative values for ACET before there is an error in mixing_mod.F90. I've done a year of spin-up but only recently decided to try the nested and didn't generate boundary diagnostics when I ran the spin up. But I used the restart file from the year spin up for the run that generated boundary condition diagnostics. I compared the max/min of ACET for the restart file (from the year spin-up) and the max/min is very similar to that of the boundary condition file.
I would think that would be fine. The boundary conditions are instantaneous concentrations, just like those in the restart file.
Did you use the restart file from the global run at 2016070103 to start the nested simulation? If not, then try that. That way you'll have a consistent record; HEMCO will regrid the file to your nested domain.
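HEMCO performs the regridding and cropping automatically when it reads the global restart, but the selection it makes can be illustrated with plain index arithmetic. A minimal Python sketch (the grid spacing and nested-domain bounds below are hypothetical):

```python
import numpy as np

def domain_indices(lats, lons, lat_bounds, lon_bounds):
    """Return index slices selecting a nested domain from global grid axes.

    This only illustrates the cropping step; HEMCO also interpolates the
    fields onto the finer nested grid.
    """
    lat_idx = np.where((lats >= lat_bounds[0]) & (lats <= lat_bounds[1]))[0]
    lon_idx = np.where((lons >= lon_bounds[0]) & (lons <= lon_bounds[1]))[0]
    return (slice(lat_idx[0], lat_idx[-1] + 1),
            slice(lon_idx[0], lon_idx[-1] + 1))

if __name__ == "__main__":
    lats = np.arange(-90.0, 90.1, 4.0)    # 4x5 global grid axes
    lons = np.arange(-180.0, 176.0, 5.0)
    # Hypothetical nested domain over North America
    lat_sl, lon_sl = domain_indices(lats, lons, (10.0, 60.0), (-130.0, -60.0))
    print(lat_sl, lon_sl)
```

The point is simply that the global restart fully covers any nested domain, so starting the nested run from the global 2016070103 restart keeps the record consistent.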
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent the Stale bot from closing this issue.
Closing out this issue
What institution are you from?
Caltech
Description of the problem
It is my first time attempting a nested simulation. I created a run directory specifically for a nested simulation and also edited geoschem_config.yml. When the run starts to read HISTORY.rc, it stops without any errors. I changed verbose/warning to 3 in HEMCO_Config.rc and set debug_printout to true in geoschem_config.yml, but no additional errors or comments are printed where the run stops. Here is the output to the terminal right before the run stops:
Description of troubleshooting performed
Since there is no error from GEOS-Chem, perhaps it is on the software or hardware end. I have a cx5.9 instance (36 CPUs and 72 GB memory, plus 400 GB storage). I don't see a problem with running out of CPU or memory on AWS CloudWatch. However, I am using ami-0491da4eeba0fe986 - is this the appropriate AMI for this run?
GEOS-Chem version
14.0.0-rc.2 AWS
Description of modifications
Here are the edits to config files for the nested run:
I followed the steps to create the run directory for nested using this link: https://geos-chem.readthedocs.io/en/latest/gcc-guide/02-build/rundir-fullchem.html?highlight=nested
(So I selected 0.5x0.625 resolution during the run dir creation, then selected custom grid domain).
Log files
run.log.txt HISTORY.rc.txt HEMCO_Config.rc.txt geoschem_config.yml.txt
Software versions