geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

[BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. #37

Closed joeylamcy closed 3 years ago

joeylamcy commented 4 years ago

Hi everyone,

I'm trying to run a 30-core, 1-day trial simulation with the 13.0.0-alpha.9 version, but the run stopped after ~1 simulation hour and exited with forrtl: error (73): floating divide by zero. The full log files are attached below. 163214_print_out.log 163214_error.log

More information:

I'm not sure how to troubleshoot this issue. I tried rebuilding the source code with -DCMAKE_BUILD_TYPE=Debug (with the fix in #35) and rerunning the simulation, but that produces a very large error log file, so I'm not attaching it here. The first few lines of the error log are:

forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image              PC                Routine            Line        Source
geos               00000000094A364E  Unknown               Unknown  Unknown
geos               00000000094F8D62  Unknown               Unknown  Unknown
geos               00000000094F6232  Unknown               Unknown  Unknown
geos               000000000226CC73  advcore_gridcompm         261  AdvCore_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos               0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos               0000000007F00B39  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               000000000844804D  Unknown               Unknown  Unknown
geos               0000000007EE2A0F  Unknown               Unknown  Unknown
geos               0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos               0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos               0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos               0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos               0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos               00000000004242FF  MAIN__                     29  GCHPctm.F90
geos               000000000042125E  Unknown               Unknown  Unknown
geos               000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFBC9F34505  __libc_start_main     Unknown  Unknown
geos               0000000000421169  Unknown               Unknown  Unknown
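Since the Debug error log is large and most frames are reported as Unknown, one way to skim it is to keep only the frames that carry a routine, line number, and source file. The snippet below is a minimal sketch of that filtering, not part of GCHP or MAPL. The helper name known_frames is hypothetical, and it assumes each frame row has exactly the five whitespace-separated columns shown in the header (Image, PC, Routine, Line, Source).

```python
def known_frames(log_text):
    """Return (routine, line, source) for frames that report a source line.

    Assumes the five-column forrtl traceback layout:
    Image, PC, Routine, Line, Source.
    """
    frames = []
    for raw in log_text.splitlines():
        parts = raw.split()
        # Keep only rows whose fourth column is a numeric line number;
        # this skips the header row and every "Unknown" frame.
        if len(parts) == 5 and parts[3].isdigit():
            frames.append((parts[2], int(parts[3]), parts[4]))
    return frames


# A few rows copied from the traceback above.
sample = """\
forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image              PC                Routine            Line        Source
geos               00000000094A364E  Unknown               Unknown  Unknown
geos               000000000226CC73  advcore_gridcompm         261  AdvCore_GridCompMod.F90
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
"""

for routine, line, source in known_frames(sample):
    print(f"{source}:{line} ({routine})")
```

Applied to the full log, this would reduce each traceback to the handful of frames in AdvCore_GridCompMod.F90, MAPL_Generic.F90, GCHP_GridCompMod.F90, MAPL_CapGridComp.F90, MAPL_Cap.F90 and GCHPctm.F90.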

I also noticed something weird towards the start of the run:

      MAPL: No configure file specified for logging layer.  Using defaults. 
     SHMEM: NumCores per Node = 6
     SHMEM: NumNodes in use   = 1
     SHMEM: Total PEs         = 6
     SHMEM: NumNodes in use  = 1

Previous versions (e.g., 12.8.2) usually show this instead:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6

 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1

but I'm not sure if that matters.

LiamBindle commented 3 years ago

Thanks, I can see them now! I'll follow up by the end of my day.

LiamBindle commented 3 years ago

Thanks @joeylamcy. Yeah, this is interesting. It looks like something is going wrong with the lat and lon coordinates in the diagnostics (the rest of the diagnostics look okay). I suspect something subtle is happening in HISTORY, so I've opened GEOS-ESM/MAPL#579.

lizziel commented 3 years ago

Hi @joeylamcy. Apologies if this has already been asked, but have you used your current environment successfully with older versions of GCHP?

joeylamcy commented 3 years ago

Yes. A very similar environment with minor changes is successful with GCHP v12.9.3. I'm able to obtain normal outputs of ozone concentration on a lat-lon grid.

lizziel commented 3 years ago

We have several versions of MAPL spread across the 13.0.0-alpha series. Would you be able to try GCHPctm 13.0.0-alpha.7? This uses an earlier version of MAPL than alpha.9 and would help determine if the problem came in with that update. Alpha versions that included updating to a new MAPL were 1, 3, 6, and 8.
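To make the bisection concrete, here is a small sketch that maps an alpha release to the most recent MAPL update it would include. It assumes MAPL is unchanged between the listed updates (1, 3, 6, and 8), which is an inference from the list above rather than something checked against the repository. The names MAPL_UPDATE_ALPHAS and mapl_update_for are hypothetical.

```python
from bisect import bisect_right

# Alpha releases that updated MAPL, as listed in the comment above.
MAPL_UPDATE_ALPHAS = [1, 3, 6, 8]


def mapl_update_for(alpha):
    """Return the alpha release whose MAPL update is in effect, or None.

    None means the release predates the first listed update.
    Assumption: MAPL is unchanged between the listed updates.
    """
    i = bisect_right(MAPL_UPDATE_ALPHAS, alpha)
    return MAPL_UPDATE_ALPHAS[i - 1] if i else None


for alpha in (0, 1, 5, 7, 9):
    print(f"alpha.{alpha}: MAPL update from alpha.{mapl_update_for(alpha)}")
```

Under that assumption, alpha.7 carries the MAPL update from alpha.6 and alpha.9 carries the one from alpha.8, so a run that fails on both would point to a cause older than the alpha.8 update.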

joeylamcy commented 3 years ago

I have tried alpha.5 and alpha.7 thus far, but the issue exists for both versions.

joeylamcy commented 3 years ago

I have also tried alpha.1 today and the issue exists as well. I can't get the cmake part going for alpha.0. Do you think it is important to test on alpha.0?

LiamBindle commented 3 years ago

Hi Joey, thanks for checking that. I don't think it's important you check alpha.0. Knowing you see it in alpha.1 tells us this goes back a while, and that it hasn't been introduced recently. We've had some discussion about this internally and with the MAPL developers, and we're pretty stumped on what could be happening.

Since the integrity of your diagnostics' variables appears to be okay, I'd suggest you proceed with your GCHP simulations. Afterwards, recalculate the grid's coordinates wherever you need them (e.g., for plotting). The easiest way to do this is with GCPy. The latest GCPy (dev/1.0.0 branch) has a command-line tool for adding grid-box corners to an existing diagnostic file. For example, to add the corner coordinates to a diagnostic file named GCHP.SpeciesConc.20180101z.nc4, you would run

$ python -m gcpy.append_grid_corners GCHP.SpeciesConc.20180101z.nc4

This adds the variables corner_lats and corner_lons to your dataset. You can use these new coordinates for plotting your data. Alternatively, if you use GCPy for plotting, it calculates the grid coordinates internally anyway, so you won't even need this step.
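As a quick sanity check after running the tool, you could confirm that both variables are present before plotting. The sketch below is only illustrative: the helper missing_corner_vars is hypothetical (it is not part of GCPy), and the other variable names in the example lists are placeholders. It works on any iterable of variable names, so it does not depend on a particular NetCDF library.

```python
# Variable names that the append_grid_corners tool is described as adding.
REQUIRED = ("corner_lats", "corner_lons")


def missing_corner_vars(variable_names):
    """Return the corner-coordinate variables absent from variable_names."""
    present = set(variable_names)
    return [name for name in REQUIRED if name not in present]


# Placeholder variable lists standing in for a file before and after the tool.
before = ["lats", "lons", "SpeciesConc_O3"]
after = before + ["corner_lats", "corner_lons"]

print(missing_corner_vars(before))  # both corner variables are reported missing
print(missing_corner_vars(after))   # nothing is reported missing
```

With a NetCDF reader installed, the names could come from the file itself, for example the keys of xarray.open_dataset("GCHP.SpeciesConc.20180101z.nc4").variables.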

In GCHP 13.1.0, these corner coordinates will be included in the diagnostics automatically. You usually need them to plot GCHP data anyway (since it's a curvilinear grid, center coordinates aren't sufficient for plotting the data), which is why many of us (including GCPy) calculate the corner coordinates offline.

I know this is a bit unsatisfactory, but I think this is the best way to proceed. Let me know if you have any questions.

LiamBindle commented 3 years ago

@joeylamcy I'm going to close this. Please don't hesitate to open new issues if you run into problems/questions!

zsx-GitHub commented 2 years ago

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

@lizziel Hi Lizzie and all, I recently ran into a similar issue. I forced the particulate matter concentration in GEOS-Chem to the measured value in the lowest 8 layers. After my revisions, the model runs smoothly if all HISTORY collections are turned off. The model stops with 'forrtl: error (73): floating divide by zero' if any of the HISTORY collections (e.g., SpeciesConc) is turned on. Does anyone have any ideas on how to fix this? Thank you!

Below I am pasting the output messages from slurm***.out:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
gcclassic          00000000011699CF  Unknown               Unknown  Unknown
libpthread-2.17.s  00002ADA281F5630  Unknown               Unknown  Unknown
libnetcdf.so.7.2.  00002ADA27881D9C  NC4_def_var           Unknown  Unknown
libnetcdf.so.7.2.  00002ADA27806FCB  nc_def_var            Unknown  Unknown
libnetcdff.so.6.0  00002ADA2717A70C  nf_defvar             Unknown  Unknown
gcclassic          0000000000FD5865  m_netcdf_io_defin         203  m_netcdf_io_define.F90
gcclassic          0000000001010914  ncdf_mod_mp_nc_va        3816  ncdf_mod.F90
gcclassic          000000000099AB0E  history_netcdf_mo         465  history_netcdf_mod.F90
gcclassic          00000000009970C7  history_mod_mp_hi        2885  history_mod.F90
gcclassic          0000000000412E39  MAIN                     2082  main.F90
gcclassic          000000000040C11E  Unknown               Unknown  Unknown
libc-2.17.so       00002ADA28424555  libc_start_main       Unknown  Unknown
gcclassic          000000000040C029  Unknown               Unknown  Unknown
/var/slurmd/spool/slurmd/job45642322/slurm_script: line 23: 94051 Aborted (core dumped) ./gcclassic - > $log

real 6m44.986s
user 104m8.109s
sys 1m46.121s

lizziel commented 2 years ago

@zsx-GitHub, please see new issue https://github.com/geoschem/geos-chem/issues/917 which I created for your issue.