GEOS-ESM / ESMA-Baselibs

Base Libraries for the GEOS ESM

Error raised by Zoltan_Malloc when running GEOSgcm #68

Closed: gmao-cda closed this issue 2 years ago

gmao-cda commented 2 years ago

Hi Matt @mathomp4, thank you very much for preparing a test case for me to run on the ESSIC server! @sanAkel kindly taught me how to make a test run with the GEOSgcm this morning, but my GEOSgcm run exited with an error when I used 6 processes. Could you give me some suggestions? Thank you!

Building env:

The same environment as described in https://github.com/GEOS-ESM/GEOSgcm/issues/446.

Running command

(cda_suite) [cda@measures1 scratch]$ /data2/cda/pkg/openmpi-4.1.4-gcc12.1.0/bin/mpirun  -np 6 ./GEOSgcm.x

Error message

 EXTDATA: Updating R bracket for TR_LAI_FRAC
   EXTDATA:  ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
Zoltan_Malloc (from /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c,89) No space on proc 0 - number of bytes requested = 116655624
[0] Zoltan ERROR in Zoltan_RB_Build_Structure (line 92 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c):  Insufficient memory.
[0] Zoltan ERROR in Zoltan_RCB_Build_Structure (line 91 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb_util.c):  Error returned from Zoltan_RB_Build_Structure.
[0] Zoltan ERROR in rcb_fn (line 440 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb.c):  Error returned from Zoltan_RCB_Build_Structure.
[0] Zoltan ERROR in Zoltan_LB (line 388 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/lb_balance.c):  Partitioning routine returned code -2.
Zoltan_Malloc (from /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c,89) No space on proc 1 - number of bytes requested = 116655624
[1] Zoltan ERROR in Zoltan_RB_Build_Structure (line 92 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c):  Insufficient memory.
[1] Zoltan ERROR in Zoltan_RCB_Build_Structure (line 91 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb_util.c):  Error returned from Zoltan_RB_Build_Structure.
[1] Zoltan ERROR in rcb_fn (line 440 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb.c):  Error returned from Zoltan_RCB_Build_Structure.
[1] Zoltan ERROR in Zoltan_LB (line 388 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/lb_balance.c):  Partitioning routine returned code -2.
Zoltan_Malloc (from /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c,89) No space on proc 2 - number of bytes requested = 116655624
[2] Zoltan ERROR in Zoltan_RB_Build_Structure (line 92 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c):  Insufficient memory.
[2] Zoltan ERROR in Zoltan_RCB_Build_Structure (line 91 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb_util.c):  Error returned from Zoltan_RB_Build_Structure.
[2] Zoltan ERROR in rcb_fn (line 440 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb.c):  Error returned from Zoltan_RCB_Build_Structure.
[2] Zoltan ERROR in Zoltan_LB (line 388 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/lb_balance.c):  Partitioning routine returned code -2.
Zoltan_Malloc (from /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c,89) No space on proc 3 - number of bytes requested = 116655624
[3] Zoltan ERROR in Zoltan_RB_Build_Structure (line 92 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c):  Insufficient memory.
[3] Zoltan ERROR in Zoltan_RCB_Build_Structure (line 91 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb_util.c):  Error returned from Zoltan_RB_Build_Structure.
[3] Zoltan ERROR in rcb_fn (line 440 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb.c):  Error returned from Zoltan_RCB_Build_Structure.
[3] Zoltan ERROR in Zoltan_LB (line 388 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/lb_balance.c):  Partitioning routine returned code -2.
Zoltan_Malloc (from /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c,89) No space on proc 4 - number of bytes requested = 116655624
[4] Zoltan ERROR in Zoltan_RB_Build_Structure (line 92 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c):  Insufficient memory.
[4] Zoltan ERROR in Zoltan_RCB_Build_Structure (line 91 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb_util.c):  Error returned from Zoltan_RB_Build_Structure.
[4] Zoltan ERROR in rcb_fn (line 440 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb.c):  Error returned from Zoltan_RCB_Build_Structure.
[4] Zoltan ERROR in Zoltan_LB (line 388 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/lb_balance.c):  Partitioning routine returned code -2.
Zoltan_Malloc (from /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c,89) No space on proc 5 - number of bytes requested = 116655624
[5] Zoltan ERROR in Zoltan_RB_Build_Structure (line 92 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/shared.c):  Insufficient memory.
[5] Zoltan ERROR in Zoltan_RCB_Build_Structure (line 91 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb_util.c):  Error returned from Zoltan_RB_Build_Structure.
[5] Zoltan ERROR in rcb_fn (line 440 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/rcb.c):  Error returned from Zoltan_RCB_Build_Structure.
[5] Zoltan ERROR in Zoltan_LB (line 388 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/lb_balance.c):  Partitioning routine returned code -2.
[-1] Zoltan ERROR in Zoltan_RB_Box_Assign (line 77 of /data2/cda/model/ESMA-Baselibs/esmf/src/Infrastructure/Mesh/src/Zoltan/box_assign.c):  No Decomposition Data available; use KEEP_CUTS parameter.
... (the [-1] Zoltan ERROR lines repeat indefinitely)

My debugging attempts

The first thing I did was run ulimit -s unlimited, which gives:

(cda_suite) [cda@measures1 scratch]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 31172
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
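
(For completeness: since ulimit settings are only inherited by child processes of the shell where they are set, I effectively ran the following in the same session on this single node.)

ulimit -s unlimited
/data2/cda/pkg/openmpi-4.1.4-gcc12.1.0/bin/mpirun -np 6 ./GEOSgcm.x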

Then I reran GEOSgcm; the Zoltan error no longer appears, but the program still stops at:

 EXTDATA: Updating R bracket for TR_LAI_FRAC
   EXTDATA:  ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc

My guess

  1. Considering my machine only has 8 GB of memory, is the GEOSgcm failure related to running out of memory, given Zoltan's error message? (A rough way to check this is sketched below.)
  2. Is there a way for me to change the number of processes used to run GEOSgcm, so that the total memory needed by the run can be reduced?
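
(A rough, hypothetical way to check guess 1 is to watch the resident memory of each rank from another terminal while GEOSgcm.x runs; the process name is taken from the run command above.)

# refresh every 5 seconds, largest resident set first
watch -n 5 'ps -C GEOSgcm.x -o pid,rss,pmem,comm --sort=-rss'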

Full log

You can find the full log via this Dropbox link: test1_err.log

gmao-cda commented 2 years ago

Log after running makeoneday.bash

[cda@measures1 test2]$ /data2/cda/work/TinyBCs-GitV10/scripts/makeoneday.bash
Using Git v10 directories
Making one-day experiment
 TINY:      TRUE

/data2/cda/work/TinyBCs-GitV10/scripts/makeoneday.bash: line 792: type: colordiff: not found
Restoring AGCM.rc.save to AGCM.rc...
Copying AGCM.rc to AGCM.rc.save...
Restoring CAP.rc.save to CAP.rc...
Copying CAP.rc to CAP.rc.save...
Restoring input.nml.save to input.nml...
Copying input.nml to input.nml.save...
Restoring MOM_input.save to MOM_input...
Copying MOM_input to MOM_input.save...
Restoring HISTORY.rc.save to HISTORY.rc...
Copying HISTORY.rc to HISTORY.rc.save...
Restoring gcm_run.j.save to gcm_run.j...
Copying gcm_run.j to gcm_run.j.save...
Restoring regress/gcm_regress.j.save to regress/gcm_regress.j...
Copying regress/gcm_regress.j to regress/gcm_regress.j.save...
Restoring RC/GEOS_ChemGridComp.rc.save to RC/GEOS_ChemGridComp.rc...
Copying RC/GEOS_ChemGridComp.rc to RC/GEOS_ChemGridComp.rc.save...
Restoring RC/GAAS_GridComp.rc.save to RC/GAAS_GridComp.rc...
Copying RC/GAAS_GridComp.rc to RC/GAAS_GridComp.rc.save...
DYN_INTERNAL_RESTART_TYPE not found. Assuming NC4
Found fvcore_internal_rst. Assuming you have needed restarts!
Changes made to CAP.rc:
9,10c9,10
< JOB_SGMT:     00000015 000000
< NUM_SGMT:     20
---
> JOB_SGMT:     00000001 000000
> NUM_SGMT:     1
25c25
< MAPL_ENABLE_TIMERS: NO
---
> MAPL_ENABLE_TIMERS: YES

Running on unknown nodes with 8 cores per node

        NX from AGCM.rc                 (original): 1
        NY from AGCM.rc                 (original): 6
Num of PEs from AGCM.rc                 (original): 6
Num of nodes from AGCM.rc             (calculated): 1
Num of io nodes from AGCM.rc            (original): 0
Num of nodes from AGCM.rc with ioserver (original): 1

Final number of nodes with ioserver   (calculated): 1
Using minimal boundary datasets
Changes made to gcm_run.j:
7c7
< #SBATCH --time=12:00:00
---
> #SBATCH --time=0:15:00
11a12
> #SBATCH --mail-type=ALL
293,294c294,295
< setenv BCSDIR    /ford1/share/gmao_SIteam/ModelData/bcs/Icarus-NLv3/Icarus-NLv3_Reynolds
< setenv CHMDIR    /ford1/share/gmao_SIteam/ModelData/fvInput_nc3
---
> setenv BCSDIR   /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/bcs/Icarus-NLv3
> setenv CHMDIR   /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/chem
299,301c300,302
< setenv ABCSDIR  /ford1/share/gmao_SIteam/ModelData/aogcm/atmosphere_bcs/Icarus-NLv3/MOM6/CF0012x6C_TM0072xTM0036
< setenv OBCSDIR  /ford1/share/gmao_SIteam/ModelData/aogcm/ocean_bcs/MOM6/${OGCM_IM}x${OGCM_JM}
< setenv SSTDIR  /ford1/share/gmao_SIteam/ModelData/aogcm/SST/MERRA2/${OGCM_IM}x${OGCM_JM}
---
> setenv ABCSDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/atmosphere_bcs/Icarus-NLv3/MOM6/CF0012x6C_TM0072xTM0036
> setenv OBCSDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/ocean_bcs/MOM6/72x36
> setenv SSTDIR   /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/sst/MOM6/SST/MERRA2/72x36
318c319
< #/bin/ln -s /ford1/share/gmao_SIteam/ModelData/aogcm/MOM6/DC048xPC025_TM0072xTM0036/DC048xPC025_TM0072xTM0036-Pfafstetter.til tile_hist.data
---
> ##/bin/ln -s /ford1/share/gmao_SIteam/ModelData/aogcm/MOM6/DC048xPC025_TM0072xTM0036/DC048xPC025_TM0072xTM0036-Pfafstetter.til tile_hist.data
420c421
< if($numrs == 0) then
---
> if($numrs == 1) then
837a839
> exit

Changes made to AGCM.rc:
797c797
<  CLDMICRO: 2MOMENT
---
>  CLDMICRO: 1MOMENT

Changes made to regress/gcm_regress.j:
7c7
< #SBATCH --time=12:00:00
---
> #SBATCH --time=0:20:00

     You seem to be using the MOM6 and portable BCs
     Setting cap_restart to be in 2000

Restoring cap_restart.save to cap_restart...
Copying cap_restart to cap_restart.save...
Changes made to cap_restart:
1c1
< 20210416 000000
---
gmao-cda commented 2 years ago

I tried to run the same test case on my home desktop, which has a 4-core CPU but still 8 GB of memory, with ESMA-Baselibs v6.2.14, GEOSgcm v10.22.3, GCC 11.2, and Open MPI 4.1.2.

But I got the same error as on the ESSIC server.

   EXTDATA: Updating  bracket for TR_LAI_FRAC
   EXTDATA:  ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node jojo exited on signal 9 (Killed).
--------------------------------------------------------------------------

@sanAkel I'm wondering if you can provide me a log of a successful run, so that I know what the next step after EXTDATA: ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc should be.

If the next step is still EXTDATA:, then the problem is probably with the file ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc.
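
(A quick, hypothetical sanity check on that file, assuming the path is relative to the run directory:)

ls -lh ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
ncdump -h ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc | head -n 40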

sanAkel commented 2 years ago

@sanAkel I'm wondering if you can provide me a log of a successful run, so that I know what the next step after EXTDATA: ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc should be.

@gmao-cda see attached log file of a healthy run. test5.log

gmao-cda commented 2 years ago

@sanAkel got it. Thanks! It seems the next step in your successful run is:

EXTDATA: INFO: Updating L bracket for TR_LAI_FRAC
        EXTDATA: INFO:  ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
        EXTDATA: INFO: Updating R bracket for TR_LAI_FRAC
        EXTDATA: INFO:  ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
 Real*4 Resource Parameter: ALPHA:.000000
 Real*4 Resource Parameter: BETA:1.000000
 Real*4 Resource Parameter: ALPHAQ:.000000
 Real*4 Resource Parameter: BETAQ:1.000000
 Real*4 Resource Parameter: ALPHAO:.000000
 Real*4 Resource Parameter: BETAO:1.000000
 Real*4 Resource Parameter: TAUANL:21600.000000
 Integer*4 Resource Parameter: IS_FCST:0
 Integer*4 Resource Parameter: CONSTRAIN_DAS:1
 Character Resource Parameter: ANA_IS_WEIGHTED:NO
 Character Resource Parameter: EXCLUDE_ADVECTION_TRACERS:ALWAYS
 FV3 is Advecting the following    48 tracers:
sanAkel commented 2 years ago

I don't know if things are ordered in a specific sequence for each and every one of these reads. (We use parallel reads and writes.)

sanAkel commented 2 years ago

@gmao-cda You could perhaps do this:

Edit your CAP.rc file.

Line 7: the last 6 digits of "JOB_SGMT:" are formatted as "hhmmss".

Try with this:

JOB_SGMT: 00000000 010000 ⬅️ run for 1 hour.
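
(For reference, an annotated sketch of those two CAP.rc lines; the NUM_SGMT value is taken from the one-day setup above, and the trailing "#" comments are only illustrative:)

JOB_SGMT: 00000000 010000   # segment length: 8-digit date-style duration (00000001 was one day above), then hhmmss -> 1 hour here
NUM_SGMT: 1                 # number of segments to run in the job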

mathomp4 commented 2 years ago

@gmao-cda Let's try running without extdata. You can do:

makeoneday.bash noext

and that will disable extdata.

Note that with 6 cores the model can take a while to get past the ExtData step because some of these files are dumb big now.

sanAkel commented 2 years ago

@gmao-cda you can do both of ⬆️

mathomp4 commented 2 years ago

Note: I have no idea about the Zoltan error. But your ulimit -s unlimited is a good thing to have. Without that, GEOS probably would go nuts.

sanAkel commented 2 years ago

sigh what a model system we have!

mathomp4 commented 2 years ago

If it can run with ExtData off, then I might need to consult other SI Team members and maybe ESMF to see if maybe there is a way to lessen the memory/disk bandwidth/whatever burden at that step.

Of course our ExtData guru is out for 10 days or so, so we might instead selectively turn off Emissions if there are some @sanAkel says aren't needed for your work.

sanAkel commented 2 years ago

Let's see what @gmao-cda finds out ...

@mathomp4 I like the idea of turning it off and making our cheap exp even more lightweight!

As is, this resolution is for debugging, PR testing and now, training! For our work, at this resolution, we can totally ignore all emissions!

gmao-cda commented 2 years ago

Thank you very much @mathomp4 and @sanAkel !

With makeoneday.bash noext, it got past the location of the original error and started integrating. Now I can see the integration:

NOTE from PE     0: callTree:             <--- tracer_hordiff()
NOTE from PE     0: callTree:             o finished tracer advection/diffusion (step_MOM)
NOTE from PE     0: callTree:             o finished calculate_diagnostic_fields (step_MOM)
NOTE from PE     0: callTree:          <--- DT cycles (step_MOM)
NOTE from PE     0: callTree:          o calling extract_surface_state (step_MOM)
NOTE from PE     0: callTree:          ---> extract_surface_state(), MOM.F90
h-point: mean=   9.3292863637473520E+00 min=  -1.8574099561546535E+00 max=   3.0265007406085243E+01 Post extract_sfc SST
h-point: c=     49436 Post extract_sfc SST
h-point: mean=   2.1727010049882391E+01 min=   0.0000000000000000E+00 max=   3.8932200628525912E+01 Post extract_sfc SSS
h-point: c=     44217 Post extract_sfc SSS
h-point: mean=  -1.0647376536857116E-01 min=  -1.1429856567481016E+00 max=   1.1068748770296981E+00 Post extract_sfc sea_lev
h-point: c=     87829 Post extract_sfc sea_lev
h-point: mean=   8.9762258578014347E+00 min=   0.0000000000000000E+00 max=   2.4773197056906201E+02 Post extract_sfc Hml
h-point: c=     44407 Post extract_sfc Hml
u-point: mean=   4.5827435856190596E-03 min=  -1.9871638106354789E-01 max=   2.7519616525679330E-01 u Post extract_sfc SSU
u-point: c=     52478 u Post extract_sfc SSU
v-point: mean=   4.7672146971151656E-03 min=  -1.4410887283895848E-01 max=   2.1435483746411085E-01 v Post extract_sfc SSU
v-point: c=     51062 v Post extract_sfc SSU
h-point: mean=   1.2301660250061899E+02 min=   0.0000000000000000E+00 max=   1.2935082730802389E+05 Post extract_sfc frazil
h-point: c=       307 Post extract_sfc frazil
NOTE from PE     0: callTree:          <--- extract_surface_sfc_state()
NOTE from PE     0: callTree:       <--- step_MOM()
NOTE from PE     0: callTree:    <--- update_ocean_model()

I guess this is OK. During graduate school I didn't even know NASA made all of this running model code public. Maybe I can document all these steps and write a short user guide.

Then graduate students and researchers interested in using the NASA model could build it on their own machines with this guide. :)

Thank you again!

mathomp4 commented 2 years ago

sigh what a model system we have!

Well, back in the GOCART1 days, we could run with GOCART.data. Unfortunately, GOCART.data is broken in GOCART 2... I think? Not sure; I wasn't sure how to run it that way when we moved.

But if I can figure that out again, that should be a good way to make portable GEOS a bit more portable.

mathomp4 commented 2 years ago

Let's see what @gmao-cda finds out ...

@mathomp4 I like the idea of turning it off and making our cheap exp even more lightweight!

As is, this resolution is for debugging, PR testing and now, training! For our work, at this resolution, we can totally ignore all emissions!

Well, I have to do it in our CI runs for PRs. I tried ExtData and the VM just sort of said "No". Might be that 8 GB of memory isn't enough for ExtData!

Still, I might ask our group if maybe there is a way to do a "low-memory" mode of ExtData. We might have assumptions that 16 GB of memory is available, etc. and we just load things in such a way that is faster, but resource intensive.

mathomp4 commented 2 years ago

It got past the location of the original error and started integrating. Now I can see the integration

Huzzah! I guess the next step is to see if you can write out History files and checkpoints, which are the other two big IO burdens in GEOS.

(Note we can also turn both of those off as well, though at that point you are just doing performance tuning since, well, the model might be producing garbage and you'd never know! 😄 )

gmao-cda commented 2 years ago

@mathomp4 and @sanAkel Finished now!

...
----HIST                                                  1     0.004   0.00     0.004   0.00
----EXTDATA                                               1     0.004   0.00     0.004   0.00
GEOSgcm Run Status: 0

YEAH!!!!

mathomp4 commented 2 years ago

@mathomp4 and @sanAkel Finished now!

...
----HIST                                                  1     0.004   0.00     0.004   0.00
----EXTDATA                                               1     0.004   0.00     0.004   0.00
GEOSgcm Run Status: 0

YEAH!!!!

Nice! Did it produce any history output? If you ran an hour only, I'm not sure if the MOM6 history produces anything yet. @sanAkel would know

Looks like a 6 hour run would be needed? So you might do:

makeoneday.bash noext 6hr

which would set you up for a 6-hour run with no extdata.

gmao-cda commented 2 years ago

Nice! Did it produce any history output? If you ran an hour only, I'm not sure if the MOM6 history produces anything yet.

@mathomp4 I didn't change the integration time.

Here is my directory:

AGCM.rc             fvcore_internal_rst  input.nml               plot
AGCM.rc.save        fvcore_layout.rc     input.nml.save          post
archive             gcm_emip.setup       lake_internal_rst       RC
CAP.rc              gcm_run.j            landice_internal_rst    regress
CAP.rc.save         gcm_run.j.save       linkbcs                 RESTART
cap_restart         geosgcm_prog         logging.yaml            restarts
cap_restart.save    GEOSgcm.x            moist_internal_rst      scratch
catch_internal_rst  gocart_internal_rst  MOM_input               seaicethermo_internal_rst
convert             HISTORY.rc           MOM_input.save          tr_internal_rst
data_table          HISTORY.rc.save      MOM_override
diag_table          holding              openwater_internal_rst
forecasts           __init__.py          pchem_internal_rst

Oh! Santha told me to comment out all the collections in COLLECTIONS in HISTORY.rc. Maybe I should leave some in for the test?
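
(To illustrate, a hypothetical fragment of the COLLECTIONS list in HISTORY.rc with one collection left active and another commented out; the names geosgcm_prog and geosgcm_budi are just taken from this experiment:)

COLLECTIONS: 'geosgcm_prog'
            #'geosgcm_budi'
             ::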

Will now run 6-hour test run

sanAkel commented 2 years ago

Still, I might ask our group if maybe there is a way to do a "low-memory" mode of ExtData. We might have assumptions that 16 GB of memory is available, etc. and we just load things in such a way that is faster, but resource intensive.

Please do @mathomp4

sanAkel commented 2 years ago

Nice! Did it produce any history output? If you ran an hour only, I'm not sure if the MOM6 history produces anything yet.

@mathomp4 I didn't change the integration time.

Here is my directory:

AGCM.rc             fvcore_internal_rst  input.nml               plot
AGCM.rc.save        fvcore_layout.rc     input.nml.save          post
archive             gcm_emip.setup       lake_internal_rst       RC
CAP.rc              gcm_run.j            landice_internal_rst    regress
CAP.rc.save         gcm_run.j.save       linkbcs                 RESTART
cap_restart         geosgcm_prog         logging.yaml            restarts
cap_restart.save    GEOSgcm.x            moist_internal_rst      scratch
catch_internal_rst  gocart_internal_rst  MOM_input               seaicethermo_internal_rst
convert             HISTORY.rc           MOM_input.save          tr_internal_rst
data_table          HISTORY.rc.save      MOM_override
diag_table          holding              openwater_internal_rst
forecasts           __init__.py          pchem_internal_rst

Oh! Santha told me to comment out all the collections in COLLECTIONS in HISTORY.rc. Maybe I should leave some in for the test?

Will now run 6-hour test run

Great! @gmao-cda 😄

Look inside scratch/ for output. I forget what we did with your diag_table; it is in the experiment dir (one level below scratch, if your tree grows up!). @mathomp4 I usually don't care for "GEOS" output; MOM output is good enough! :)

gmao-cda commented 2 years ago

These are the .nc files I have under scratch:
(cda_suite) [cda@measures1 scratch]$ ls -al *.nc*
-rw-r--r--. 1 cda domain users    68499 Aug 24 19:52 forcing.nc
lrwxrwxrwx. 1 cda domain users       97 Aug 24 19:50 MAPL_Tripolar.nc -> /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/ocean_bcs/MOM6/72x36/MAPL_Tripolar.nc
-rw-r--r--. 1 cda domain users 16977076 Aug 24 19:51 MOM_IC.nc
-rw-r--r--. 1 cda domain users   417332 Aug 24 19:51 ocean_geometry.nc
-rw-r--r--. 1 cda domain users   127584 Aug 24 19:52 ocean.stats.nc
-rw-r--r--. 1 cda domain users  1843093 Aug 24 19:52 prog_z.nc
-rw-r--r--. 1 cda domain users   135285 Aug 24 19:52 sfc_ave.nc
-rw-r--r--. 1 cda domain users    69360 Aug 24 19:51 test5.geosgcm_budi.20000416_0300z.nc4
-rw-r--r--. 1 cda domain users    69383 Aug 24 19:52 test5.geosgcm_budi.20000416_0600z.nc4
-rw-r--r--. 1 cda domain users  1613478 Aug 24 19:52 test5.geosgcm_prog.20000416_0600z.nc4
-rw-r--r--. 1 cda domain users    12753 Aug 24 19:51 Vertical_coordinate.nc

My diag_table is:
1900 1 1 0 0 0
#"scalar",   1,"days",1,"days","Time",
#"layer",    1,"days",1,"days","Time",
#"prog",     1,"days",1,"days","Time",
"prog_z",   -1,"months",1,"days","Time",
#"ave_prog", 1,"days",1,"days","Time",
#"tracer",   1,"days",1,"days","Time",
#"cont",     1,"days",1,"days","Time",
##"mom",     5,"days",1,"days","Time",
##"bt_mom",  5,"days",1,"days","Time",
#"visc",     1,"days",1,"days","Time",
##"energy",  5,"days",1,"days","Time",
"forcing",  -1,"months",1,"days","Time",
#"surface",  1,"months",1,"days","Time",
"sfc_ave",  -1,"months",1,"days","Time",

...
Checking the ocean model output prog_z.nc:
ncdump -h prog_z.nc
netcdf prog_z {
dimensions:
    xq = 72 ;
    yh = 36 ;
    z_l = 34 ;
    z_i = 35 ;
    Time = UNLIMITED ; // (1 currently)
    nv = 2 ;
    xh = 72 ;
    yq = 36 ;
variables:
    double xq(xq) ;
        xq:long_name = "q point nominal longitude" ;
        xq:units = "degrees_east" ;
        xq:cartesian_axis = "X" ;
    double yh(yh) ;
        yh:long_name = "h point nominal latitude" ;
        yh:units = "degrees_north" ;
        yh:cartesian_axis = "Y" ;
    double z_l(z_l) ;
        z_l:long_name = "Depth at cell center" ;
        z_l:units = "meters" ;
        z_l:cartesian_axis = "Z" ;
        z_l:positive = "down" ;
        z_l:edges = "z_i" ;
    double z_i(z_i) ;
        z_i:long_name = "Depth at interface" ;
        z_i:units = "meters" ;
        z_i:cartesian_axis = "Z" ;
        z_i:positive = "down" ;
    double Time(Time) ;
        Time:long_name = "Time" ;
        Time:units = "days since 1900-01-01 00:00:00" ;
        Time:cartesian_axis = "T" ;
        Time:calendar_type = "JULIAN" ;
        Time:calendar = "JULIAN" ;
        Time:bounds = "Time_bnds" ;

Seems there's no problem!

What an interesting debugging adventure! Nothing could be better than a successful model run in an apartment where the A/C stopped working. Thank you again @mathomp4 and @sanAkel!

sanAkel commented 2 years ago

@gmao-cda

Oops! That (A/C) is beyond me, and I am sure beyond even @mathomp4! Hope you have a working fan! Hang in there for the night and hopefully tomorrow it's fixed.

We can tag up later to go over the output and the structure of scratch/.

Can you please close this issue?

gmao-cda commented 2 years ago

Thank you very much, @mathomp4! Could I ask one last question before I close this issue?

Currently the default GEOSgcm run on the ESSIC server uses 6 processes.

  1. I want to control the total number of processes used to run GEOSgcm, for example with 4 processes. After checking makeoneday.bash, I tried the flags nxy 2 2:
    makeoneday.bash noext 6hr nxy 2 2
    gcm_run.j

But this raised the following error when I ran gcm_run.j:

 Starting Threads :            1

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x2b4f617993ff in ???
#0  0x2b1cb7b403ff in ???
#1  0x25be319 in __fv_mp_mod_MOD_domain_decomp
    at /data2/cda/model/GEOSgcm/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/@FVdycoreCubed_GridComp/@fvdycore/tools/fv_mp_mod.F90:487
#1  0x25be319 in __fv_mp_mod_MOD_domain_decomp
    at /data2/cda/model/GEOSgcm/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/@FVdycoreCubed_GridComp/@fvdycore/tools/fv_mp_mod.F90:487
#2  0x276c08d in run_setup
...

I then changed the decomposition to 1x4:

makeoneday.bash noext 6hr nxy 1 4

GEOSgcm then failed to integrate.

Since the original run uses 1x6, I then tried nxy 1 3, so that the new NY=3 divides the original NY=6:

makeoneday.bash noext 6hr nxy 1 3

But it still failed.

Could you please give me some guidance on how to set a valid nxy? Are these the numbers of processes in each direction for the global domain? Thank you!

mathomp4 commented 2 years ago

@gmao-cda GEOS has a requirement that 6 divide NY, because of the cubed-sphere dynamics (6 faces). So the smallest layout you can ever do is 1x6. Our old lat-lon core used to be able to do 1x1, but with FV3, 1x6 is the smallest you can do, I'm afraid.

And if you scale up, NY needs to be 6, 12, 18, etc.

There are a few other things, like you must always have at least 4 cells on a face of the cube, so you can't run, say, C12 on 1000 cores because it's too sparse, or run layouts like 4x4000, etc.

That said, it is weird we don't just trap that right away in our gcm_run script. I'll make an issue about that. I mean, we can check what NY is and die super-fast before the model runs if mod(NY,6) != 0.
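
(For illustration, a minimal csh sketch of such a guard; it assumes gcm_run.j has already parsed NY from AGCM.rc into a shell variable $NY, which is not shown here:)

# hypothetical early sanity check for the decomposition
@ rem = $NY % 6
if ($rem != 0) then
   echo "ERROR: NY = $NY must be a multiple of 6 (cubed sphere has 6 faces)"
   exit 1
endif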

gmao-cda commented 2 years ago

@mathomp4 Thank you very much, not only for telling me how to set nxy but also for the underlying principles of why it needs to be set that way! You are a great teacher!

I'm also grateful to @sanAkel, who taught me how to run GEOSgcm!

I will close the issue now.