makeoneday.bash
[cda@measures1 test2]$ /data2/cda/work/TinyBCs-GitV10/scripts/makeoneday.bash
Using Git v10 directories
Making one-day experiment
TINY: TRUE
/data2/cda/work/TinyBCs-GitV10/scripts/makeoneday.bash: line 792: type: colordiff: not found
Restoring AGCM.rc.save to AGCM.rc...
Copying AGCM.rc to AGCM.rc.save...
Restoring CAP.rc.save to CAP.rc...
Copying CAP.rc to CAP.rc.save...
Restoring input.nml.save to input.nml...
Copying input.nml to input.nml.save...
Restoring MOM_input.save to MOM_input...
Copying MOM_input to MOM_input.save...
Restoring HISTORY.rc.save to HISTORY.rc...
Copying HISTORY.rc to HISTORY.rc.save...
Restoring gcm_run.j.save to gcm_run.j...
Copying gcm_run.j to gcm_run.j.save...
Restoring regress/gcm_regress.j.save to regress/gcm_regress.j...
Copying regress/gcm_regress.j to regress/gcm_regress.j.save...
Restoring RC/GEOS_ChemGridComp.rc.save to RC/GEOS_ChemGridComp.rc...
Copying RC/GEOS_ChemGridComp.rc to RC/GEOS_ChemGridComp.rc.save...
Restoring RC/GAAS_GridComp.rc.save to RC/GAAS_GridComp.rc...
Copying RC/GAAS_GridComp.rc to RC/GAAS_GridComp.rc.save...
DYN_INTERNAL_RESTART_TYPE not found. Assuming NC4
Found fvcore_internal_rst. Assuming you have needed restarts!
Changes made to CAP.rc:
9,10c9,10
< JOB_SGMT: 00000015 000000
< NUM_SGMT: 20
---
> JOB_SGMT: 00000001 000000
> NUM_SGMT: 1
25c25
< MAPL_ENABLE_TIMERS: NO
---
> MAPL_ENABLE_TIMERS: YES
Running on unknown nodes with 8 cores per node
NX from AGCM.rc (original): 1
NY from AGCM.rc (original): 6
Num of PEs from AGCM.rc (original): 6
Num of nodes from AGCM.rc (calculated): 1
Num of io nodes from AGCM.rc (original): 0
Num of nodes from AGCM.rc with ioserver (original): 1
Final number of nodes with ioserver (calculated): 1
Using minimal boundary datasets
Changes made to gcm_run.j:
7c7
< #SBATCH --time=12:00:00
---
> #SBATCH --time=0:15:00
11a12
> #SBATCH --mail-type=ALL
293,294c294,295
< setenv BCSDIR /ford1/share/gmao_SIteam/ModelData/bcs/Icarus-NLv3/Icarus-NLv3_Reynolds
< setenv CHMDIR /ford1/share/gmao_SIteam/ModelData/fvInput_nc3
---
> setenv BCSDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/bcs/Icarus-NLv3
> setenv CHMDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/chem
299,301c300,302
< setenv ABCSDIR /ford1/share/gmao_SIteam/ModelData/aogcm/atmosphere_bcs/Icarus-NLv3/MOM6/CF0012x6C_TM0072xTM0036
< setenv OBCSDIR /ford1/share/gmao_SIteam/ModelData/aogcm/ocean_bcs/MOM6/${OGCM_IM}x${OGCM_JM}
< setenv SSTDIR /ford1/share/gmao_SIteam/ModelData/aogcm/SST/MERRA2/${OGCM_IM}x${OGCM_JM}
---
> setenv ABCSDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/atmosphere_bcs/Icarus-NLv3/MOM6/CF0012x6C_TM0072xTM0036
> setenv OBCSDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/ocean_bcs/MOM6/72x36
> setenv SSTDIR /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/sst/MOM6/SST/MERRA2/72x36
318c319
< #/bin/ln -s /ford1/share/gmao_SIteam/ModelData/aogcm/MOM6/DC048xPC025_TM0072xTM0036/DC048xPC025_TM0072xTM0036-Pfafstetter.til tile_hist.data
---
> ##/bin/ln -s /ford1/share/gmao_SIteam/ModelData/aogcm/MOM6/DC048xPC025_TM0072xTM0036/DC048xPC025_TM0072xTM0036-Pfafstetter.til tile_hist.data
420c421
< if($numrs == 0) then
---
> if($numrs == 1) then
837a839
> exit
Changes made to AGCM.rc:
797c797
< CLDMICRO: 2MOMENT
---
> CLDMICRO: 1MOMENT
Changes made to regress/gcm_regress.j:
7c7
< #SBATCH --time=12:00:00
---
> #SBATCH --time=0:20:00
You seem to be using the MOM6 and portable BCs
Setting cap_restart to be in 2000
Restoring cap_restart.save to cap_restart...
Copying cap_restart to cap_restart.save...
Changes made to cap_restart:
1c1
< 20210416 000000
---
I tried to run the same test case on my home desktop, which has a 4-core CPU but still 8 GB of RAM, with ESMA-Baselibs v6.2.14, GEOSgcm v10.22.3, GCC 11.2, and OpenMPI 4.1.2.
But I got the same error as shown on the ESSIC server.
EXTDATA: Updating bracket for TR_LAI_FRAC
EXTDATA: ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node jojo exited on signal 9 (Killed).
--------------------------------------------------------------------------
@sanAkel I'm wondering if you can provide me a log of the successful run, so that I know what the next step is after "EXTDATA: ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc".
If the next step is still an EXTDATA: line, then the problem is probably with the file ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc.
@gmao-cda see attached log file of a healthy run. test5.log
@sanAkel got it. Thanks! It seems the next step in your successful run is:
EXTDATA: INFO: Updating L bracket for TR_LAI_FRAC
EXTDATA: INFO: ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
EXTDATA: INFO: Updating R bracket for TR_LAI_FRAC
EXTDATA: INFO: ... file processed: ExtData/g5chem/sfc/LAI/lai_x720_y360_v72_t12_2008.nc
Real*4 Resource Parameter: ALPHA:.000000
Real*4 Resource Parameter: BETA:1.000000
Real*4 Resource Parameter: ALPHAQ:.000000
Real*4 Resource Parameter: BETAQ:1.000000
Real*4 Resource Parameter: ALPHAO:.000000
Real*4 Resource Parameter: BETAO:1.000000
Real*4 Resource Parameter: TAUANL:21600.000000
Integer*4 Resource Parameter: IS_FCST:0
Integer*4 Resource Parameter: CONSTRAIN_DAS:1
Character Resource Parameter: ANA_IS_WEIGHTED:NO
Character Resource Parameter: EXCLUDE_ADVECTION_TRACERS:ALWAYS
FV3 is Advecting the following 48 tracers:
I don't know if things are ordered in a specific sequence for each and every one of these reads. (We use parallel reads and writes.)
@gmao-cda You could perhaps do this:
Edit your CAP.rc file.
Line 7: the last 6 digits of "JOB_SGMT:" are "hhmmss" formatted.
Try with this:
JOB_SGMT: 00000000 010000
⬅️ run for 1 hour.
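For reference, the first field appears to be a day count in YYYYMMDD form (the diff above changed it from 00000015, 15 days, to 00000001, 1 day), so a one-hour, single-segment setup would look something like the sketch below; keeping NUM_SGMT at 1 is my assumption, check your CAP.rc:
JOB_SGMT: 00000000 010000
NUM_SGMT: 1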
@gmao-cda Let's try running without extdata. You can do:
makeoneday.bash noext
and that will disable extdata.
Note that with 6 cores the model can take a while to get past the ExtData step because some of these files are dumb big now.
@gmao-cda you can do both of ⬆️
Note: I have no idea about the Zoltan error. But your ulimit -s unlimited is a good thing to have. Without that, GEOS probably would go nuts.
sigh what a model system we have!
If it can run with ExtData off, then I might need to consult other SI Team members and maybe ESMF to see if maybe there is a way to lessen the memory/disk bandwidth/whatever burden at that step.
Of course our ExtData guru is out for 10 days or so, so we might instead selectively turn off Emissions if there are some @sanAkel says aren't needed for your work.
Let's see what @gmao-cda finds out ...
@mathomp4 I like the idea of turning things off and making our cheap exp even more lightweight!
As is, this resolution is for debugging, PR testing and now, training! For our work, at this resolution, we can totally ignore all emissions!
Thank you very much @mathomp4 and @sanAkel !
With makeoneday.bash noext, it passed the location of the original error and started integrating.
Now I can see the integration:
NOTE from PE 0: callTree: <--- tracer_hordiff()
NOTE from PE 0: callTree: o finished tracer advection/diffusion (step_MOM)
NOTE from PE 0: callTree: o finished calculate_diagnostic_fields (step_MOM)
NOTE from PE 0: callTree: <--- DT cycles (step_MOM)
NOTE from PE 0: callTree: o calling extract_surface_state (step_MOM)
NOTE from PE 0: callTree: ---> extract_surface_state(), MOM.F90
h-point: mean= 9.3292863637473520E+00 min= -1.8574099561546535E+00 max= 3.0265007406085243E+01 Post extract_sfc SST
h-point: c= 49436 Post extract_sfc SST
h-point: mean= 2.1727010049882391E+01 min= 0.0000000000000000E+00 max= 3.8932200628525912E+01 Post extract_sfc SSS
h-point: c= 44217 Post extract_sfc SSS
h-point: mean= -1.0647376536857116E-01 min= -1.1429856567481016E+00 max= 1.1068748770296981E+00 Post extract_sfc sea_lev
h-point: c= 87829 Post extract_sfc sea_lev
h-point: mean= 8.9762258578014347E+00 min= 0.0000000000000000E+00 max= 2.4773197056906201E+02 Post extract_sfc Hml
h-point: c= 44407 Post extract_sfc Hml
u-point: mean= 4.5827435856190596E-03 min= -1.9871638106354789E-01 max= 2.7519616525679330E-01 u Post extract_sfc SSU
u-point: c= 52478 u Post extract_sfc SSU
v-point: mean= 4.7672146971151656E-03 min= -1.4410887283895848E-01 max= 2.1435483746411085E-01 v Post extract_sfc SSU
v-point: c= 51062 v Post extract_sfc SSU
h-point: mean= 1.2301660250061899E+02 min= 0.0000000000000000E+00 max= 1.2935082730802389E+05 Post extract_sfc frazil
h-point: c= 307 Post extract_sfc frazil
NOTE from PE 0: callTree: <--- extract_surface_sfc_state()
NOTE from PE 0: callTree: <--- step_MOM()
NOTE from PE 0: callTree: <--- update_ocean_model()
I guess this is OK. During graduate school I didn't even know NASA put all this running model code out in public. Maybe I can document all these things and write a short user guide.
Then graduate students and researchers interested in using the NASA model can build it on their own machines with this guide :)
Thank you again!
sigh what a model system we have!
Well, back in the GOCART1 days, we could run with GOCART.data. Unfortunately, GOCART.data is broken in GOCART 2...I think? Not sure, I wasn't sure how to run when we moved.
But if I can figure that out again, that should be a good way to make portable GEOS a bit more portable.
Let's see what @gmao-cda finds out ...
@mathomp4 I like the idea of turning things off and making our cheap exp even more lightweight!
As is, this resolution is for debugging, PR testing and now, training! For our work, at this resolution, we can totally ignore all emissions!
Well, I have to do it in our CI runs for PRs. I tried ExtData and the VM just sort of said "No". Might be that 8 GB of memory isn't enough for ExtData!
Still, I might ask our group if maybe there is a way to do a "low-memory" mode of ExtData. We might have assumptions that 16 GB of memory is available, etc. and we just load things in such a way that is faster, but resource intensive.
It passed the location of the original error. Now I can see the integration.
Huzzah! I guess the next step is to see if you can write out History files and checkpoints, which are the other two big IO burdens in GEOS.
(Note we can also turn both of those off as well, though at that point you are just doing performance tuning since, well, the model might be producing garbage and you'd never know! 😄 )
@mathomp4 and @sanAkel Finished now!
...
----HIST 1 0.004 0.00 0.004 0.00
----EXTDATA 1 0.004 0.00 0.004 0.00
GEOSgcm Run Status: 0
YEAH!!!!
Nice! Did it produce any history output? If you ran an hour only, I'm not sure if the MOM6 history produces anything yet. @sanAkel would know
Looks like a 6 hour run would be needed? So you might do:
makeoneday.bash noext 6hr
which would set you up for a 6-hour run with no extdata.
Nice! Did it produce any history output? If you ran an hour only, I'm not sure if the MOM6 history produces anything yet.
@mathomp4 I didn't change the integration time.
Here is my directory:
AGCM.rc fvcore_internal_rst input.nml plot
AGCM.rc.save fvcore_layout.rc input.nml.save post
archive gcm_emip.setup lake_internal_rst RC
CAP.rc gcm_run.j landice_internal_rst regress
CAP.rc.save gcm_run.j.save linkbcs RESTART
cap_restart geosgcm_prog logging.yaml restarts
cap_restart.save GEOSgcm.x moist_internal_rst scratch
catch_internal_rst gocart_internal_rst MOM_input seaicethermo_internal_rst
convert HISTORY.rc MOM_input.save tr_internal_rst
data_table HISTORY.rc.save MOM_override
diag_table holding openwater_internal_rst
forecasts __init__.py pchem_internal_rst
Oh! Santha told me to comment out all the components in COLLECTIONS in HISTORY.rc. Maybe I should leave some in for the test?
Will now run a 6-hour test run.
Still, I might ask our group if maybe there is a way to do a "low-memory" mode of ExtData. We might have assumptions that 16 GB of memory is available, etc. and we just load things in such a way that is faster, but resource intensive.
Please do @mathomp4
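Regarding the commented-out COLLECTIONS in HISTORY.rc mentioned above: re-enabling a collection just means uncommenting its entry in that list, leaving the rest commented out with #. A minimal sketch of that block, assuming the usual GEOS HISTORY.rc syntax; geosgcm_prog appears in the directory listing above and geosgcm_budi in the scratch listing further down, any other name would be hypothetical:
COLLECTIONS: 'geosgcm_prog'
             'geosgcm_budi'
             ::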
Great! @gmao-cda 😄
Look inside scratch/ for output. I forgot what we did with your diag_table; it is in the experiment dir (one level below scratch, if your tree grows up!). @mathomp4 I usually don't care for "GEOS" output, MOM output is good enough! :)
scratch
(cda_suite) [cda@measures1 scratch]$ ls -al *.nc*
-rw-r--r--. 1 cda domain users 68499 Aug 24 19:52 forcing.nc
lrwxrwxrwx. 1 cda domain users 97 Aug 24 19:50 MAPL_Tripolar.nc -> /data2/cda/work/TinyBCs-GitV10/scripts/../../TinyBCs-GitV10/ocean_bcs/MOM6/72x36/MAPL_Tripolar.nc
-rw-r--r--. 1 cda domain users 16977076 Aug 24 19:51 MOM_IC.nc
-rw-r--r--. 1 cda domain users 417332 Aug 24 19:51 ocean_geometry.nc
-rw-r--r--. 1 cda domain users 127584 Aug 24 19:52 ocean.stats.nc
-rw-r--r--. 1 cda domain users 1843093 Aug 24 19:52 prog_z.nc
-rw-r--r--. 1 cda domain users 135285 Aug 24 19:52 sfc_ave.nc
-rw-r--r--. 1 cda domain users 69360 Aug 24 19:51 test5.geosgcm_budi.20000416_0300z.nc4
-rw-r--r--. 1 cda domain users 69383 Aug 24 19:52 test5.geosgcm_budi.20000416_0600z.nc4
-rw-r--r--. 1 cda domain users 1613478 Aug 24 19:52 test5.geosgcm_prog.20000416_0600z.nc4
-rw-r--r--. 1 cda domain users 12753 Aug 24 19:51 Vertical_coordinate.nc
diag_table
1900 1 1 0 0 0
#"scalar", 1,"days",1,"days","Time",
#"layer", 1,"days",1,"days","Time",
#"prog", 1,"days",1,"days","Time",
"prog_z", -1,"months",1,"days","Time",
#"ave_prog", 1,"days",1,"days","Time",
#"tracer", 1,"days",1,"days","Time",
#"cont", 1,"days",1,"days","Time",
##"mom", 5,"days",1,"days","Time",
##"bt_mom", 5,"days",1,"days","Time",
#"visc", 1,"days",1,"days","Time",
##"energy", 5,"days",1,"days","Time",
"forcing", -1,"months",1,"days","Time",
#"surface", 1,"months",1,"days","Time",
"sfc_ave", -1,"months",1,"days","Time",
...
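For anyone reading the diag_table: the 1900 1 1 0 0 0 line is the base date (which is why the Time units in prog_z.nc below are "days since 1900-01-01"), and each uncommented entry is an FMS "file" line. If I recall the FMS diag_table format correctly, the fields are file name, output frequency (-1 means write only at the end of the run), frequency units, file format (1 = netCDF), time-axis units, and time-axis name. So, for example,
"prog_z", -1,"months",1,"days","Time",
asks for the prog_z file to be written once, at the end of the run, with a time axis named Time in units of days.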
prog_z.nc
ncdump -h prog_z.nc
netcdf prog_z {
dimensions:
xq = 72 ;
yh = 36 ;
z_l = 34 ;
z_i = 35 ;
Time = UNLIMITED ; // (1 currently)
nv = 2 ;
xh = 72 ;
yq = 36 ;
variables:
double xq(xq) ;
xq:long_name = "q point nominal longitude" ;
xq:units = "degrees_east" ;
xq:cartesian_axis = "X" ;
double yh(yh) ;
yh:long_name = "h point nominal latitude" ;
yh:units = "degrees_north" ;
yh:cartesian_axis = "Y" ;
double z_l(z_l) ;
z_l:long_name = "Depth at cell center" ;
z_l:units = "meters" ;
z_l:cartesian_axis = "Z" ;
z_l:positive = "down" ;
z_l:edges = "z_i" ;
double z_i(z_i) ;
z_i:long_name = "Depth at interface" ;
z_i:units = "meters" ;
z_i:cartesian_axis = "Z" ;
z_i:positive = "down" ;
double Time(Time) ;
Time:long_name = "Time" ;
Time:units = "days since 1900-01-01 00:00:00" ;
Time:cartesian_axis = "T" ;
Time:calendar_type = "JULIAN" ;
Time:calendar = "JULIAN" ;
Time:bounds = "Time_bnds" ;
Seems no problem!
What an interesting debugging adventure! Nothing could be better than a successful model run in an apartment where the A/C stopped working. Thank you again @mathomp4 and @sanAkel!
@gmao-cda
Oops! That (A/C) is beyond me, and I am sure even @mathomp4! Hope you have a working fan! Hang in there for the night and hopefully tomorrow it's fixed.
We can tag up later to go over the output and structure of the scratch/ directory.
Can you please close this issue?
Thank you very much, @mathomp4! Could I ask one last question before I close this issue?
Currently the default geosgcm run on the ESSIC server runs with 6 processes. With makeoneday.bash, I tried to use the flags nxy 2 2:
makeoneday.bash noext 6hr nxy 2 2
But this raised an error when I ran gcm_run.j:
Starting Threads : 1
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
Backtrace for this error:
#0 0x2b4f617993ff in ???
#0 0x2b1cb7b403ff in ???
#1 0x25be319 in __fv_mp_mod_MOD_domain_decomp
at /data2/cda/model/GEOSgcm/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/@FVdycoreCubed_GridComp/@fvdycore/tools/fv_mp_mod.F90:487
#1 0x25be319 in __fv_mp_mod_MOD_domain_decomp
at /data2/cda/model/GEOSgcm/src/Components/@GEOSgcm_GridComp/GEOSagcm_GridComp/GEOSsuperdyn_GridComp/@FVdycoreCubed_GridComp/@fvdycore/tools/fv_mp_mod.F90:487
#2 0x276c08d in run_setup
...
I then changed the decomposition to 1x4:
makeoneday.bash noext 6hr nxy 1 4
Then GEOSgcm failed to integrate.
Since the original run uses 1x6, I changed to nxy 1 3, so that NY goes from 6 to 3, which divides evenly:
makeoneday.bash noext 6hr nxy 1 3
But it still failed.
Could you please give me some guidance on how to set a valid nxy? Are these the numbers of processes in each direction of the global domain?
Thank you!
@gmao-cda GEOS has a requirement that 6 divide NY due to the cubed-sphere dynamics (6 faces). So the smallest you can ever do is 1x6. Our old lat-lon core used to be able to do 1x1, but with FV3, 1x6 is the smallest you can ever do I'm afraid.
And if you scale up, NY needs to be 6, 12, 18, etc.
There are a few other things like you must always have 4 cells on a face of the cube so you can't run, say, C12 on 1000 cores because it's too sparse. Or you can't run things at like 4x4000 etc.
That said, it is weird we don't just trap that right away in our gcm_run script. I'll make an issue about that. I mean, we can check what NY is and die super-fast before the model runs if mod(NY,6) != 0.
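(For the tiny case above, NX = 1 and NY = 6 gives 1 x 6 = 6 PEs, i.e. one MPI rank per cube face, which matches the "Num of PEs ... 6" that makeoneday printed.) A minimal sketch of that early guard, written in the same csh style as gcm_run.j; it assumes NY appears as a plain "NY:" entry in AGCM.rc and is not part of the current script:
# abort before the model starts if the 6-face constraint is violated
set NY = `grep '^ *NY:' AGCM.rc | head -1 | cut -d: -f2`
@ remainder = $NY % 6
if( $remainder != 0 ) then
   echo "ERROR: NY = $NY must be a multiple of 6 for the cubed sphere (6 faces)"
   exit 1
endif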
@mathomp4 Thank you very much for not only telling me how to set nxy, but also for the underlying principles of why it needs to be set like that! You are a great teacher!
I'm also grateful to @sanAkel, who taught me how to run GEOSgcm!
I will close the issue now.
Hi Matt @mathomp4, thank you very much for preparing a test case for me to run on the ESSIC server! @sanAkel kindly taught me how to make a test run with the GEOSgcm this morning, but my GEOSgcm exited with an error when I used 6 processes to run it. Could you give me some suggestions? Thank you!
Building env:
The same env as shown in (https://github.com/GEOS-ESM/GEOSgcm/issues/446), which is
Running command
Error message
My debug trials
The first thing I do is run
ulimit -s unlimited
Then I rerun the geosgcm; the error about Zoltan does not show anymore, but the program still stops at the same place.
My guess
Full log
You can find the full log through this Dropbox link: test1_err.log