ekluzek opened this issue 2 years ago
@ekluzek in case it's relevant: I wrote instructions in this google doc for creating a high-res sparse grid mesh file for #1773. Search for "mesh" in the doc to find the relevant section.
Very briefly: 1) To start, we need a file containing 1D or 2D latitude and longitude variables for the grid of interest. If such a file exists, I would be happy to try to generate the mesh file. 2) If the lat/lon file also includes a mask of the sparse grid, we would then run mesh_mask_modifier to get that mask into the mesh file.
Using the new mesh_modifier tool, I was able to get a mesh file from the domain file. The mesh file for the atm forcing is different, though: it's not a modified 2D grid but a simple list of 400 points. So I need to create a SCRIP grid file that describes that list of points from the domain file, and then convert it into ESMF mesh format.
So this is what you need to do:
ncks --rgr infer --rgr scrip=scrip.nc <file_with_lat_lon_2d>.nc foo.nc
(where foo.nc contains metadata and will not be used)
module load esmf
ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
(This mesh file’s mask = 1 everywhere)
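A quick sanity check on the result (a sketch, assuming the NCO and netCDF command-line tools are available; the output filename mask_check.nc is arbitrary): sum elementMask and confirm it equals elementCount, i.e. the mask really is 1 everywhere.
ncap2 -O -v -s 'mask_total=elementMask.total();' lnd_mesh.nc mask_check.nc
ncdump mask_check.nc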
Awesome, thanks @slevisconsulting, the above helped me get mesh files created. I got everything set up Friday, but when I run the case it's failing, so I need to debug what's happening and get a working case. The mesh files I created are in...
/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17
Hopefully, the crash I'm seeing is something simple I can figure out.
The crash has to do with the connectivity on the forcing grid, which is just a list of the 400 points. The suggestion from ESMF is to make each forcing grid point's vertices extend just a tiny bit around the cell center.
Given the time this project is taking, I also plan to bring this work to master as a user-mod and add a test for it.
We talked about this at the standup this morning. An idea from there was to try it with the new land mesh but without the atm forcing mesh. I tried that and it works. So there's something going on with the new forcing mesh that has only the 400 points, which I did suspect.
OK, I got a case to work! I couldn't use ncks to make the SCRIP grid file, as it would "correct" my vertices to turn the grid into a regular one. I was able to use curvilinear_to_SCRIP inside NCL to write out a SCRIP grid file that I could then convert to a working mesh file. Using unstructured_to_ESMF inside NCL didn't generate a mesh that I could use. One clue in the final mesh file: nodeCount was 1600 (4x the number of points [400]), which shows that all of the points are isolated from each other. The mesh files that did NOT work all had a smaller total node count, which means cells shared nodes with each other.
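For reference, a quick way to check a candidate mesh for this (a sketch, assuming ncdump is available): with 400 fully disconnected quadrilaterals, nodeCount should be exactly 4x elementCount.
ncdump -h lnd_mesh.nc | grep -E 'elementCount|nodeCount'
(Expect elementCount = 400 and nodeCount = 1600; anything smaller means cells share nodes.)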
@ekluzek I understand that you used curvilinear_to_SCRIP instead of ncks, but I didn't follow what you ran to go from SCRIP to a successful MESH.
Seems like this connects with #1919 too. In general we need better documentation on how to do this.
@ekluzek and I met to compare notes (this issue #1731 vs. discussion #1919):
@ekluzek and @slevis-lmwg I tried running a sparse grid simulation using the steps we talked about on the call today. I got a bunch of ESMF errors. It seems like I should be following what @ekluzek did above?
I'm not sure how to do what you did above, Erik. Do you remember the steps you took?
My log files for the case can be found here:
/glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000/run
It looks like the issue is in datm, from the PET file, as you point out. So one thing to check would be whether just the change for MASK_MESH works. I think it should, so that would be good to try.
Another thing to try would be the datm mesh file I created:
/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/360x720_gswp3.0v1.c170606v4.dense400_ESMFmesh_c20221012.nc
which does differ from your file.
From reading the above, I ran into trouble just using ncks, because it would change the vertices on me, so I couldn't use files created with it. I think that might explain the warnings we saw when we worked on this that showed issues at the south pole.
(By the way, the unclear messages from ESMF are another example of error checking that doesn't help you figure out the problem. This might be a place where better error checking could help us figure out what's wrong.)
You can also try the land mesh file I created...
/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc
NOTE: For this you would use it for ATM_DOMAIN_MESH and LND_DOMAIN_MESH, and leave MASK_MESH as it was before.
OR -- you would reverse the mask and use it for MASK_MESH, and leave LND_DOMAIN_MESH/ATM_DOMAIN_MESH as they were.
I set it up changing ATM_DOMAIN_MESH/LND_DOMAIN_MESH because that is what made sense to me (sketched below). But, as we saw when talking with @slevis-lmwg, it's more general and simpler to swap out the MASK_MESH file.
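For example (a sketch of the ATM_DOMAIN_MESH/LND_DOMAIN_MESH option, using the mesh file listed above), from the case directory:
./xmlchange ATM_DOMAIN_MESH=/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc
./xmlchange LND_DOMAIN_MESH=/glade/work/erik/ctsm_worktrees/main_dev/cime_config/usermods_dirs/sparse_grid400_f19_mv17/fv1.9x2.5_sparse400_181205v4_ESMFmesh_c20220929.nc
with MASK_MESH left as it was.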
That mesh file is also different from yours, and it's not just the mask. So again maybe there was something going on with the ncks conversion?
@adrifoster I do see, four posts up, that I wrote, "We think that his ncks attempt failed because he applied it to a 'domain' file," followed by a suggestion for how to resolve it. So the "domain" shortcut probably did not work.
If things continue to fail, let's meet again and go through the full process, as I recommend it above. Let's plan on an hour.
> It looks like the issue is in datm, from the PET file as you point out. So one thing to check would be to see if just the change for the MASK_MESH works. I think that should, so that would be good to try.
Okay, I removed the info in user_nl_datm_streams and submitted that (so the only thing that was changed was MASK_MESH). That failed for a different reason...
56: Which are NaNs = F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F T F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
56: F F F F F F F F F F F F F F F F F F F F F F F F
56: NaN found in field Sl_lfrin at gridcell index 88
56: ERROR: ERROR: One or more of the CTSM cap export_1D fields are NaN
87: # of NaNs = 3
87: Which are NaNs = F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F T F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F T
87: T F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F F
87: F F F F F F F F F F F F F F F F F F F F F F F F
87: NaN found in field Sl_lfrin at gridcell index 60
87: NaN found in field Sl_lfrin at gridcell index 111
87: NaN found in field Sl_lfrin at gridcell index 112
Will try your next suggestion next.
That still failed... thanks @slevis-lmwg for joining the meeting with @mvertens this afternoon. @ekluzek do you want to join as well?
@adrifoster I will post my process here. Afterwards we can clean it up and post in #1919.
1) Starting with this file: /glade/u/home/forrest/ppe_representativeness/output_v4/clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc
generate landmask.nc:
a) In matlab (% marks comments):
rcent = ncread('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','rcent');
landmask = round(~isnan(rcent)); % "~" means "not"
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','landmask',landmask);
nccreate('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props','Dimensions',{'lon',144,'lat',96})
ncwrite('clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc','mod_lnd_props',landmask);
OOPS... Somehow I had write permissions and was able to add the new variables directly to the clusters file. I hope I didn't mess anything up for anyone!
mv clusters.clm51_PPEn02ctsm51d021_2deg_GSWP3V1_leafbiomassesai_PPE3_hist.annual+sd.400.nc landmask.nc
ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
/glade/u/apps/ch/opt/esmf-netcdf/8.0.0/intel/19.0.5/bin/bing/Linux.intel.64.mpiuni.default/ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
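A sanity check worth running on landmask.nc (a sketch, assuming NCO's ncap2 is available; the output name chk.nc is arbitrary): the mask should sum to 400, one per sparse-grid point.
ncap2 -O -v -s 'n_active=landmask.total();' landmask.nc chk.nc
ncdump chk.nc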
@adrifoster with the above, I am still attempting a shortcut: I generated the mesh file directly from the landmask file, in order to skip the mesh_mask_modifier step.
In the run where you point to the default atmosphere drivers (not the sparse version), set the three mesh paths to /glade/scratch/slevis/temp_work/sparse_grid/lnd_mesh.nc
@adrifoster says that the above seems to have worked, so we will next try to apply the sparse grid on the datm data, as well.
From discussing with @olyson:
The 1D datm domain file came from combining the 2D gswp3 data and the dense400 mask. So... in matlab I will take the 1D mask from the domain file
/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc
and put it in a 2D mask in a 2D gswp3 file
/glade/p/cgd/tss/CTSM_datm_forcing_data/atm_forcing.datm7.GSWP3.0.5d.v1.1.c181207/TPHWL/clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc
That will be our starting file for the rest of the steps.
In matlab:
longxy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LONGXY');
latixy = ncread('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','LATIXY');
lon = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
lat = ncread('../../../people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');
mask = zeros(720,360);
for cell = 1:400
% find the half-degree (i,j) cell whose center matches this sparse point
[i,j] = find(latixy==lat(cell) & longxy==lon(cell));
mask(i,j) = 1;
end
In a copy of the atmosphere file, so as to avoid overwriting the original, still in matlab:
nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask','Dimensions',{'lon',720,'lat',360})
ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','landmask',mask);
nccreate('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props','Dimensions',{'lon',720,'lat',360})
ncwrite('clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc','mod_lnd_props',mask);
After matlab
mv clmforc.GSWP3.c2011.0.5x0.5.TPQWL.2014-01.nc landmask.nc
ncks --rgr infer --rgr scrip=scrip.nc landmask.nc foo.nc
/glade/u/apps/ch/opt/esmf-netcdf/8.0.0/intel/19.0.5/bin/bing/Linux.intel.64.mpiuni.default/ESMF_Scrip2Unstruct scrip.nc lnd_mesh.nc 0
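Note what this produces (a quick check; a sketch assuming ncdump is available): the mesh still carries every half-degree element, and only the mask marks the 400 points.
ncdump -h lnd_mesh.nc | grep elementCount
(Expect elementCount = 259200 = 720 x 360, not 400.)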
@adrifoster this mesh, intended for datm, is located here:
/glade/scratch/slevis/temp_work/sparse_grid/datm_mesh
You already have the other one, but I should mention that I moved it here:
/glade/scratch/slevis/temp_work/sparse_grid/land_mesh
@adrifoster I hope the run still works when you now point to the dense400 datm files and the datm mesh that I generated (previous post).
Unfortunately that did not work.
see logs /glade/scratch/afoster/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/run
case is /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest
Thoughts: 1) I think I see the problem:
mask = 1
according to the sparse grid (and 0 elsewhere), BUT for all the grid cells of the 360x720 domain (so it has 259200 elements).
2) So the next thing I would suggest, in case it works:
Point to the default global data (as in the case that worked) but point to the datm_mesh/lnd_mesh.nc file that I generated, rather than the default datm mesh. If NUOPC does with datm and the mesh what it does with fsurdat files and the land mesh, then this should work.
3) Another idea (going back to the 400-element datm data) might be to take the datm_mesh/lnd_mesh.nc that I generated and come up with a 400 element version of it that contains only the data for mask = 1. This may or may not be easy. I could play with the file in matlab and see what I get. I don't know if I need to end up with exactly the same order of elements that appears in the datm files. I think NUOPC is smart enough to figure out the mapping, but maybe not...
4) @adrifoster if (2) doesn't work and you need to start simulations soon or now, I suggest starting them with the configuration that worked, because it may take longer to resolve this latest problem. I will wait to hear whether (2) works and, if so, whether it's an acceptable solution before I start on (3).
Okay thanks! I am testing now whether adding just:
CLMGSWP3v1.TPQW:mapalgo = nn
CLMGSWP3v1.TPQW:meshfile = /glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc
CLMGSWP3v1.Solar:mapalgo = nn
CLMGSWP3v1.Solar:meshfile = /glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc
CLMGSWP3v1.Precip:mapalgo = nn
CLMGSWP3v1.Precip:meshfile = /glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh.nc
to user_nl_datm_streams will work.
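(A sketch, assuming a standard CIME case directory: one way to confirm the overrides were picked up is to regenerate the namelists and inspect the resulting streams file before submitting.)
./preview_namelists
grep lnd_mesh.nc CaseDocs/datm.streams.xml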
This seems to be working! Though it is quite slow... I'm not sure why
Hmm, does it seem slower than not using the datm_mesh file?
Yes, the intention is for the file to make things faster, because it's doing less reading and no interpolation. If it's slower, that's defeating the purpose of doing this part.
Actually, hold that thought... I think I forgot to turn debug off.
Okay so it didn't really seem to change the timing much. In fact the one that used the mesh file for the DATM data was a bit slower:
Timing File for Case WITH DATM Mesh:
cat cesm_timing.ctsm51FATES_SP_OAAT_Control_2000_nuopctest.3834956.chadmin1.ib0.cheyenne.ucar.edu.231012-162352
---------------- TIMING PROFILE ---------------------
Case : ctsm51FATES_SP_OAAT_Control_2000_nuopctest
LID : 3834956.chadmin1.ib0.cheyenne.ucar.edu.231012-162352
Machine : cheyenne
Caseroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest
Timeroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest/Tools
User : afoster
Curr Date : Thu Oct 12 17:02:58 2023
Driver : CMEPS
grid : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
compset : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
run type : startup, continue_run = TRUE (inittype = FALSE)
stop option : nyears, stop_n = 10
run length : 3650 days (3650.0 for ocean)
component comp_pes root_pe tasks x threads instances (stride)
--------- ------ ------- ------ ------ --------- ------
cpl = cpl 72 36 72 x 1 1 (1 )
atm = datm 36 0 36 x 1 1 (1 )
lnd = clm 72 36 72 x 1 1 (1 )
ice = sice 72 36 72 x 1 1 (1 )
ocn = socn 72 36 72 x 1 1 (1 )
rof = srof 72 36 72 x 1 1 (1 )
glc = sglc 72 36 72 x 1 1 (1 )
wav = swav 72 36 72 x 1 1 (1 )
esp = sesp 1 0 1 x 1 1 (1 )
total pes active : 108
mpi tasks per node : 36
pe count for cost estimate : 108
Overall Metrics:
Model Cost: 7.01 pe-hrs/simulated_year
Model Throughput: 369.95 simulated_years/day
Init Time : 6.237 seconds
Run Time : 2335.434 seconds 0.640 seconds/day
Final Time : 1.449 seconds
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components
TOT Run Time: 2335.434 seconds 0.640 seconds/mday 369.95 myears/wday
CPL Run Time: 282.852 seconds 0.077 seconds/mday 3054.60 myears/wday
ATM Run Time: 1108.467 seconds 0.304 seconds/mday 779.45 myears/wday
LND Run Time: 1308.653 seconds 0.359 seconds/mday 660.22 myears/wday
ICE Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
OCN Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ROF Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
GLC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
WAV Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 1110.305 seconds 0.304 seconds/mday 778.16 myears/wday
NOTE: min:max driver timers (seconds/day):
CPL (pes 36 to 107)
ATM (pes 0 to 35)
LND (pes 36 to 107)
ICE (pes 36 to 107)
OCN (pes 36 to 107)
ROF (pes 36 to 107)
GLC (pes 36 to 107)
WAV (pes 36 to 107)
ESP (pes 0 to 0)
Timing file for case WITHOUT DATM mesh:
cat cesm_timing.ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim.3833019.chadmin1.ib0.cheyenne.ucar.edu.231012-153640
---------------- TIMING PROFILE ---------------------
Case : ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim
LID : 3833019.chadmin1.ib0.cheyenne.ucar.edu.231012-153640
Machine : cheyenne
Caseroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim
Timeroot : /glade/work/afoster/FATES_calibration/FATES_SP_OAAT/ctsm51FATES_SP_OAAT_Control_2000_nuopctest_noclim/Tools
User : afoster
Curr Date : Thu Oct 12 16:14:10 2023
Driver : CMEPS
grid : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
compset : 2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV_SESP
run type : startup, continue_run = TRUE (inittype = FALSE)
stop option : nyears, stop_n = 10
run length : 3650 days (3650.0 for ocean)
component comp_pes root_pe tasks x threads instances (stride)
--------- ------ ------- ------ ------ --------- ------
cpl = cpl 72 36 72 x 1 1 (1 )
atm = datm 36 0 36 x 1 1 (1 )
lnd = clm 72 36 72 x 1 1 (1 )
ice = sice 72 36 72 x 1 1 (1 )
ocn = socn 72 36 72 x 1 1 (1 )
rof = srof 72 36 72 x 1 1 (1 )
glc = sglc 72 36 72 x 1 1 (1 )
wav = swav 72 36 72 x 1 1 (1 )
esp = sesp 1 0 1 x 1 1 (1 )
total pes active : 108
mpi tasks per node : 36
pe count for cost estimate : 108
Overall Metrics:
Model Cost: 6.72 pe-hrs/simulated_year
Model Throughput: 385.70 simulated_years/day
Init Time : 5.998 seconds
Run Time : 2240.084 seconds 0.614 seconds/day
Final Time : 1.129 seconds
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components
TOT Run Time: 2240.084 seconds 0.614 seconds/mday 385.70 myears/wday
CPL Run Time: 281.403 seconds 0.077 seconds/mday 3070.33 myears/wday
ATM Run Time: 1009.835 seconds 0.277 seconds/mday 855.59 myears/wday
LND Run Time: 1311.185 seconds 0.359 seconds/mday 658.95 myears/wday
ICE Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
OCN Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ROF Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
GLC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
WAV Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 1044.302 seconds 0.286 seconds/mday 827.35 myears/wday
NOTE: min:max driver timers (seconds/day):
CPL (pes 36 to 107)
ATM (pes 0 to 35)
LND (pes 36 to 107)
ICE (pes 36 to 107)
OCN (pes 36 to 107)
ROF (pes 36 to 107)
GLC (pes 36 to 107)
WAV (pes 36 to 107)
ESP (pes 0 to 0)
@adrifoster - what kind of timings were you getting with MCT? @jedwards4b - why are we seeing about 385 ypd when the atm and land are running concurrently? I'm not sure where the cpl comm time is coming in. Can you have a look at this?
At @ekluzek's suggestion I am running a set of tests with MCT vs. NUOPC, full vs. sparse grid, and full DATM vs. subset, to see the differences in timing. Please hold for those results.
In the meantime here is an older timing file I have for an old MCT run:
---------------- TIMING PROFILE ---------------------
Case : ctsm51FATES_PPE_Control_2000
LID : 2267387.chadmin1.ib0.cheyenne.ucar.edu.230725-150029
Machine : cheyenne
Caseroot : /glade/work/afoster/CLM_PPE_FATES/ctsm51FATES_PPE_Control_2000
Timeroot : /glade/work/afoster/CLM_PPE_FATES/ctsm51FATES_PPE_Control_2000/Tools
User : afoster
Curr Date : Tue Jul 25 15:30:36 2023
Driver : CPL7
grid : a%1.9x2.5_l%1.9x2.5_oi%null_r%null_g%null_w%null_z%null_m%gx1v7
compset : 2000_DATM%CRUv7_CLM51%FATES-SP_SICE_SOCN_SROF_SGLC_SWAV_SIAC_SESP
run type : startup, continue_run = FALSE (inittype = TRUE)
stop option : nyears, stop_n = 10
run length : 3650 days (3649.9791666666665 for ocean)
component comp_pes root_pe tasks x threads instances (stride)
--------- ------ ------- ------ ------ --------- ------
cpl = cpl 72 36 72 x 1 1 (1 )
atm = datm 36 0 36 x 1 1 (1 )
lnd = clm 72 36 72 x 1 1 (1 )
ice = sice 72 36 72 x 1 1 (1 )
ocn = socn 72 36 72 x 1 1 (1 )
rof = srof 72 36 72 x 1 1 (1 )
glc = sglc 72 36 72 x 1 1 (1 )
wav = swav 72 36 72 x 1 1 (1 )
iac = siac 1 0 1 x 1 1 (1 )
esp = sesp 1 0 1 x 1 1 (1 )
total pes active : 108
mpi tasks per node : 36
pe count for cost estimate : 108
Overall Metrics:
Model Cost: 5.29 pe-hrs/simulated_year
Model Throughput: 489.81 simulated_years/day
Init Time : 37.219 seconds
Run Time : 1763.942 seconds 0.483 seconds/day
Final Time : 0.001 seconds
Actual Ocn Init Wait Time : 0.000 seconds
Estimated Ocn Init Run Time : 0.000 seconds
Estimated Run Time Correction : 0.000 seconds
(This correction has been applied to the ocean and total run times)
Runs Time in total seconds, seconds/model-day, and model-years/wall-day
CPL Run Time represents time in CPL pes alone, not including time associated with data exchange with other components
TOT Run Time: 1763.942 seconds 0.483 seconds/mday 489.81 myears/wday
CPL Run Time: 53.851 seconds 0.015 seconds/mday 16044.27 myears/wday
ATM Run Time: 511.348 seconds 0.140 seconds/mday 1689.65 myears/wday
LND Run Time: 1362.955 seconds 0.373 seconds/mday 633.92 myears/wday
ICE Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
OCN Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ROF Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
GLC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
WAV Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
IAC Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
ESP Run Time: 0.000 seconds 0.000 seconds/mday 0.00 myears/wday
CPL COMM Time: 1248.110 seconds 0.342 seconds/mday 692.25 myears/wday
NOTE: min:max driver timers (seconds/day):
CPL (pes 36 to 107)
ATM (pes 0 to 35)
LND (pes 36 to 107)
ICE (pes 36 to 107)
OCN (pes 36 to 107)
ROF (pes 36 to 107)
GLC (pes 36 to 107)
WAV (pes 36 to 107)
IAC (pes 0 to 0)
ESP (pes 0 to 0)
CPL:CLOCK_ADVANCE 0.005: 0.006
CPL:LNDPREP 0.001: 0.014
CPL:C2L <----> 0.003: 0.012
CPL:LND_RUN 0.179: 0.373
CPL:L2C 7.073: 559.467
CPL:LNDPOST 0.000: 0.000
CPL:FRACSET 0.000: 0.001
CPL:ATM_RUN 0.136: 0.140
CPL:A2C <----> 0.089: 0.342
CPL:ATMPOST 0.000: 0.000
CPL:RESTART 0.000: 0.000
CPL:HISTORY 0.000: 0.000
CPL:TSTAMP_WRITE 0.000: 0.000
CPL:TPROF_WRITE 0.000: 0.000
CPL:RUN_LOOP_BSTOP 0.000: 0.000
More info on coupler timing:
CPL:LNDPREP 0.001: 0.014
CPL:lndprep_atm2lnd 0.000: 0.013
CPL:lndprep_mrgx2l 0.000: 0.000
CPL:LNDPOST 0.000: 0.000
CPL:ATMPOST 0.000: 0.000
So what I see with the above is that the cost and throughput overall improve by just 4%. That comes from about a 10% improvement in DATM.
Since the difference in the DATM gridpoints should be about a quarter of a million vs. 400, I would expect a MUCH larger difference for DATM. So I think we must NOT really be coupling with the 400-point grid but with a half-degree grid. It's faster because there are fewer points to read and interpolate from. But it must be taking the 400 points and doing nearest-neighbor interpolation to the f19 grid, which it then gives to CLM. What we want to pass between DATM and CLM in the CPL is just the 400 points. If we can get it set up that way, that should make a marked difference in timing.
> Since the difference in the DATM gridpoints should be about a quarter of a million vs. 400, I would expect a MUCH larger difference for DATM. So I think we must NOT really be coupling with the 400-point grid but with a half-degree grid.

Do you mean the case where we use the full DATM files but the mesh is the 400-pt lnd_mesh?
It's actually faster for the one where we don't use the updated DATM mesh. Or am I reading that wrong?
The top case is the one with the 400-point mesh for DATM and CLM, right? And the second one is 400 points for CLM but the full grid for DATM. And with that I had them exactly backwards. Thanks for the correction!
The top case has...
Model Cost: 7.01 pe-hrs/simulated_year
Model Throughput: 369.95 simulated_years/day
ATM Run Time: 1108.467 seconds 0.304 seconds/mday 779.45 myears/wday
And the second has:
Model Cost: 6.72 pe-hrs/simulated_year
Model Throughput: 385.70 simulated_years/day
ATM Run Time: 1009.835 seconds 0.277 seconds/mday 855.59 myears/wday
So the top is both more expensive and has slower throughput, overall and even for DATM.
I totally don't get why the top is slower than the bottom. You might run the same thing a few times to see what the machine spread looks like. It could be that the difference is small enough that machine randomness happened to come into play here. If you can, double-check your cases to make sure, and then maybe we should all look at them.
(Moral: always show your work...)
On having the coupler only send the 400 points rather than the full f19 grid: we'd need to set it up with a mesh that only includes the 400 points. That's a bit different from how we were thinking about this here....
Okay, so now my runs with NUOPC on the sparse grid are failing again...
Is this because I'm using an older CTSM tag?
/glade/scratch/afoster/nuopc_grid_fullDATM/run
and /glade/scratch/afoster/nuopc_grid_400DATM/run
@adrifoster I'm surprised that any version works with that mesh file - it's NetCDF-4 and needs to be converted to a CESM-compatible format. I've done that in /glade/scratch/jedwards/PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.20231013_131729_eze25v/run/lnd_mesh.nc
Oh that's weird, they did work before... trying with your update, thank you!
@adrifoster I am running for 1 year and see a run rate of 450.37 ypd at the beginning of the run, but there is a steady decline over the course of the run and it's only 52.47 ypd at the end - have you had that same experience?
I have not noticed that. Are you running with fates_sp=.true.? In the user_nl_clm file?
That made a huge difference - thank you!
My two cases are /glade/scratch/jedwards/PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.20231013_131729_eze25v
and /glade/scratch/jedwards/PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel.20231013_133543_9gysi0
The first one (eze25v) is using the new lnd_mesh.nc files. The other (9gysi0) is using the default mesh files. Otherwise the two cases are the same.
Performance is markedly different:
Model Cost: 7.21 pe-hrs/simulated_year
Model Throughput: 599.37 simulated_years/day
vs
Model Cost: 38.53 pe-hrs/simulated_year
Model Throughput: 156.98 simulated_years/day
Also, sorry, my other case (nuopc_grid_400DATM) failed; I think I need to do the same thing you did to the lnd_mesh.nc file to the DATM mesh I reference in my user_nl_datm_streams file.
What did you do to make the file compatible?
@adrifoster a simple way to convert files away from NetCDF-4 is to use nccopy. I usually convert to CDF5 format:
nccopy -k cdf5 <in.nc> <out.nc>
You can also check a NetCDF file's format with the "-k" option:
ncdump -k <file.nc>
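For instance (a sketch; the filenames are the mesh files discussed above, and the _cdf5 output suffix is arbitrary), both meshes can be converted in one pass:
for f in lnd_mesh.nc lnd_mesh_400.nc; do
  nccopy -k cdf5 "$f" "${f%.nc}_cdf5.nc"
done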
Okay, here are the results:

driver | land grid | DATM grid | cost (pe-hrs/sim-yr) | throughput (sim-yrs/day)
---|---|---|---|---
mct | full | full DATM | 53.73 | 659.32
mct | grid | full DATM | 6.2 | 418.35
mct | grid | grid DATM | 3.56 | 676.56
nuopc | full | full DATM | 67.502 | 525.9
nuopc | grid | full DATM | 6.61 | 391.9
nuopc | grid | grid DATM | 6.73 | 385.34
So the gridded DATM in MCT seems to cut the cost almost in half.
But for NUOPC we aren't seeing that yet, likely because, as @ekluzek said, we are still reading in the full DATM data.
I modified the mesh file from global, with 259200 elements and mask = 1 at the 400 cells of the sparse grid, to a 400-element vector, the same as the domain and datm files that @adrifoster was using in the mct version of this work.
In matlab, I read variables from the global datm_mesh/lnd_mesh.nc file and the 400-element domain file...
elementMask = ncread('lnd_mesh.nc','elementMask');
elementArea = ncread('lnd_mesh.nc','elementArea');
centerCoords = ncread('lnd_mesh.nc','centerCoords');
numElementConn = ncread('lnd_mesh.nc','numElementConn');
elementConn = ncread('lnd_mesh.nc','elementConn');
nodeCoords = ncread('lnd_mesh.nc','nodeCoords');
xc = ncread('/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','xc');
yc = ncread('/glade/p/cgd/tss/people/oleson/modify_domain/domain.lnd.360x720_gswp3.0v1.c170606.dense400.nc','yc');
for cell = 1:400
% find the global-mesh element whose center matches this sparse point
[i] = find(centerCoords(1,:) == xc(cell) & centerCoords(2,:) == yc(cell));
centerCoords_new(:,cell) = centerCoords(:,i);
elementArea_new(cell,:) = elementArea(i,:);
elementConn_new(:,cell) = elementConn(:,i);
elementMask_new(cell,:) = elementMask(i,:);
nodeCoords_new(:,cell) = nodeCoords(:,i);
numElementConn_new(cell) = numElementConn(i);
end
nccreate('lnd_mesh_400.nc','centerCoords','Dimensions',{'coordDim', 2, 'elementCount', 400})
ncwrite('lnd_mesh_400.nc','centerCoords',centerCoords_new)
nccreate('lnd_mesh_400.nc','elementArea','Dimensions',{'elementCount', 400})
ncwrite('lnd_mesh_400.nc','elementArea',elementArea_new);
nccreate('lnd_mesh_400.nc','elementConn','Dimensions',{'maxNodePElement', 4, 'elementCount', 400})
ncwrite('lnd_mesh_400.nc','elementConn',elementConn_new);
nccreate('lnd_mesh_400.nc','elementMask','Dimensions',{'elementCount', 400})
ncwrite('lnd_mesh_400.nc','elementMask',elementMask_new);
nccreate('lnd_mesh_400.nc','nodeCoords','Dimensions',{'coordDim', 2, 'nodeCount', 400})
ncwrite('lnd_mesh_400.nc','nodeCoords',nodeCoords_new);
nccreate('lnd_mesh_400.nc','numElementConn','Dimensions',{'elementCount', 400})
ncwrite('lnd_mesh_400.nc','numElementConn',numElementConn_new);
New file is /glade/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh_400.nc. @adrifoster please try this new file. This reaches my limit for now; if this one does not work, I will wait until Wednesday's meeting before trying anything else.
Remembered to convert to cdf5 after I read the following comments...
@adrifoster you will need to convert this before using it in cesm:
ncdump -k /glade/cheyenne/scratch/slevis/temp_work/sparse_grid/datm_mesh/lnd_mesh_400.nc
netCDF-4 classic model
@slevis-lmwg I'm pretty sure that there is an option in matlab to create cdf5 (NETCDF3_64bit_data) files.
@adrifoster just a suggestion: for testing performance there is a PFS test that we use. You might set up one of those and compare it to your case, to do similar things.
The one thing that test does that I know about is turning off history and restart writing (by only running 20 days). This is helpful for reducing machine randomness, which can be high for I/O. The PFS test might do other things as well.
Then of course all the simulations would need to be run over again. I expect the end result to look similar, but it's good to do...
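For example (a sketch, reusing the PFS testname visible in the case paths above), from cime/scripts:
./create_test PFS.f19_g17.2000_DATM%GSWP3v1_CLM51%FATES_SICE_SOCN_SROF_SGLC_SWAV.cheyenne_intel
Since PFS runs only 20 days with history and restart writing off, the timings it reports are less subject to I/O randomness.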
We need a mesh file that can be used with the NUOPC coupler for the sparse grid.
Here's a sample case for the MCT coupler:
/glade/work/oleson/PPE.n11_ctsm5.1.dev030/cime/scripts/ctsm51c8BGC_PPEn11ctsm51d030_2deg_GSWP3V1_Sparse400_Control_2000